qstr abuse in git-cifs
static int cifs_ci_compare(struct dentry *dentry, struct qstr *a,
			   struct qstr *b)
{
	struct nls_table *codepage = CIFS_SB(dentry->d_inode->i_sb)->local_nls;

	if ((a->len == b->len) &&
	    (nls_strnicmp(codepage, a->name, b->name, a->len) == 0)) {
		/*
		 * To preserve case, don't let an existing negative dentry's
		 * case take precedence.  If a is not a negative dentry, this
		 * should have no side effects
		 */
		memcpy(a->name, b->name, a->len);
		return 0;
	}
	return 1;
}

produces

fs/cifs/dir.c: In function 'cifs_ci_compare':
fs/cifs/dir.c:596: warning: passing argument 1 of '__constant_memcpy' discards qualifiers from pointer target type
fs/cifs/dir.c:596: warning: passing argument 1 of '__memcpy' discards qualifiers from pointer target type

I suspect that bad things are happening in there. It's strange for a "comparison" function to go and alter one of the things which it's comparing, too.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
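For comparison, here is a minimal userspace sketch of a case-insensitive compare that has no side effects. The `name_ref` struct and `ci_compare()` are hypothetical stand-ins, not CIFS code, and ASCII-only folding with `tolower()` is an assumption that replaces `nls_strnicmp()` and the mount's codepage. Because the name pointer is const-qualified, a `memcpy()` into it would draw the same kind of "discards qualifiers" warning quoted above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct qstr. */
struct name_ref {
	const unsigned char *name;	/* const: a comparison must not write here */
	size_t len;
};

/*
 * Return 0 when the two names match ignoring ASCII case, 1 otherwise.
 * ASCII-only folding is an assumption; the original code uses
 * nls_strnicmp() with the mount's codepage.
 */
int ci_compare(const struct name_ref *a, const struct name_ref *b)
{
	size_t i;

	if (a->len != b->len)
		return 1;
	for (i = 0; i < a->len; i++) {
		if (tolower(a->name[i]) != tolower(b->name[i]))
			return 1;
	}
	return 0;
}
```

Taking both arguments as pointers to const makes the "compare only" contract visible to the compiler, so any later attempt to rewrite one of the names has to be made explicit somewhere else.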
Re: writeout stalls in current -git
On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote: > On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote: > > On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote: > > > Ok, so it's probably a side effect of the writeback changes. > > > > > > Attached are two patches (two because one was in a separate patchset as > > > a standalone change) that should prevent async writeback from blocking > > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first. > > > Can you see if this fixes the problem? > > > > Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied > > raid5-patch > > Applying your two patches ontop of that does not fix the stalls. > > So you are having RAID5 problems as well? The first 2.6.24-rc1-git-kernel that I patched with your patches did not boot for me. (Oops send in one of my previous mails) But given that the stacktrace was not xfs related and I had seen this patch on the lkml, I tried to fix this Oops this way. I did not have troubles with the RAID5 otherwise. > I'm struggling to understand what possible changed in XFS or writeback that > would lead to stalls like this, esp. as you appear to be removing files when > the stalls occur. Rather than vmstat, can you use something like iostat to > show how busy your disks are? i.e. are we seeing RMW cycles in the raid5 or > some such issue. Will do this this evening. > OOC, what is the 'xfs_info ' output for your filesystem? 
meta-data=/dev/mapper/root isize=256agcount=32, agsize=4731132 blks = sectsz=512 attr=1 data = bsize=4096 blocks=151396224, imaxpct=25 = sunit=0 swidth=0 blks, unwritten=1 naming =version 2 bsize=4096 log =internal bsize=4096 blocks=32768, version=1 = sectsz=512 sunit=0 blks, lazy-count=0 realtime =none extsz=4096 blocks=0, rtextents=0 > > vmstat 10 output from unmerging (uninstalling) a kernel: > > 1 0 0 3512188332 19264400 18512 368 735 10 3 85 > > 1 > > -> emerge starts to remove the kernel source files > > 3 0 0 3506624332 1928360015 9825 2458 8307 7 12 81 > > 0 > > 0 0 0 3507212332 19283600 0 554 630 1233 0 1 99 > > 0 > > 0 0 0 3507292332 19283600 0 537 580 1328 0 1 99 > > 0 > > 0 0 0 3507168332 19283600 0 633 626 1380 0 1 99 > > 0 > > 0 0 0 3507116332 19283600 0 1510 768 2030 1 2 97 > > 0 > > 0 0 0 3507596332 19283600 0 524 540 1544 0 0 99 > > 0 > > procs ---memory-- ---swap-- -io -system-- > > cpu > > r b swpd free buff cache si sobibo in cs us sy id > > wa > > 0 0 0 3507540332 19283600 0 489 551 1293 0 0 99 > > 0 > > 0 0 0 3507528332 19283600 0 527 510 1432 1 1 99 > > 0 > > 0 0 0 3508052332 19284000 0 2088 910 2964 2 3 95 > > 0 > > 0 0 0 3507888332 19284000 0 442 565 1383 1 1 99 > > 0 > > 0 0 0 3508704332 19284000 0 497 529 1479 0 0 99 > > 0 > > 0 0 0 3508704332 19284000 0 594 595 1458 0 0 99 > > 0 > > 0 0 0 3511492332 19284000 0 2381 1028 2941 2 3 95 > > 0 > > 0 0 0 3510684332 19284000 0 699 600 1390 0 0 99 > > 0 > > 0 0 0 3511636332 19284000 0 741 661 1641 0 0 > > 100 0 > > 0 0 0 3524020332 19284000 0 2452 1080 3910 2 3 95 > > 0 > > 0 0 0 3524040332 19284400 0 530 617 1297 0 0 99 > > 0 > > 0 0 0 3524128332 19284400 0 812 674 1667 0 1 99 > > 0 > > 0 0 0 3527000332 19367200 339 721 754 1681 3 2 93 > > 1 > > -> emerge is finished, no dirty or writeback data in /proc/meminfo > > At this point, can you run a "sync" and see how long that takes to > complete? 
Already tried that: http://lkml.org/lkml/2007/11/2/178
See the logs from the second unmerge in the second half of the mail.
The sync did not stop this writeout, but returned immediately.

> The only thing I can think that would be written out after
> this point is inodes, but even then it seems to go on for a long,
> long time and it really doesn't seem like XFS is holding up the
> inode writes.

Yes, I completely agree that this is much too long. That's why I included the after-emerge-finished parts of the logs. But I still partly suspect xfs, because the xfssyncd shows up when I hit SysRq+W.

> Another option is to use blktrace/blkparse to determine which process is
> issuing this I/O.
>
> > 0 0 0 3583780332 19506000 0 494 555 1080 0 1 99 0
> > 0 0
Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing
On Mon, Nov 05, 2007 at 09:06:36PM -0800, Andrew Morton wrote:
> > Any objections to exporting the inode_lock spin lock?
> > If so, how should modules _safely_ access the s_inode list?
>
> That's going to make hch unhappy.

That's going to make me just as unhappy, especially since it's pointless; instead of the entire sorry mess we should just bump sb->s_active to pin the superblock down (we know that it's active at that point, so it's just an atomic_inc(); no games with locking, etc., are needed) and call deactivate_super() on the way out. And deactivate_super() is exported already.
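The pinning idea above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_sb`, `model_pin()` and `model_deactivate()` are hypothetical names, C11 atomics stand in for the kernel's atomic_t, and the real deactivate_super() does considerably more than set a flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a superblock's active count. */
struct model_sb {
	atomic_int s_active;	/* number of active references */
	bool shut_down;		/* set once the last reference is dropped */
};

/* Pin an already-active superblock: just an atomic increment. */
void model_pin(struct model_sb *sb)
{
	atomic_fetch_add(&sb->s_active, 1);
}

/* Drop a reference; whoever drops the last one performs the shutdown. */
void model_deactivate(struct model_sb *sb)
{
	if (atomic_fetch_sub(&sb->s_active, 1) == 1)
		sb->shut_down = true;
}
```

In this model, a pending sillyrename would call model_pin() before starting and model_deactivate() when done, so an unmount that drops its own reference in between cannot tear the superblock down early, and no list walk under inode_lock is needed.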
Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing
On Sat, 03 Nov 2007 07:09:25 -0400 Steve Dickson <[EMAIL PROTECTED]> wrote:

> The following patch stops NFS sillyname renames and umounts from racing.

(appropriate cc's added)

> I have a test script that does the following:
>   1) start nfs server
>   2) mount loopback
>   3) open file in background
>   4) remove file
>   5) stop nfs server
>   6) kill -9 process which has file open
>   7) restart nfs server
>   8) umount loopback mount.
>
> After umount I got the "VFS: Busy inodes after unmount" message
> because the processing of the rename has not finished.
>
> Below is a patch that uses the new silly_count mechanism to
> synchronize sillyname processing and umounts. The patch introduces a
> nfs_put_super() routine that waits until the nfsi->silly_count count
> is zero.
>
> A side-effect of finding and waiting for all the inodes to
> finish the sillyname processing is that I need to traverse
> the sb->s_inodes list in the super block. To do that
> safely the inode_lock spin lock has to be held. So for
> modules to be able to "see" that lock I needed to
> EXPORT_SYMBOL_GPL() it.
>
> Any objections to exporting the inode_lock spin lock?
> If so, how should modules _safely_ access the s_inode list?
>
> steved.
>
>
> Author: Steve Dickson <[EMAIL PROTECTED]>
> Date: Wed Oct 31 12:19:26 2007 -0400
>
>     Close an unlink/sillyname rename and umount race by adding a
>     nfs_put_super routine that will run through all the inodes
>     currently on the super block, waiting for those that are
>     in the middle of a sillyname rename or removal.
>
>     This patch stops the infamous "VFS: Busy inodes after unmount... "
>     warning during umounts.
>
>     Signed-off-by: Steve Dickson <[EMAIL PROTECTED]>
>
> diff --git a/fs/inode.c b/fs/inode.c
> index ed35383..da9034a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
>   * the i_state of an inode while it is in use..
>   */
>  DEFINE_SPINLOCK(inode_lock);
> +EXPORT_SYMBOL_GPL(inode_lock);

That's going to make hch unhappy.

Your email client is performing space-stuffing. See http://mbligh.org/linuxdocs/Email/Clients/Thunderbird

>  static struct file_system_type nfs_fs_type = {
>  	.owner = THIS_MODULE,
> @@ -223,6 +225,7 @@ static const struct super_operations nfs_sops = {
>  	.alloc_inode = nfs_alloc_inode,
>  	.destroy_inode = nfs_destroy_inode,
>  	.write_inode = nfs_write_inode,
> +	.put_super = nfs_put_super,
>  	.statfs = nfs_statfs,
>  	.clear_inode = nfs_clear_inode,
>  	.umount_begin = nfs_umount_begin,
> @@ -1767,6 +1770,30 @@ static void nfs4_kill_super(struct super_block *sb)
>  	nfs_free_server(server);
>  }
>
> +void nfs_put_super(struct super_block *sb)

This was (correctly) declared to be static. We should define it that way too (I didn't know you could do this, actually).

> +{
> +	struct inode *inode;
> +	struct nfs_inode *nfsi;
> +	/*
> +	 * Make sure there are no outstanding renames
> +	 */
> +relock:
> +	spin_lock(&inode_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		nfsi = NFS_I(inode);
> +		if (atomic_read(&nfsi->silly_count) > 0) {
> +			/* Keep this inode around during the wait */
> +			atomic_inc(&inode->i_count);
> +			spin_unlock(&inode_lock);
> +			wait_event(nfsi->waitqueue,
> +				atomic_read(&nfsi->silly_count) == 1);
> +			iput(inode);
> +			goto relock;
> +		}
> +	}
> +	spin_unlock(&inode_lock);
> +}

That's an O(n^2) search. If it is at all possible to hit a catastrophic slowdown in here, you can bet that someone out there will indeed hit it in real life.

I'm too lazy to look, but we might need to check things like I_FREEING and I_CLEAR before taking a ref on this inode.
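To make the O(n^2) remark concrete, here is a small userspace model of the "wait, then restart from the head" pattern. It is a sketch only: `restart_scan_visits()` and the `busy[]` array are hypothetical, an array stands in for the sb->s_inodes list, and clearing a flag stands in for the wait_event()/iput() pair.

```c
#include <stddef.h>

/*
 * Hypothetical model: busy[i] != 0 means entry i still needs a wait.
 * Returns how many entries were looked at before one full pass found
 * nothing left to wait for.
 */
size_t restart_scan_visits(int *busy, size_t n)
{
	size_t visits = 0;
	size_t i;

relock:
	for (i = 0; i < n; i++) {
		visits++;
		if (busy[i]) {
			busy[i] = 0;	/* stands in for wait_event() + iput() */
			goto relock;	/* the list may have changed: start over */
		}
	}
	return visits;
}
```

With all n entries busy, the k-th pass looks at k entries before restarting, and a final clean pass looks at all n, for n(n+1)/2 + n visits in total. Four busy entries already cost 14 visits, and the quadratic term dominates as the list grows.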
Re: writeout stalls in current -git
On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote: > On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote: > > Ok, so it's probably a side effect of the writeback changes. > > > > Attached are two patches (two because one was in a separate patchset as > > a standalone change) that should prevent async writeback from blocking > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first. > > Can you see if this fixes the problem? > > Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied raid5-patch > Applying your two patches ontop of that does not fix the stalls. So you are having RAID5 problems as well? I'm struggling to understand what possible changed in XFS or writeback that would lead to stalls like this, esp. as you appear to be removing files when the stalls occur. Rather than vmstat, can you use something like iostat to show how busy your disks are? i.e. are we seeing RMW cycles in the raid5 or some such issue. OOC, what is the 'xfs_info ' output for your filesystem? 
> vmstat 10 output from unmerging (uninstalling) a kernel: > 1 0 0 3512188332 19264400 18512 368 735 10 3 85 1 > -> emerge starts to remove the kernel source files > 3 0 0 3506624332 1928360015 9825 2458 8307 7 12 81 0 > 0 0 0 3507212332 19283600 0 554 630 1233 0 1 99 0 > 0 0 0 3507292332 19283600 0 537 580 1328 0 1 99 0 > 0 0 0 3507168332 19283600 0 633 626 1380 0 1 99 0 > 0 0 0 3507116332 19283600 0 1510 768 2030 1 2 97 0 > 0 0 0 3507596332 19283600 0 524 540 1544 0 0 99 0 > procs ---memory-- ---swap-- -io -system-- cpu > r b swpd free buff cache si sobibo in cs us sy id wa > 0 0 0 3507540332 19283600 0 489 551 1293 0 0 99 0 > 0 0 0 3507528332 19283600 0 527 510 1432 1 1 99 0 > 0 0 0 3508052332 19284000 0 2088 910 2964 2 3 95 0 > 0 0 0 3507888332 19284000 0 442 565 1383 1 1 99 0 > 0 0 0 3508704332 19284000 0 497 529 1479 0 0 99 0 > 0 0 0 3508704332 19284000 0 594 595 1458 0 0 99 0 > 0 0 0 3511492332 19284000 0 2381 1028 2941 2 3 95 0 > 0 0 0 3510684332 19284000 0 699 600 1390 0 0 99 0 > 0 0 0 3511636332 19284000 0 741 661 1641 0 0 100 > 0 > 0 0 0 3524020332 19284000 0 2452 1080 3910 2 3 95 0 > 0 0 0 3524040332 19284400 0 530 617 1297 0 0 99 0 > 0 0 0 3524128332 19284400 0 812 674 1667 0 1 99 0 > 0 0 0 3527000332 19367200 339 721 754 1681 3 2 93 1 > -> emerge is finished, no dirty or writeback data in /proc/meminfo At this point, can you run a "sync" and see how long that takes to complete? The only thing I can think that woul dbe written out after this point is inodes, but even then it seems to go on for a long, long time and it really doesn't seem like XFS is holding up the inode writes. Another option is to use blktrace/blkparse to determine which process is issuing this I/O. > 0 0 0 3583780332 19506000 0 494 555 1080 0 1 99 0 > 0 0 0 3584352332 19506000 099 347 559 0 0 99 0 > 0 0 0 3585232332 19506000 011 301 621 0 0 99 0 > -> disks go idle. > > So these patches do not seem to be the source of these excessive disk > writes... 
Well, the patches I posted should prevent blocking in the places that it was seen, so if that does not stop the slowdowns then either the writeback code is not feeding us inodes fast enough or the block device below is having some kind of problem.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: Problem with accessing namespace_sem from LSM.
On Tue, 06 Nov 2007 13:00:41 +0900 Tetsuo Handa <[EMAIL PROTECTED]> wrote:

> Hello.
>
> I found that accessing namespace_sem from security_inode_create()
> causes lockdep warning when compiled with CONFIG_PROVE_LOCKING=y .
>
>

sounds like you have an AB-BA deadlock...

--
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, visit http://www.lesswatts.org
Problem with accessing namespace_sem from LSM.
Hello.

I found that accessing namespace_sem from security_inode_create() causes a lockdep warning when compiled with CONFIG_PROVE_LOCKING=y .

===
[ INFO: possible circular locking dependency detected ]
---
klogd/1798 is trying to acquire lock:
 (&namespace_sem){}, at: [] _aa_perm_dentry+0x80/0x184 [apparmor]

but task is already holding lock:
 (&inode->i_mutex){--..}, at: [] mutex_lock+0x12/0x15

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&inode->i_mutex){--..}:
       [] lock_acquire+0x4b/0x6a
       [] __mutex_lock_slowpath+0xb0/0x1f6
       [] mutex_lock+0x12/0x15
       [] graft_tree+0x5c/0xd4
       [] do_add_mount+0x84/0x100
       [] do_mount+0x602/0x659
       [] sys_mount+0x64/0x9b
       [] sysenter_past_esp+0x56/0x99

-> #0 (&namespace_sem){}:
       [] lock_acquire+0x4b/0x6a
       [] down_read+0x1e/0x31
       [] _aa_perm_dentry+0x80/0x184 [apparmor]
       [] aa_perm_dentry+0x62/0xa4 [apparmor]
       [] apparmor_inode_create+0x40/0x63 [apparmor]
       [] vfs_create+0x84/0x13e
       [] open_namei+0x169/0x635
       [] do_filp_open+0x20/0x36
       [] do_sys_open+0x40/0xbb
       [] sys_open+0x16/0x18
       [] sysenter_past_esp+0x56/0x99

other info that might help us debug this:

1 lock held by klogd/1798:
 #0: (&inode->i_mutex){--..}, at: [] mutex_lock+0x12/0x15

stack backtrace:
 [] show_trace+0xd/0x10
 [] dump_stack+0x19/0x1b
 [] print_circular_bug_tail+0x59/0x64
 [] __lock_acquire+0x7ea/0x973
 [] lock_acquire+0x4b/0x6a
 [] down_read+0x1e/0x31
 [] _aa_perm_dentry+0x80/0x184 [apparmor]
 [] aa_perm_dentry+0x62/0xa4 [apparmor]
 [] apparmor_inode_create+0x40/0x63 [apparmor]
 [] vfs_create+0x84/0x13e
 [] open_namei+0x169/0x635
 [] do_filp_open+0x20/0x36
 [] do_sys_open+0x40/0xbb
 [] sys_open+0x16/0x18
 [] sysenter_past_esp+0x56/0x99

If this warning is true, AppArmor shipped with OpenSuSE 10.1 and 10.2 is affected.

- Kernel 2.6.16.53-0.16 for OpenSuSE 10.1 -

do_add_mount() { /* in fs/namespace.c */
    down_write(&namespace_sem);
    graft_tree() {
        mutex_lock(&nd->dentry->d_inode->i_mutex);
        ...
        mutex_unlock(&nd->dentry->d_inode->i_mutex);
    }
    up_write(&namespace_sem);
}

open_namei() { /* in fs/namei.c */
    mutex_lock(&dir->d_inode->i_mutex);
    vfs_create() {
        security_inode_create() {
            subdomain_inode_create() { /* in security/apparmor/lsm.c */
                sd_perm_dentry() { /* in security/apparmor/main.c */
                    _sd_perm_dentry() {
                        sd_path_begin() { /* in security/apparmor/inline.h */
                            sd_path_begin2() {
                                down_read(&namespace_sem);
                            }
                        }
                        ...
                        sd_path_end() {
                            up_read(&namespace_sem);
                        }
                    }
                }
            }
        }
    }
    mutex_unlock(&dir->d_inode->i_mutex);
}

- Kernel 2.6.18.8-0.7 for OpenSuSE 10.2 -

do_add_mount() { /* in fs/namespace.c */
    down_write(&namespace_sem);
    graft_tree() {
        mutex_lock(&nd->dentry->d_inode->i_mutex);
        ...
        mutex_unlock(&nd->dentry->d_inode->i_mutex);
    }
    up_write(&namespace_sem);
}

open_namei() { /* in fs/namei.c */
    mutex_lock(&dir->d_inode->i_mutex);
    vfs_create() {
        security_inode_create() {
            apparmor_inode_create() { /* in security/apparmor/lsm.c */
                aa_perm_dentry() { /* in security/apparmor/lsm.c */
                    _aa_perm_dentry() {
                        aa_path_begin() { /* in security/apparmor/inline.h */
                            aa_path_begin2() {
                                down_read(&namespace_sem);
                            }
                        }
                        ...
                        aa_path_end() {
                            up_read(&namespace_sem);
                        }
                    }
                }
            }
        }
    }
    mutex_unlock(&dir->d_inode->i_mutex);
}

AppArmor shipped with OpenSuSE 10.3 and Ubuntu 7.10 will not be affected, since the kernel was modified to pass a vfsmount parameter to VFS helper functions and LSM hooks.

TOMOYO Linux 2.x (which is implemented using LSM) is also affected and I'm looking for a solution.
http://lkml.org/lkml/2007/11/5/55

A possible solution would be to pass a vfsmount parameter to VFS helper functions and LSM hooks for all kernels. I do hope that the "Pass struct vfsmount to ..." patches are merged into the mainline kernel.

Regards.
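The two call chains above take the same pair of locks in opposite orders, which is what the lockdep report flags. A toy userspace model of that check, assuming only two locks and a simple "seen" matrix, might look like the sketch below. `record_lock_order()` and the enum names are hypothetical, and the real lockdep validator tracks lock classes and whole dependency chains rather than single pairs.

```c
#include <stdbool.h>

enum { LOCK_NAMESPACE_SEM, LOCK_I_MUTEX, NR_LOCKS };

/* seen[a][b] is true once some path took lock b while holding lock a. */
static bool seen[NR_LOCKS][NR_LOCKS];

/*
 * Record "wanted was acquired while held was held".  Returns true when
 * the opposite order was recorded earlier, i.e. a possible AB-BA
 * inversion between the two locks.
 */
bool record_lock_order(int held, int wanted)
{
	seen[held][wanted] = true;
	return seen[wanted][held];
}
```

Feeding it the mount path first (namespace_sem held, i_mutex wanted) records the order without complaint; feeding it the create path afterwards (i_mutex held, namespace_sem wanted) returns true, mirroring the "possible circular locking dependency" message.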
Re: writeout stalls in current -git
On Fri, 2 Nov 2007 18:33:29 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> > On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
> >
> > > Interestingly, no background_writeout() appears, but only
> > > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > > block the process.
> >
> > Yeah, the background threshold is not (yet) scaled. So it can happen
> > that the bdi_dirty limit is below the background limit.
> >
> > I'm curious though as to these stalls, though, I can't seem to think of
> > what goes wrong.. esp since most writeback seems to happen from pdflush.
>
> Me confused too. The new debug patch will confirm whether emerge is
> waiting in balance_dirty_pages().
>
> > (or I'm totally misreading it - quite a possible as I'm still recovering
> > from a serious cold and not all the green stuff has yet figured out its
> > proper place wrt brain cells 'n stuff)
>
> Do take care of yourself.
>
> >
> > I still have this patch floating around:
>
> I think this patch is OK for 2.6.24 :-)
>
> Reviewed-by: Fengguang Wu <[EMAIL PROTECTED]>

I would prefer Tested-by: :(

> >
> > ---
> > Subject: mm: speed up writeback ramp-up on clean systems
> >
> > We allow violation of bdi limits if there is a lot of room on the
> > system. Once we hit half the total limit we start enforcing bdi limits
> > and bdi ramp-up should happen. Doing it this way avoids many small
> > writeouts on an otherwise idle system and should also speed up the
> > ramp-up.

Given the problems we're having in there I'm a bit reluctant to go tossing hastily put together and inadequately tested stuff onto the fire. And that's what this patch looks like to me.

Wanna convince me otherwise?
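The policy in the quoted patch description can be sketched as a small predicate. This follows only the wording of the description ("half the total limit"); the function name, the parameters and the exact comparisons are assumptions, and the actual patch may compute its threshold differently.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: should a writer to this bdi be throttled?
 * All four parameters are page counts; the names are illustrative.
 */
bool should_throttle(unsigned long total_dirty, unsigned long total_limit,
		     unsigned long bdi_dirty, unsigned long bdi_limit)
{
	if (total_dirty > total_limit)
		return true;		/* the global limit always applies */
	if (total_dirty < total_limit / 2)
		return false;		/* plenty of room: ignore the bdi limit */
	return bdi_dirty > bdi_limit;	/* past halfway: enforce the bdi limit */
}
```

On an otherwise clean system the second test lets a writer run past a still-small bdi limit, which is the "avoid many small writeouts" effect the description claims.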
Re: [ANN] Squashfs 3.3 released
Phillip Lougher wrote:
> Hi,
>
> I'm pleased to announce another release of Squashfs. This is the 22nd
> release in just over five years.

Thanks Phillip.

A tiny bug[fix] I always forgot to send... In fs/squashfs/inode.c, constants TASK_UNINTERRUPTIBLE and TASK_INTERRUPTIBLE are used, but they are sometimes not defined (they are declared in linux/sched.h):

  CC [M] fs/squashfs/inode.o
fs/squashfs/inode.c: In function 'squashfs_get_cached_block':
fs/squashfs/inode.c:367: error: 'TASK_UNINTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c:367: error: (Each undeclared identifier is reported only once
fs/squashfs/inode.c:367: error: for each function it appears in.)
fs/squashfs/inode.c:367: warning: implicit declaration of function 'schedule'
fs/squashfs/inode.c:404: error: 'TASK_INTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c: In function 'release_cached_fragment':
fs/squashfs/inode.c:499: error: 'TASK_UNINTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c:499: error: 'TASK_INTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c: In function 'get_cached_fragment':
fs/squashfs/inode.c:522: error: 'TASK_UNINTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c:559: error: 'TASK_INTERRUPTIBLE' undeclared (first use in this function)

I'm not exactly sure which config option is to blame (this is an i486-based UP generic-hardware config), but I'm not interested to know, either. The following trivial patch fixes it once and for all.

--- linux-2.6.22.orig/fs/squashfs/inode.c 2007-07-12 14:57:22.0 +0400
+++ linux-2.6.22/fs/squashfs/inode.c 2007-07-12 14:57:53.0 +0400
@@ -31,6 +31,7 @@
 #include
 #include
+#include <linux/sched.h>
 #include
 #include "squashfs.h"

It was needed for v3.2 too.

Thanks.

/mjt
Re: migratepage failures on reiserfs
On Mon, 5 Nov 2007, Mel Gorman wrote:

> The grow_dev_page() pages should be reclaimable even though migration
> is not supported for those pages? They were marked movable as it was
> useful for lumpy reclaim taking back pages for hugepage allocations and
> the like. Would it make sense for memory unremove to attempt migration
> first and reclaim second?

Note that a page is still movable even if there is no file system method for migration available. In that case the page needs to be cleaned before it can be moved.
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree
On Mon, 5 November 2007 13:01:25 -0800, [EMAIL PROTECTED] wrote:
>
> The patch titled
>      Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
> has been added to the -mm tree. Its filename is
>      embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch
>
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
>
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
>
> --
> Subject: Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
> From: Jan Blunck <[EMAIL PROTECTED]>
>
> Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
>
> Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
> Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
> Cc: Al Viro <[EMAIL PROTECTED]>
> CC:
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>

Frowned-upon-by: Joern Engel <[EMAIL PROTECTED]>

This patch changes some 400 lines, most if not all of which get longer and more complicated to read. 23 get sufficiently longer to require an additional linebreak. I can't remember complexity being invited into the kernel without good reasoning, yet the patch description is surprisingly low on reasoning:

> Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

The following two patches manage to remove 7 lines in total. In total 23 were added, 7 removed, 400+ made longer and more complicated. Is there another more favorable metric? Will this patchset prevent bugs? Shrink the kernel size? Anything?

If churn is the only effect of this, please consider it NAKed again.

Jörn

--
A surrounded army must be given a way out.
-- Sun Tzu
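For readers following the thread, the shape of the change under discussion can be sketched in userspace C. The struct definitions below are simplified, hypothetical stand-ins (the kernel types carry many more fields), and `save_position()` / `restore_position()` are illustrative helpers, not functions from the patch.

```c
/* Hypothetical userspace stand-ins; the kernel types carry more fields. */
struct dentry { int id; };
struct vfsmount { int id; };

struct path {
	struct vfsmount *mnt;
	struct dentry *dentry;
};

struct nameidata {
	struct path path;	/* was: separate dentry and mnt members */
};

/* Save the lookup position with one struct assignment. */
struct path save_position(const struct nameidata *nd)
{
	return nd->path;
}

/* Restore it the same way. */
void restore_position(struct nameidata *nd, struct path saved)
{
	nd->path = saved;
}
```

The sketch shows both sides of the trade-off raised above: every individual member access grows from nd->dentry to nd->path.dentry, while code that saves or passes the dentry and vfsmount together can handle them as one value.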
+ embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs.patch added to -mm tree
The patch titled Embed a struct path into struct nameidata instead of nd->{dentry,mnt} has been added to the -mm tree. Its filename is embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs.patch *** Remember to use Documentation/SubmitChecklist when testing your code *** See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this -- Subject: Embed a struct path into struct nameidata instead of nd->{dentry,mnt} From: Jan Blunck <[EMAIL PROTECTED]> Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere. Signed-off-by: Jan Blunck <[EMAIL PROTECTED]> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]> Acked-by: Christoph Hellwig <[EMAIL PROTECTED]> Cc: Al Viro <[EMAIL PROTECTED]> CC: Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> --- fs/unionfs/inode.c | 12 fs/unionfs/main.c |9 - fs/unionfs/super.c | 17 - 3 files changed, 16 insertions(+), 22 deletions(-) diff -puN fs/unionfs/inode.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs fs/unionfs/inode.c --- a/fs/unionfs/inode.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs +++ a/fs/unionfs/inode.c @@ -168,10 +168,8 @@ static struct dentry *unionfs_lookup(str unionfs_read_lock(dentry->d_sb); /* save the dentry & vfsmnt from namei */ - if (nd) { - path_save.dentry = nd->dentry; - path_save.mnt = nd->mnt; - } + if (nd) + path_save = nd->path; /* * unionfs_lookup_backend returns a locked dentry upon success, @@ -180,10 +178,8 @@ static struct dentry *unionfs_lookup(str ret = unionfs_lookup_backend(dentry, nd, INTERPOSE_LOOKUP); /* restore the dentry & vfsmnt in namei */ - if (nd) { - nd->dentry = path_save.dentry; - nd->mnt = path_save.mnt; - } + if (nd) + nd->path = path_save; if (!IS_ERR(ret)) { if (ret) dentry = ret; diff -puN fs/unionfs/main.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs fs/unionfs/main.c --- 
a/fs/unionfs/main.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs +++ a/fs/unionfs/main.c @@ -227,11 +227,11 @@ void unionfs_reinterpose(struct dentry * int check_branch(struct nameidata *nd) { /* XXX: remove in ODF code -- stacking unions allowed there */ - if (!strcmp(nd->dentry->d_sb->s_type->name, UNIONFS_NAME)) + if (!strcmp(nd->path.dentry->d_sb->s_type->name, UNIONFS_NAME)) return -EINVAL; - if (!nd->dentry->d_inode) + if (!nd->path.dentry->d_inode) return -ENOENT; - if (!S_ISDIR(nd->dentry->d_inode->i_mode)) + if (!S_ISDIR(nd->path.dentry->d_inode->i_mode)) return -ENOTDIR; return 0; } @@ -372,8 +372,7 @@ static int parse_dirs_option(struct supe goto out; } - lower_root_info->lower_paths[bindex].dentry = nd.dentry; - lower_root_info->lower_paths[bindex].mnt = nd.mnt; + lower_root_info->lower_paths[bindex] = nd.path; set_branchperms(sb, bindex, perms); set_branch_count(sb, bindex, 0); diff -puN fs/unionfs/super.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs fs/unionfs/super.c --- a/fs/unionfs/super.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs +++ a/fs/unionfs/super.c @@ -202,8 +202,8 @@ static noinline int do_remount_mode_opti goto out; } for (idx = 0; idx < cur_branches; idx++) - if (nd.mnt == new_lower_paths[idx].mnt && - nd.dentry == new_lower_paths[idx].dentry) + if (nd.path.mnt == new_lower_paths[idx].mnt && + nd.path.dentry == new_lower_paths[idx].dentry) break; path_release(&nd); /* no longer needed */ if (idx == cur_branches) { @@ -245,8 +245,8 @@ static noinline int do_remount_del_optio goto out; } for (idx = 0; idx < cur_branches; idx++) - if (nd.mnt == new_lower_paths[idx].mnt && - nd.dentry == new_lower_paths[idx].dentry) + if (nd.path.mnt == new_lower_paths[idx].mnt && + nd.path.dentry == new_lower_paths[idx].dentry) break; path_release(&nd); /* no longer needed */ if (idx == cur_branches) { @@ -329,8 +329,8 @@ static noinline int do_remount_add_optio goto 
out; } for (idx = 0; idx < cur_branches; idx++) - if (nd.mnt == new_lower_paths[idx].mnt && - nd.dentry == new_lower_paths[idx].dentry) +
+ introduce-path_put.patch added to -mm tree
The patch titled Introduce path_put() has been added to the -mm tree. Its filename is introduce-path_put.patch *** Remember to use Documentation/SubmitChecklist when testing your code *** See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this -- Subject: Introduce path_put() From: Jan Blunck <[EMAIL PROTECTED]> * Add path_put() functions for releasing a reference to the dentry and vfsmount of a struct path in the right order * Switch from path_release(nd) to path_put(&nd->path) * Rename dput_path() to path_put_conditional() Signed-off-by: Jan Blunck <[EMAIL PROTECTED]> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]> Acked-by: Christoph Hellwig <[EMAIL PROTECTED]> Cc: Cc: Al Viro <[EMAIL PROTECTED]> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> --- arch/alpha/kernel/osf_sys.c |2 arch/mips/kernel/sysirix.c |6 - arch/parisc/hpux/sys_hpux.c |2 arch/powerpc/platforms/cell/spufs/syscalls.c |2 arch/sparc64/solaris/fs.c|4 - drivers/md/dm-table.c|2 drivers/mtd/mtdsuper.c |4 - fs/afs/mntpt.c |2 fs/autofs4/root.c|2 fs/block_dev.c |2 fs/coda/pioctl.c |4 - fs/compat.c |4 - fs/configfs/symlink.c|4 - fs/dquot.c |2 fs/ecryptfs/main.c |2 fs/exec.c|4 - fs/ext3/super.c |4 - fs/ext4/super.c |4 - fs/gfs2/ops_fstype.c |2 fs/inotify_user.c|4 - fs/namei.c | 56 + fs/namespace.c | 20 +++--- fs/nfs/namespace.c |2 fs/nfsctl.c |2 fs/nfsd/export.c | 10 +-- fs/nfsd/nfs4recover.c|2 fs/nfsd/nfs4state.c |2 fs/open.c| 22 +++--- fs/proc/base.c |2 fs/reiserfs/super.c |8 +- fs/revoke.c |2 fs/stat.c|6 - fs/utimes.c |2 fs/xattr.c | 16 ++-- fs/xfs/linux-2.6/xfs_ioctl.c |2 include/linux/namei.h|7 -- include/linux/path.h |2 kernel/audit_tree.c | 12 +-- kernel/auditfilter.c |4 - net/sunrpc/rpc_pipe.c|2 net/unix/af_unix.c |6 - 41 files changed, 125 insertions(+), 124 deletions(-) diff -puN arch/alpha/kernel/osf_sys.c~introduce-path_put arch/alpha/kernel/osf_sys.c --- a/arch/alpha/kernel/osf_sys.c~introduce-path_put +++ a/arch/alpha/kernel/osf_sys.c @@ 
-261,7 +261,7 @@ osf_statfs(char __user *path, struct osf retval = user_path_walk(path, &nd); if (!retval) { retval = do_osf_statfs(nd.path.dentry, buffer, bufsiz); - path_release(&nd); + path_put(&nd.path); } return retval; } diff -puN arch/mips/kernel/sysirix.c~introduce-path_put arch/mips/kernel/sysirix.c --- a/arch/mips/kernel/sysirix.c~introduce-path_put +++ a/arch/mips/kernel/sysirix.c @@ -711,7 +711,7 @@ asmlinkage int irix_statfs(const char __ } dput_and_out: - path_release(&nd); + path_put(&nd.path); out: return error; } @@ -1385,7 +1385,7 @@ asmlinkage int irix_statvfs(char __user error |= __put_user(0, &buf->f_fstr[i]); dput_and_out: - path_release(&nd); + path_put(&nd.path); out: return error; } @@ -1636,7 +1636,7 @@ asmlinkage int irix_statvfs64(char __use error |= __put_user(0, &buf->f_fstr[i]); dput_and_out: - path_release(&nd); + path_put(&nd.path); out: return error; } diff -puN arch/parisc/hpux/sys_hpux.c~introduce-path_put arch/parisc/hpux/sys_hpux.c --- a/arch/parisc/hpux/sys_hpux.c~introduce-path_put +++ a/arch/parisc/hpux/sys_hpux.c @@ -222,7 +222,7 @@ asmlinkage long hpux_statfs(const char _ error = vfs_statfs_hpux(nd.path.dentry, &tmp); if (!error && copy_to_user(buf, &tmp, sizeof(tmp))) error = -EFAULT; - path_release(&nd); + path_put(&nd.path); } return error; } di
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 5 Nov 2007, Dave Hansen wrote:
>
> Actually, I think your s/while/if/ change is probably a decent fix.

Any resemblance to a decent fix is purely coincidental.

> Barring any other races, that loop should always have made progress on
> mnt->__mnt_writers the way it is written. If we get to:
>
> > lock_and_coalesce_cpu_mnt_writer_counts();
> ->HERE
> > mnt_unlock_cpus();
>
> and don't have a positive mnt->__mnt_writers, we know something is going
> badly. We WARN_ON() there, which should at least give an earlier
> warning that the system is not doing well. But it doesn't fix the
> inevitable. Could you try the attached patch and see if it at least
> warns you earlier?

Thanks, Dave, yes, that gives me a nice warning:

leak detected on mount(c25ebd80) writers count: -65537
WARNING: at fs/namespace.c:249 handle_write_count_underflow()
 show_trace_log_lvl+0x1b/0x2e
 show_trace+0x16/0x1b
 dump_stack+0x19/0x1e
 handle_write_count_underflow+0x4c/0x60
 mnt_drop_write+0x69/0x8e
 __fput+0xff/0x162
 fput+0x2e/0x33
 unionfs_file_release+0xc2/0x1c5
 __fput+0x8f/0x162
 fput+0x2e/0x33
 filp_close+0x50/0x5d
 sys_close+0x74/0xb4
 sysenter_past_esp+0x5f/0x85

and the test then goes quietly on its way instead of hanging.

Though I imagine, with your patch or mine, that it's then making an unfortunate frequency of calls to lock_and_coalesce_longer_name_than_I_care_to_type thereafter. But it's hardly your responsibility to optimize for bugs elsewhere.

The 2.6.23-mm1 tree has MNT_USER at 0x200, so I adjusted your flag to

#define MNT_IMBALANCED_WRITE_COUNT 0x400 /* just for debugging */

>
> I have a decent guess what the bug is, too. In the unionfs code:

I'll let Erez take it from there...

Hugh
[3/4] Distributed storage. Algorithms.
Mirror and linear data stripping algorithms for DST. Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]> diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c new file mode 100644 index 000..cb77b57 --- /dev/null +++ b/drivers/block/dst/alg_linear.c @@ -0,0 +1,104 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include +#include +#include + +static struct dst_alg *alg_linear; + +/* + * This callback is invoked when node is removed from storage. + */ +static void dst_linear_del_node(struct dst_node *n) +{ +} + +/* + * This callback is invoked when node is added to storage. + */ +static int dst_linear_add_node(struct dst_node *n) +{ + struct dst_storage *st = n->st; + + dprintk("%s: disk_size: %llu, node_size: %llu.\n", + __func__, st->disk_size, n->size); + + mutex_lock(&st->tree_lock); + n->start = st->disk_size; + st->disk_size += n->size; + mutex_unlock(&st->tree_lock); + + return 0; +} + +static int dst_linear_remap(struct dst_request *req) +{ + int err; + + if (req->node->bdev) { + generic_make_request(req->bio); + return 0; + } + + err = kst_check_permissions(req->state, req->bio); + if (err) + return err; + + return req->state->ops->push(req); +} + +/* + * Failover callback - it is invoked each time error happens during + * request processing. 
+ */ +static int dst_linear_error(struct kst_state *st, int err) +{ + if (err) + set_bit(DST_NODE_FROZEN, &st->node->flags); + else + clear_bit(DST_NODE_FROZEN, &st->node->flags); + return 0; +} + +static struct dst_alg_ops alg_linear_ops = { + .remap = dst_linear_remap, + .add_node = dst_linear_add_node, + .del_node = dst_linear_del_node, + .error = dst_linear_error, + .owner = THIS_MODULE, +}; + +static int __devinit alg_linear_init(void) +{ + alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops); + if (!alg_linear) + return -ENOMEM; + + return 0; +} + +static void __devexit alg_linear_exit(void) +{ + dst_remove_alg(alg_linear); +} + +module_init(alg_linear_init); +module_exit(alg_linear_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Evgeniy Polyakov <[EMAIL PROTECTED]>"); +MODULE_DESCRIPTION("Linear distributed algorithm."); diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c new file mode 100644 index 000..1b55f4d --- /dev/null +++ b/drivers/block/dst/alg_mirror.c @@ -0,0 +1,1113 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include +#include +#include +#include +#include + +struct dst_mirror_node_data +{ + u64 age; +}; + +struct dst_mirror_priv +{ + unsigned intchunk_num; + + u64 last_start; + + spinlock_t backlog_lock; + struct list_headbacklog_list; + + struct dst_mirror_node_data old_data, new_data; + + unsigned long *chunk; +}; + +static struct dst_alg *alg_mirror; +static struct bio_set *dst_mirror_bio_set; + +static int dst_mirror_resync(struct dst_node *n, int ndp); + +static void dst_mirror_mark_sync(struct dst_node *n) +{ + if (test_bit(DST_NODE_NOTSYNC, &n->flags)) { + struct dst_mirror_priv *priv = n->priv; + + clear_bit(DST_NODE_NOTSYNC, &n->flags); + dprintk("%s: node: %p, %llu:%llu synchronization " + "has been completed.\n", + __func__, n, n->start, n->size); + priv->old_data.age = 0; + } +} + +static void dst_mirror_mark_notsync(struct dst_node *n) +{ + if (!test_bit(DST_NODE_NOTSYNC, &n->flags)) { +
[4/4] Distributed storage. Core interfaces.
This one contains core interfaces of the distributed storage, storage and node initialization and cleanup code, block layer callbacks and the like. It also contains Kconfig and Makefile changes. Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index b4c8319..ca6592d 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -451,6 +451,8 @@ config ATA_OVER_ETH This driver provides Support for ATA over Ethernet block devices like the Coraid EtherDrive (R) Storage Blade. +source "drivers/block/dst/Kconfig" + source "drivers/s390/block/Kconfig" endmenu diff --git a/drivers/block/Makefile b/drivers/block/Makefile index dd88e33..fcf042d 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o obj-$(CONFIG_BLK_DEV_SX8) += sx8.o obj-$(CONFIG_BLK_DEV_UB) += ub.o +obj-$(CONFIG_DST) += dst/ diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig new file mode 100644 index 000..d35e0cc --- /dev/null +++ b/drivers/block/dst/Kconfig @@ -0,0 +1,21 @@ +config DST + tristate "Distributed storage" + depends on NET + select CONNECTOR + select LIBCRC32C + ---help--- + This driver allows to create a distributed storage. + +config DST_ALG_LINEAR + tristate "Linear distribution algorithm" + depends on DST + ---help--- + This module allows to create linear mapping of the nodes + in the distributed storage. + +config DST_ALG_MIRROR + tristate "Mirror distribution algorithm" + depends on DST + ---help--- + This module allows to create a mirror of the noes in the + distributed storage. 
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile new file mode 100644 index 000..1400e94 --- /dev/null +++ b/drivers/block/dst/Makefile @@ -0,0 +1,6 @@ +obj-$(CONFIG_DST) += dst.o + +dst-y := dcore.o kst.o + +obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o +obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c new file mode 100644 index 000..2b3ef10 --- /dev/null +++ b/drivers/block/dst/dcore.c @@ -0,0 +1,1608 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +static LIST_HEAD(dst_storage_list); +static LIST_HEAD(dst_alg_list); +static DEFINE_MUTEX(dst_storage_lock); +static DEFINE_MUTEX(dst_alg_lock); +static int dst_major; +static struct kst_worker *kst_main_worker; +static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL }; + +struct kmem_cache *dst_request_cache; + +static char dst_name[] = "Squizzed black-out of the dancing back-aching hippo"; + +/* + * DST sysfs tree. 
For device called 'storage' which is formed + * on top of two nodes this looks like this: + * + * /sys/devices/storage/ + * /sys/devices/storage/alg : alg_linear + * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025 + * /sys/devices/storage/n-800/size : 800 + * /sys/devices/storage/n-800/start : 800 + * /sys/devices/storage/n-800/clean + * /sys/devices/storage/n-800/dirty + * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025 + * /sys/devices/storage/n-0/size : 800 + * /sys/devices/storage/n-0/start : 0 + * /sys/devices/storage/n-0/clean + * /sys/devices/storage/n-0/dirty + * /sys/devices/storage/remove_all_nodes + * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800] + * /sys/devices/storage/name : storage + */ + +static int dst_dev_match(struct device *dev, struct device_driver *drv) +{ + return 1; +} + +static void dst_dev_release(struct device *dev) +{ +} + +static struct bus_type dst_dev_bus_type = { + .name = "dst", + .match = &dst_dev_match, +}; + +static struct device dst_dev = { + .bus= &dst_dev_bus_type, + .release= &dst_dev_release +}; + +static void dst_node_release(struct device *dev) +{ +} + +static struct device dst_node_dev = { + .release= &dst_node_release +}; + +static void dst_free_alg(struct dst_alg *alg) +{ + kfree(alg); +} + +/* + * Algorithm is never freed directly, + * since its module reference counter is increased + * by storage when it is created - just
[2/4] Distributed storage. Network processing.
This file contains all bits needed for async non-blocking network processing of the block requests directed to DST. Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]> diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c new file mode 100644 index 000..ba5e5ef --- /dev/null +++ b/drivers/block/dst/kst.c @@ -0,0 +1,1475 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +struct kst_poll_helper +{ + poll_table pt; + struct kst_state*st; +}; + +static LIST_HEAD(kst_worker_list); +static DEFINE_MUTEX(kst_worker_mutex); + +/* + * This function creates bound socket for local export node. 
+ */ +static int kst_sock_create(struct kst_state *st, struct saddr *addr, + int type, int proto, int backlog) +{ + int err; + + err = sock_create(addr->sa_family, type, proto, &st->socket); + if (err) + goto err_out_exit; + + err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr, + addr->sa_data_len); + + err = st->socket->ops->listen(st->socket, backlog); + if (err) + goto err_out_release; + + st->socket->sk->sk_allocation = GFP_NOIO; + + return 0; + +err_out_release: + sock_release(st->socket); +err_out_exit: + return err; +} + +static void kst_sock_release(struct kst_state *st) +{ + if (st->socket) { + sock_release(st->socket); + st->socket = NULL; + } +} + +void kst_wake(struct kst_state *st) +{ + if (st) { + struct kst_worker *w = st->node->w; + unsigned long flags; + + spin_lock_irqsave(&w->ready_lock, flags); + if (list_empty(&st->ready_entry)) + list_add_tail(&st->ready_entry, &w->ready_list); + spin_unlock_irqrestore(&w->ready_lock, flags); + + wake_up(&w->wait); + } +} +EXPORT_SYMBOL_GPL(kst_wake); + +/* + * Polling machinery. + */ +static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode, + int sync, void *key) +{ + struct kst_state *st = container_of(wait, struct kst_state, wait); + kst_wake(st); + return 1; +} + +static void kst_queue_func(struct file *file, wait_queue_head_t *whead, +poll_table *pt) +{ + struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st; + + st->whead = whead; + init_waitqueue_func_entry(&st->wait, kst_state_wake_callback); + add_wait_queue(whead, &st->wait); +} + +static void kst_poll_exit(struct kst_state *st) +{ + if (st->whead) { + remove_wait_queue(st->whead, &st->wait); + st->whead = NULL; + } +} + +/* + * This function removes request from state tree and ordering list. 
+ */ +void kst_del_req(struct dst_request *req) +{ + list_del_init(&req->request_list_entry); +} +EXPORT_SYMBOL_GPL(kst_del_req); + +static struct dst_request *kst_req_first(struct kst_state *st) +{ + struct dst_request *req = NULL; + + if (!list_empty(&st->request_list)) + req = list_entry(st->request_list.next, struct dst_request, + request_list_entry); + return req; +} + +/* + * This function dequeues first request from the queue and tree. + */ +static struct dst_request *kst_dequeue_req(struct kst_state *st) +{ + struct dst_request *req; + + mutex_lock(&st->request_lock); + req = kst_req_first(st); + if (req) + kst_del_req(req); + mutex_unlock(&st->request_lock); + return req; +} + +/* + * This function enqueues request into tree, indexed by start of the request, + * and also puts request into ordered queue. + */ +int kst_enqueue_req(struct kst_state *st, struct dst_request *req) +{ + if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) { + struct dst_request *r; + + list_for_each_entry(r, &st->request_list, request_list_entry) { + if (bio_rw(r->bio) != bio_rw(req->bio)) + continue; + + if (r->start >= req->start + req->size) + continue; + + if (r->start + r->size <= req->sta
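The loop in kst_enqueue_req() above skips queued requests that cannot collide with the new one. Its last comparison is cut off in this archive copy, so the following is only a minimal sketch of the half-open interval test the visible lines suggest. The sketch_range type and the function name are hypothetical, and the second condition is an assumption, not a reconstruction of the truncated line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the start/size pair of a request. */
struct sketch_range {
	uint64_t start;	/* first sector */
	uint64_t size;	/* number of sectors */
};

/*
 * Two half-open ranges [start, start + size) overlap unless one of
 * them begins at or after the end of the other.
 */
static bool sketch_ranges_overlap(const struct sketch_range *a,
				  const struct sketch_range *b)
{
	assert(a && b);

	if (a->start >= b->start + b->size)
		return false;	/* a lies entirely after b */
	if (a->start + a->size <= b->start)
		return false;	/* a lies entirely before b */
	return true;
}
```

Ranges that merely touch, such as sectors 0-7 and 8-15, do not count as overlapping under this test.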
[1/4] Distributed storage. Documentation.
DST documentation. Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]> diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt new file mode 100644 index 000..1437a6a --- /dev/null +++ b/Documentation/dst/algorithms.txt @@ -0,0 +1,115 @@ +Each storage by itself is just a set of contiguous logical blocks, with +allowed number of operations. Nodes, each of which has own start and size, +are placed into storage by appropriate algorithm, which remaps +logical sector number into real node's sector. One can create +own algorithms, since DST has pluggable interface for that. +Currently mirrored and linear algorithms are supported. + +Let's briefly describe how they work. + +Linear algorithm. +Simple approach of concatenating storages into single device with +increased size is used in this algorithm. Essentially new device +has size equal to sum of sizes of underlying nodes and nodes are +placed one after another. + + /- Node 1 ---\ /-- Node 3 \ +start end start end + |==||==| + |start end | + | \--- Node 2 -/ | + | | +start end + \-- DST storage --/ + + /\ + || + || + + IO operations + + Figure 1. + 3 nodes combined into single storage using linear algorithm. + +Mirror algorithm. +In this algorithms nodes are placed under each other, so when +operation comes to the first one, it can be mirrored to all +underlying nodes. In case of reading, actual data is obtained from +the nearest node - algoritm keeps track of previous operation +and knows where it was stopped, so that subsequent seek to the +start of the new request will take the shortest time. +Writing is always mirrored to all underlying nodes. + + IO operations + || + || + \/ + +| DST storage ---| +| prev position | +|---| Node 1 | +| prev pos | +| Node 2 -|--| +|prev pos| +|---| Node 3 | + + Figure 2. + 3 nodes combined into single storage using mirror algorithm. + +Each algorithm must implement number of callbacks, +which must be registered during initialization time. 
+ +struct dst_alg_ops +{ + int (*add_node)(struct dst_node *n); + void(*del_node)(struct dst_node *n); + int (*remap)(struct dst_request *req); + int (*error)(struct kst_state *state, int err); + struct module *owner; +}; + [EMAIL PROTECTED] +This callback is invoked when new node is being added into the storage, +but before node is actually added into the storage, so that it could +be accessed from it. When it is called, all appropriate initialization +of the underlying device is already completed (system has been connected +to remote node or got a reference to the local block device). At this +stage algorithm can add node into private map. +It must return zero on success or negative value otherwise. + [EMAIL PROTECTED] +This callback is invoked when node is being deleted from the storage, +i.e. when its reference counter hits zero. It is called before +any cleaning is performed. +It must return zero on success or negative value otherwise. + [EMAIL PROTECTED] +This callback is invoked each time new bio hits the storage. +Request structure contains BIO itself, pointer to the node, which originally +stores the whole region under given IO request, and various parameters +used by storage core to process this block request. +It must return zero on success or negative value otherwise. It is upto +this method to call all cleaning if remapping failed, for example it must +call kst_bio_endio() for given callback in case of error, which in turn +will call bio_endio(). Note, that dst_request structure provided in this +callback is allocated on stack, so if there is a need to use it outside +of the given function, it must be cloned (it will happen automatically +in state's push callback, but that copy will not be shared by any other +user). + [EMAIL PROTECTED] +This callback is invoked for each error, which happend when processed +requests for remote nodes or when talking to remote size +of the local export node (state contains data related to data +transfers over the networ
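The two algorithms described in this documentation can be illustrated with a small userspace sketch. The append step mirrors what dst_linear_add_node() in [3/4] does (start = current disk size, then grow the disk size). The lookup and the "nearest node" choice are assumptions made for illustration: the sketch_* names, the last_pos field and the distance heuristic are hypothetical and are not taken from the DST sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a DST node. */
struct sketch_node {
	uint64_t start;		/* first logical sector (linear setup) */
	uint64_t size;		/* number of sectors */
	uint64_t last_pos;	/* where the previous operation stopped */
};

/* Linear: place the node right after everything added so far. */
static void sketch_linear_add(struct sketch_node *node, uint64_t *disk_size)
{
	assert(node && disk_size);
	node->start = *disk_size;
	*disk_size += node->size;
}

/*
 * Linear: find the node holding a logical sector and the sector's
 * offset inside that node. Returns the node index, or -1 if the
 * sector lies beyond the end of the storage.
 */
static int sketch_linear_lookup(const struct sketch_node *nodes, size_t count,
				uint64_t sector, uint64_t *local)
{
	assert(nodes && local);
	for (size_t i = 0; i < count; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*local = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}

/*
 * Mirror: pick the node whose previous position is closest to the
 * requested sector. Writes would go to every node instead.
 */
static int sketch_mirror_pick_reader(const struct sketch_node *nodes,
				     size_t count, uint64_t sector)
{
	int best = -1;
	uint64_t best_dist = 0;

	assert(nodes);
	for (size_t i = 0; i < count; i++) {
		uint64_t pos = nodes[i].last_pos;
		uint64_t dist = pos > sector ? pos - sector : sector - pos;

		if (best < 0 || dist < best_dist) {
			best = (int)i;
			best_dist = dist;
		}
	}
	return best;
}
```

With two nodes of 800 sectors each, the linear layout matches the sysfs example in [4/4]: one node starts at 0 and the other at 800.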
[0/4] Distributed storage. Squizzed black-out of the dancing back-aching hippo.
Hi.

I'm pleased to announce the 7th and final release of the distributed storage subsystem (DST). It allows forming a storage on top of local and remote nodes and combining them in a linear or mirroring setup, which in turn can be exported to remote nodes.

Short changelog:
* added strong checksum support (Castagnoli crc)
* extended autoconfiguration (added ability to ask whether the remote side supports strong checksum and turn it on if needed)
* documentation addon - sysfs files
* added clean/dirty sysfs files which allow marking a node as clean (in sync) or dirty (not in sync)
* fair number of bug fixes (including really tricky bastards, which are unlikely to be found in real setups, but which were still bugs)
* and the main one - added release name (it clearly shows my condition)

The overall list of DST features can be found on the project's homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

--
	Evgeniy Polyakov
Re: writeout stalls in current -git
On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Now testing v2.6.24-rc1-650-gb55d1b1 + the fix for the misapplied raid5-patch.
Applying your two patches on top of that does not fix the stalls.

vmstat 10 output from unmerging (uninstalling) a kernel:
1 0 0 3512188332 19264400 18512 368 735 10 3 85 1
-> emerge starts to remove the kernel source files
3 0 0 3506624332 1928360015 9825 2458 8307 7 12 81 0
0 0 0 3507212332 19283600 0 554 630 1233 0 1 99 0
0 0 0 3507292332 19283600 0 537 580 1328 0 1 99 0
0 0 0 3507168332 19283600 0 633 626 1380 0 1 99 0
0 0 0 3507116332 19283600 0 1510 768 2030 1 2 97 0
0 0 0 3507596332 19283600 0 524 540 1544 0 0 99 0
procs ---memory-- ---swap-- -io -system-- cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3507540332 19283600 0 489 551 1293 0 0 99 0
0 0 0 3507528332 19283600 0 527 510 1432 1 1 99 0
0 0 0 3508052332 19284000 0 2088 910 2964 2 3 95 0
0 0 0 3507888332 19284000 0 442 565 1383 1 1 99 0
0 0 0 3508704332 19284000 0 497 529 1479 0 0 99 0
0 0 0 3508704332 19284000 0 594 595 1458 0 0 99 0
0 0 0 3511492332 19284000 0 2381 1028 2941 2 3 95 0
0 0 0 3510684332 19284000 0 699 600 1390 0 0 99 0
0 0 0 3511636332 19284000 0 741 661 1641 0 0 100 0
0 0 0 3524020332 19284000 0 2452 1080 3910 2 3 95 0
0 0 0 3524040332 19284400 0 530 617 1297 0 0 99 0
0 0 0 3524128332 19284400 0 812 674 1667 0 1 99 0
0 0 0 3527000332 19367200 339 721 754 1681 3 2 93 1
-> emerge is finished, no dirty or writeback data in /proc/meminfo
0 0 0 3571056332 19476800 111 639 632 1344 0 1 99 0
0 0 0 3571260332 19476800 0 757 688 1405 1 0 99 0
0 0 0 3571156332 19476800 0 753 641 1361 0 0 99 0
0 0 0 3571404332 19476800 0 766 653 1389 0 0 99 0
1 0 0 3571136332 19476800 6 764 669 1488 0 0 99 0
0 0 0 3571668332 19482400 0 764 657 1482 0 0 99 0
0 0 0 3571848332 19482400 0 673 659 1406 0 0 99 0
0 0 0 3571908332 1950520022 753 638 1500 0 1 99 0
0 0 0 3573052332 19505200 0 765 631 1482 0 1 99 0
0 0 0 3574144332 19505200 0 771 640 1497 0 0 99 0
0 0 0 3573468332 19505200 0 458 485 1251 0 0 99 0
0 0 0 3574184332 19505200 0 427 474 1192 0 0 100 0
0 0 0 3575092332 19505200 0 461 482 1235 0 0 99 0
0 0 0 3576368332 19505600 0 582 556 1310 0 0 99 0
0 0 0 3579300332 19505600 0 695 571 1402 0 0 99 0
0 0 0 3580376332 19505600 0 417 568 906 0 0 99 0
0 0 0 3581212332 19505600 0 421 559 977 0 1 99 0
0 0 0 3583780332 19506000 0 494 555 1080 0 1 99 0
0 0 0 3584352332 19506000 099 347 559 0 0 99 0
0 0 0 3585232332 19506000 011 301 621 0 0 99 0
-> disks go idle.

So these patches do not seem to be the source of these excessive disk writes...

Torsten
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 2007-11-05 at 15:40 +, Hugh Dickins wrote:
> The second problem was a hang: all cpus in
> handle_write_count_underflow
> doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave
> Hansen. At first I thought that was a locking problem in Dave's code,
> but I now suspect it's that your unionfs reference counting is wrong
> somewhere, and the error accumulates until __mnt_writers drops below
> MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help
> and we're stuck in that loop.

I've never actually seen this happen in practice, but I do know exactly what you're talking about.

> but I hope Dave can
> also make handle_write_count_underflow more robust, it's unfortunate
> if refcount errors elsewhere first show up as a hang there.

Actually, I think your s/while/if/ change is probably a decent fix. Barring any other races, that loop should always have made progress on mnt->__mnt_writers the way it is written. If we get to:

> lock_and_coalesce_cpu_mnt_writer_counts();
->HERE
> mnt_unlock_cpus();

and don't have a positive mnt->__mnt_writers, we know something is going badly. We WARN_ON() there, which should at least give an earlier warning that the system is not doing well. But it doesn't fix the inevitable. Could you try the attached patch and see if it at least warns you earlier?

I have a decent guess what the bug is, too. In the unionfs code:

> int init_lower_nd(struct nameidata *nd, unsigned int flags)
> {
> ...
> #ifdef ALLOC_LOWER_ND_FILE
>         file = kzalloc(sizeof(struct file), GFP_KERNEL);
>         if (unlikely(!file)) {
>                 err = -ENOMEM;
>                 break; /* exit switch statement and thus return */
>         }
>         nd->intent.open.file = file;
> #endif /* ALLOC_LOWER_ND_FILE */

The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at __fput() time. Since that code never got a write on the mount, we'll see an imbalance if the file was opened for a write. I don't see this file's mnt set anywhere, so I'm not completely sure that this is it.
In any case, rolling your own 'struct file' without using alloc_file() and friends is a no-no. BTW, I have some "debugging" code in my latest set of patches that I think should fix this kind of imbalance with the mnt->__mnt_writers(). It ensures that before we do that mnt_drop_write() at __fput() that we absolutely did a mnt_want_write() at some point in the 'struct file's life. -- Dave linux-2.6.git-dave/fs/namespace.c| 31 ++- linux-2.6.git-dave/include/linux/mount.h |1 + 2 files changed, 23 insertions(+), 9 deletions(-) diff -puN fs/namei.c~fix-naughty-loop fs/namei.c diff -puN fs/namespace.c~fix-naughty-loop fs/namespace.c --- linux-2.6.git/fs/namespace.c~fix-naughty-loop 2007-11-05 08:03:59.0 -0800 +++ linux-2.6.git-dave/fs/namespace.c 2007-11-05 08:35:06.0 -0800 @@ -225,16 +225,29 @@ static void lock_and_coalesce_cpu_mnt_wr */ static void handle_write_count_underflow(struct vfsmount *mnt) { - while (atomic_read(&mnt->__mnt_writers) < - MNT_WRITER_UNDERFLOW_LIMIT) { - /* -* It isn't necessary to hold all of the locks -* at the same time, but doing it this way makes -* us share a lot more code. -*/ - lock_and_coalesce_cpu_mnt_writer_counts(); - mnt_unlock_cpus(); + if (atomic_read(&mnt->__mnt_writers) >= + MNT_WRITER_UNDERFLOW_LIMIT) + return; + /* +* It isn't necessary to hold all of the locks +* at the same time, but doing it this way makes +* us share a lot more code. +*/ + lock_and_coalesce_cpu_mnt_writer_counts(); + /* +* If coalescing the per-cpu writer counts did not +* get us back to a positive writer count, we have +* a bug. 
+*/ + if ((atomic_read(&mnt->__mnt_writers) < 0) && + !(mnt->mnt_flags & MNT_IMBALANCED_WRITE_COUNT)) { + printk("leak detected on mount(%p) writers count: %d\n", + mnt, atomic_read(&mnt->__mnt_writers)); + WARN_ON(1); + /* use the flag to keep the dmesg spam down */ + mnt->mnt_flags |= MNT_IMBALANCED_WRITE_COUNT; } + mnt_unlock_cpus(); } /** diff -puN include/linux/mount.h~fix-naughty-loop include/linux/mount.h --- linux-2.6.git/include/linux/mount.h~fix-naughty-loop2007-11-05 08:22:21.0 -0800 +++ linux-2.6.git-dave/include/linux/mount.h2007-11-05 08:28:20.0 -0800 @@ -32,6 +32,7 @@ struct mnt_namespace; #define MNT_READONLY 0x40/* does the user want this to be r/o? */ #define MNT_SHRINKABLE 0x100 +#define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */ #define MN
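The behaviour discussed in this message can be modelled in userspace. This is a minimal sketch under stated assumptions, not the kernel code: the sketch_* names, the fixed CPU count, the plain int counters (no atomics, no locking) and the limit value are all hypothetical. It shows why a loop that coalesces until the count recovers cannot terminate once the count is genuinely imbalanced, and how a flag keeps the warning to a single report.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_NR_CPUS		4
#define SKETCH_UNDERFLOW_LIMIT	(-(1 << 16))	/* assumed value */
#define SKETCH_FLAG_IMBALANCED	0x1u		/* stand-in for the debug flag */

/* Hypothetical model of a mount with a central and per-cpu writer counts. */
struct sketch_mount {
	int central_writers;
	int percpu_writers[SKETCH_NR_CPUS];
	unsigned int flags;
	int warnings;	/* how many times the leak was reported */
};

/* Fold every per-cpu count into the central count. */
static void sketch_coalesce(struct sketch_mount *mnt)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		mnt->central_writers += mnt->percpu_writers[cpu];
		mnt->percpu_writers[cpu] = 0;
	}
}

/*
 * One pass instead of a loop: if coalescing does not bring the count
 * back to a non-negative value, repeating it cannot help either, so
 * report the imbalance once and carry on.
 */
static void sketch_handle_underflow(struct sketch_mount *mnt)
{
	assert(mnt);

	if (mnt->central_writers >= SKETCH_UNDERFLOW_LIMIT)
		return;

	sketch_coalesce(mnt);

	if (mnt->central_writers < 0 &&
	    !(mnt->flags & SKETCH_FLAG_IMBALANCED)) {
		fprintf(stderr, "leak detected, writers count: %d\n",
			mnt->central_writers);
		mnt->warnings++;
		mnt->flags |= SKETCH_FLAG_IMBALANCED;
	}
}
```

In the balanced case the per-cpu counts cancel the negative central count and nothing is reported. In the imbalanced case every later call still pays for a coalescing pass, which is the cost Hugh points out in his reply.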
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
[Dave, I've Cc'ed you re handle_write_count_underflow, see below.]

On Wed, 31 Oct 2007, Erez Zadok wrote:
>
> Hi Hugh, I've addressed all of your concerns and am happy to report that the
> newly revised unionfs_writepage works even better, including under my
> memory-pressure conditions. To summarize my changes since the last time:
>
> - I'm only masking __GFP_FS, not __GFP_IO
> - using find_or_create_page to avoid locking issues around mapping mask
> - handle for_reclaim case more efficiently
> - using copy_highpage so we handle KM_USER*
> - un/locking upper/lower page as/when needed
> - updated comments to clarify what/why
> - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used
>   to have it)
>
> Below is the newest version of unionfs_writepage. Let me know what you
> think.
>
> I have to say that with these changes, unionfs appears visibly faster under
> memory pressure. I suspect the for_reclaim handling is probably the largest
> contributor to this speedup.

That's good news, and that unionfs_writepage looks good to me - with three reservations I've not observed before.

One, I think you would be safer to do a set_page_dirty(lower_page) before your clear_page_dirty_for_io(lower_page). I know that sounds silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io: there's a lot of subtlety hereabouts, and I think you'd be mimicking the usual path closer if you set_page_dirty first - there's nothing else doing it on that lower_page, is there? I'm not certain that you need to, but I think you'd do well to look into it and make up your own mind.

Two, I'm unsure of the way you're clearing or setting PageUptodate on the upper page there. The rules for PageUptodate are fairly obvious when reading, but when a write fails, it's not so obvious. Again, I'm not saying what you've got is wrong (it may be unavoidable, to keep synch between lower and upper), but it deserves a second thought.
Three, I believe you need to add a flush_dcache_page(lower_page) after the copy_highpage(lower_page): some architectures will need that to see the new data if they have lower_page mapped (though I expect it's anyway shaky ground to be accessing through the lower mount at the same time as modifying through the upper). I've been trying this out on 2.6.23-mm1 with your 21 Oct 1-9/9 and your 2 Nov 1-8/8 patches applied (rejects being patches which were already in 2.6.23-mm1). I was hoping to reproduce the BUG_ON(entry->val) that I fear from shmem_writepage(), before fixing it; but not seen that at all yet - that might be good news, but it's more likely I just haven't tried hard enough yet. For now I'm doing repeated make -j20 kernel builds, pushing into swap, in a unionfs mount of just a single dir on tmpfs. This has shown up several problems, two of which I've had to hack around to get further. The first: I very quickly hit "BUG: atomic counter underflow" from -mm's i386 atomic_dec_and_test: from filp_close calling unionfs_flush. I did a little test fork()ing while holding a file open on unionfs, and indeed it appears that your totalopens code is broken, being unaware of how fork() bumps up a file count without an open. That's rather basic, I'm puzzled that this has remained undiscovered until now - or perhaps it's just a recent addition. It looked to me as if the totalopens count was about avoiding some d_deleted processing in unionfs_flush, which actually should be left until unionfs_release (and that your unionfs_flush ought to be calling the lower level flush in all cases). To get going, I've been running with the quick hack patch below: but I've spent very little time thinking about it, plus it's a long time since I've done VFS stuff; so that patch may be nothing but an embarrassment that reflects neither your intentions nor the VFS rules! And it may itself be responsible for the further problems I've seen. 
The second problem was a hang: all cpus in handle_write_count_underflow doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave Hansen. At first I thought that was a locking problem in Dave's code, but I now suspect it's that your unionfs reference counting is wrong somewhere, and the error accumulates until __mnt_writers drops below MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help and we're stuck in that loop.

My even greater hack to solve that one was to change Dave's "while" to "if"! Then indeed tests can run for some while.

As I say, my suspicion is that the actual error is within unionfs (perhaps introduced by my first hack); but I hope Dave can also make handle_write_count_underflow more robust, it's unfortunate if refcount errors elsewhere first show up as a hang there.

I've had CONFIG_UNION_FS_DEBUG=y but will probably turn it off when I come back to this, since it's rather noisy at present. I've not checked whether its reports are peculiar to having tmpfs below or not. I get lots of "unionfs: new lower inode mtime" r
2008 Linux Storage and Filesystem Workshop
Hello everyone, The position statement submission system for the 2008 storage and filesystem workshop is now online. This is how you let us know you're interested in attending and what topics are most important for discussion. For all the details, please see: http://www.usenix.org/events/lsf08/ -chris - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: migratepage failures on reiserfs
On Mon, 5 Nov 2007 10:23:35 + [EMAIL PROTECTED] (Mel Gorman) wrote:

> On (01/11/07 10:10), Badari Pulavarty didst pronounce:
>
> > > Hmpf, my first reply had a paragraph about the block device inode
> > > pages, I noticed the phrase file data pages and deleted it ;)
> > >
> > > But, for the metadata buffers there's not much we can do. They
> > > are included in a bunch of different lists and the patch would
> > > be non-trivial.
> >
> > Unfortunately, these buffer pages are spread all around making
> > those sections of memory non-removable. Of course, one can use
> > ZONE_MOVABLE to make sure to guarantee the remove. But I am
> > hoping we could easily group all these allocations and minimize
> > spreading them around. Mel ?
>
> The grow_dev_page() pages should be reclaimable even though migration
> is not supported for those pages? They were marked movable as it was
> useful for lumpy reclaim taking back pages for hugepage allocations
> and the like. Would it make sense for memory unremove to attempt
> migration first and reclaim second?
>

In this case, reiserfs has the page pinned while it is doing journal magic. Not sure if ext3 has the same issues.

-chris
Re: [ANN] Squashfs 3.3 released
On Mon, Nov 05, 2007 at 11:13:14AM +, Phillip Lougher wrote:
> I'm pleased to announce another release of Squashfs. This is the 22nd
> release in just over five years. Squashfs 3.3 has lots of nice
> improvements,
> both to the filesystem itself (bigger blocks and sparse files), but
> also to the Squashfs-tools Mksquashfs and Unsquashfs.
>
> The next stage after this release is to fix the one remaining blocking issue
> (filesystem endianness), and then try to get Squashfs mainlined into the
> Linux kernel again.
>

that would be very cool!

with my hat as debian kernel maintainer i'd be very relieved to see it mainlined. i don't know of any major distro that doesn't ship it.

thanks

-- maks
[ANN] Squashfs 3.3 released
Hi,

I'm pleased to announce another release of Squashfs. This is the 22nd release in just over five years. Squashfs 3.3 has lots of nice improvements, both to the filesystem itself (bigger blocks and sparse files), but also to the Squashfs-tools Mksquashfs and Unsquashfs.

The next stage after this release is to fix the one remaining blocking issue (filesystem endianness), and then try to get Squashfs mainlined into the Linux kernel again.

The list of changes from the change-log is as follows:

1. Filesystem improvements:

   1.1. Maximum block size has been increased to 1 Mbyte, and the default block size has been increased to 128 Kbytes. This improves compression.

   1.2. Sparse files are now supported. Sparse files are files which have large areas of unallocated data commonly called holes. These files are now detected by Squashfs and stored more efficiently. This improves compression and read performance for sparse files.

2. Mksquashfs improvements:

   2.1. Exclude files have been extended to use wildcard pattern matching and regular expressions. Support has also been added for non-anchored excludes, which means it is now possible to specify excludes which match anywhere in the filesystem (i.e. leaf files), rather than always having to specify exclude files starting from the root directory (anchored excludes).

   2.2. Recovery files are now created when appending to existing Squashfs filesystems. This allows the original filesystem to be recovered if Mksquashfs aborts unexpectedly (e.g. power failure).

3. Unsquashfs improvements:

   3.1. Multiple extract files can now be specified on the command line, and the files/directories to be extracted can now also be given in a file.

   3.2. Extract files have been extended to use wildcard pattern matching and regular expressions.

   3.3. Filename printing has been enhanced and Unsquashfs can now display filenames with file attributes ('ls -l' style output).

   3.4. A -stat option has been added which displays the filesystem superblock information.

   3.5. Unsquashfs now supports 1.x filesystems.

4. Miscellaneous improvements/bug fixes:

   4.1. Squashfs kernel code improved to use SetPageError in squashfs_readpage() if I/O error occurs.

   4.2. Fixed Squashfs kernel code bug preventing file seeking beyond 2GB.

   4.3. Mksquashfs now detects file size changes between first phase directory scan and second phase filesystem create.

Regards

Phillip Lougher
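Item 1.2 of the change-log describes sparse files. As a rough userspace illustration (this is not Squashfs or Mksquashfs code), the sketch below creates a file containing a hole and reports its apparent size next to the space actually allocated for it. The helper name make_sparse_file() and the /tmp location are assumptions made for this example. Whether the hole really stays unallocated depends on the underlying filesystem, and the 512-byte unit for st_blocks is the Linux convention.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a temporary file that consists of a hole of `hole_bytes`
 * followed by one byte of real data. On success, store the apparent
 * size (st_size) and the allocated size (st_blocks * 512) and return 0.
 * Returns -1 on any failure.
 */
int make_sparse_file(off_t hole_bytes, off_t *apparent, off_t *allocated)
{
    char path[] = "/tmp/sparse-demo-XXXXXX";    /* hypothetical location */
    struct stat st;
    int fd;

    assert(apparent != NULL && allocated != NULL);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file disappears once fd is closed */

    /* Seeking past end-of-file and then writing leaves a hole behind. */
    if (lseek(fd, hole_bytes, SEEK_SET) == (off_t)-1 ||
        write(fd, "x", 1) != 1 ||
        fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    *apparent = st.st_size;                 /* hole plus the one data byte */
    *allocated = (off_t)st.st_blocks * 512; /* space actually backed by blocks */

    close(fd);
    return 0;
}
```

With a 1 Mbyte hole the apparent size is 1048577 bytes. On a filesystem that supports holes the allocated figure will typically be far smaller, and that gap is what a sparse-aware tool can avoid storing.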
Is it illegal to refer namespace_sem while inode's mutex held?
Hello.

I'm running my LSM module on kernel 2.6.23 / Debian Sarge. I encountered the following warning message. It seems that calling down_read(&namespace_sem) is not permitted while inode->i_mutex is held, but I'm not sure. Is it illegal to take namespace_sem while an inode's mutex is held?

===
[ INFO: possible circular locking dependency detected ]
2.6.23-tomoyo2.1 #27
---
rcS/1093 is trying to acquire lock:
 (&namespace_sem){}, at: [] m_start+0x11/0x20

but task is already holding lock:
 (&inode->i_mutex){--..}, at: [] open_namei+0xf2/0x522

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&inode->i_mutex){--..}:
 [] graft_tree+0x62/0xca
 [] check_prev_add+0xc4/0x1bc
 [] graft_tree+0x62/0xca
 [] check_prevs_add+0x56/0xcb
 [] validate_chain+0x2a2/0x31f
 [] __kernel_text_address+0x18/0x23
 [] dump_trace+0x6f/0x87
 [] __lock_acquire+0x6f2/0x762
 [] validate_chain+0x275/0x31f
 [] lock_acquire+0x79/0x93
 [] graft_tree+0x62/0xca
 [] __mutex_lock_slowpath+0xea/0x280
 [] graft_tree+0x62/0xca
 [] graft_tree+0x62/0xca
 [] do_add_mount+0x8a/0xe7
 [] do_mount+0x1a9/0x1c0
 [] __alloc_pages+0x64/0x2b6
 [] copy_mount_options+0x4d/0x97
 [] sys_mount+0x79/0xb5
 [] name_to_dev_t+0x4d/0x25d
 [] schedule_timeout+0x79/0x8d
 [] create_proc_entry+0x73/0x86
 [] process_timeout+0x0/0x5
 [] kernel_init+0x0/0xa3
 [] prepare_namespace+0x86/0x18e
 [] sys_access+0x1f/0x23
 [] kernel_init+0x99/0xa3
 [] kernel_thread_helper+0x7/0x10
 [] 0x

-> #0 (&namespace_sem){}:
 [] check_prev_add+0x27/0x1bc
 [] check_prevs_add+0x56/0xcb
 [] validate_chain+0x2a2/0x31f
 [] __lock_acquire+0x6f2/0x762
 [] __d_lookup+0xda/0xfa
 [] lock_acquire+0x79/0x93
 [] m_start+0x11/0x20
 [] down_read+0x3b/0x71
 [] m_start+0x11/0x20
 [] m_start+0x11/0x20
 [] tmy_do_single_write_perm+0x7e/0xda
 [] vfs_create+0x83/0x105
 [] open_namei_create+0x47/0x8a
 [] open_namei+0x15c/0x522
 [] do_filp_open+0x25/0x39
 [] _spin_unlock+0x14/0x1c
 [] get_unused_fd_flags+0xb0/0xba
 [] do_sys_open+0x44/0xc5
 [] sys_open+0x1a/0x1c
 [] syscall_call+0x7/0xb
 [] 0x

other info that might help us debug this:

1 lock held by rcS/1093:
 #0: (&inode->i_mutex){--..}, at: [] open_namei+0xf2/0x522

stack backtrace:
 [] print_circular_bug_tail+0x5f/0x67
 [] check_prev_add+0x27/0x1bc
 [] check_prevs_add+0x56/0xcb
 [] validate_chain+0x2a2/0x31f
 [] __lock_acquire+0x6f2/0x762
 [] __d_lookup+0xda/0xfa
 [] lock_acquire+0x79/0x93
 [] m_start+0x11/0x20
 [] down_read+0x3b/0x71
 [] m_start+0x11/0x20
 [] m_start+0x11/0x20
 [] tmy_do_single_write_perm+0x7e/0xda
 [] vfs_create+0x83/0x105
 [] open_namei_create+0x47/0x8a
 [] open_namei+0x15c/0x522
 [] do_filp_open+0x25/0x39
 [] _spin_unlock+0x14/0x1c
 [] get_unused_fd_flags+0xb0/0xba
 [] do_sys_open+0x44/0xc5
 [] sys_open+0x1a/0x1c
 [] syscall_call+0x7/0xb
===

The location is tmy_do_single_write_perm() (whose call trace is open_namei() -> open_namei_create() -> security_inode_create()) in the following file:
http://svn.sourceforge.jp/cgi-bin/viewcvs.cgi/trunk/2.1.x/tomoyo-lsm/patches/tomoyo-hooks.diff?rev=653&root=tomoyo&view=markup

Regards.
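One way to read the report above: chain entry #1 records i_mutex being taken in graft_tree() on the mount path, and entry #0 records namespace_sem being taken in m_start() while open_namei() already holds i_mutex. The sketch below is a userspace toy model of that kind of order check, not lockdep's implementation. The names lock_class, dep and record_acquire() are invented for illustration, and it assumes that the mount path already holds namespace_sem when graft_tree() takes i_mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Two lock classes standing in for the ones named in the report. */
enum lock_class { NAMESPACE_SEM, I_MUTEX, NR_CLASSES };

/* dep[a][b] is true once b has been acquired while a was held. */
static bool dep[NR_CLASSES][NR_CLASSES];

/* Depth-first search: can `to` be reached from `from` via recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < NR_CLASSES; next++)
        if (dep[from][next] && !seen[next] && reachable(next, to, seen))
            return true;
    return false;
}

/*
 * Record "acquire `wanted` while holding `held`".
 * Returns false, and records nothing, if some earlier path already
 * established wanted -> ... -> held, because the new edge would close
 * a cycle (the possible circular dependency).
 */
bool record_acquire(int held, int wanted)
{
    bool seen[NR_CLASSES] = { false };

    assert(held >= 0 && held < NR_CLASSES);
    assert(wanted >= 0 && wanted < NR_CLASSES);

    if (reachable(wanted, held, seen))
        return false;
    dep[held][wanted] = true;
    return true;
}
```

Under that assumption, record_acquire(NAMESPACE_SEM, I_MUTEX) models the mount path and is accepted, after which record_acquire(I_MUTEX, NAMESPACE_SEM) models the LSM hook and is rejected. Two tasks taking the same pair of locks in opposite orders can each end up waiting for the other, which is why the second order is flagged even if no deadlock actually occurred during the run.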
Re: migratepage failures on reiserfs
On (01/11/07 10:10), Badari Pulavarty didst pronounce:
> On Thu, 2007-11-01 at 11:51 -0400, Chris Mason wrote:
> > On Thu, 01 Nov 2007 08:38:57 -0800
> > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> >
> > > On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote:
> > > > On Wed, 31 Oct 2007 08:14:21 -0800
> > > > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > I tried data=writeback mode and it didn't help :(
> > > >
> > > > Ouch, so much for the easy way out.
> > > >
> > > > >
> > > > > unable to release the page 262070
> > > > > bh c000211b9408 flags 110029 count 1 private 0
> > > > > unable to release the page 262098
> > > > > bh c00020ec9198 flags 110029 count 1 private 0
> > > > > memory offlining 3f000 to 4 failed
> > > > >
> > > >
> > > > The only other special thing reiserfs does with the page cache is
> > > > file tails. I don't suppose all of these pages are index zero in
> > > > files smaller than 4k?
> > >
> > > Ah !! I am so blind :(
> > >
> > > I have been suspecting reiserfs all along, since its executing
> > > fallback_migrate_page(). Actually, these buffer heads are
> > > backing blockdev. I guess these are metadata buffers :(
> > > I am not sure we can do much with these..
> >
> > Hmpf, my first reply had a paragraph about the block device inode
> > pages, I noticed the phrase file data pages and deleted it ;)
> >
> > But, for the metadata buffers there's not much we can do. They are
> > included in a bunch of different lists and the patch would
> > be non-trivial.
>
> Unfortunately, these buffer pages are spread all around making
> those sections of memory non-removable. Of course, one can use
> ZONE_MOVABLE to make sure to guarantee the remove. But I am
> hoping we could easily group all these allocations and minimize
> spreading them around. Mel ?

The grow_dev_page() pages should be reclaimable even though migration is not supported for those pages? They were marked movable as it was useful for lumpy reclaim taking back pages for hugepage allocations and the like. Would it make sense for memory unremove to attempt migration first and reclaim second?

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab