Re: [RFC] pdirops: vfs patch
> Jan Blunck (JB) writes:

JB> Nope, d_alloc() is setting d_flags to DCACHE_UNHASHED. Therefore it is not found
JB> by __d_lookup() until it is rehashed, which is implicitly done by ->lookup().

That means two processes can each allocate a dentry for the same name. They'll
call ->lookup() each against their own dentry, fill them and hash them. So we'll
get two equal dentries. Is that OK?

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
Quoting Alex Tomas <[EMAIL PROTECTED]>:

> > Jan Blunck (JB) writes:
>
> JB> i_sem does NOT protect the dcache. Also not in real_lookup(). The lock must be
> JB> acquired for ->lookup() and because we might sleep on i_sem, we have to get it
> JB> early and check for repopulation of the dcache.
>
> dentry is part of the dcache, right? i_sem protects a dentry from being
> returned with incomplete data inside, right?

Nope, d_alloc() is setting d_flags to DCACHE_UNHASHED. Therefore it is not found
by __d_lookup() until it is rehashed, which is implicitly done by ->lookup().

Jan
Re: [RFC] pdirops: vfs patch
> Jan Blunck (JB) writes:

JB> i_sem does NOT protect the dcache. Also not in real_lookup(). The lock must be
JB> acquired for ->lookup() and because we might sleep on i_sem, we have to get it
JB> early and check for repopulation of the dcache.

dentry is part of the dcache, right? i_sem protects a dentry from being
returned with incomplete data inside, right?

thanks, Alex
Re: [RFC] pdirops: vfs patch
Quoting Alex Tomas <[EMAIL PROTECTED]>:

> > Jan Blunck (JB) writes:
>
> >> 1) i_sem protects the dcache too
>
> JB> Where? i_sem is the per-inode lock, and shouldn't be used for anything else.
>
> read the comments in fs/namei.c:real_lookup()

i_sem does NOT protect the dcache. Also not in real_lookup(). The lock must be
acquired for ->lookup() and because we might sleep on i_sem, we have to get it
early and check for repopulation of the dcache.

> we've already done this for ext3. it works.
> it speeds some loads up significantly,
> especially on big directories.
> and you can control this via a mount option,
> so there is almost zero cost for a fs that doesn't support pdirops.

Ok.

Jan
Re: [RFC] pdirops: vfs patch
> Jan Blunck (JB) writes:

>> 1) i_sem protects the dcache too

JB> Where? i_sem is the per-inode lock, and shouldn't be used for anything else.

read the comments in fs/namei.c:real_lookup()

>> 2) tmpfs has no "own" data, so we can use it this way (see 2nd patch)
>> 3) I have a pdirops patch for ext3, but it needs some cleaning ...

JB> I think you didn't get my point.
JB> 1) Your approach is duplicating the locking effort for a regular filesystem
JB> (like ext2):
JB>    a) locking with s_pdirops_sems
JB>    b) locking the low-level filesystem data
JB> It's cool that it speeds up tmpfs, but I don't think that this legitimates
JB> the doubled locking for every other filesystem.
JB> I'm not sure that it also increases performance for regular filesystems, if
JB> you do the locking right.

we've already done this for ext3. it works.
it speeds some loads up significantly,
especially on big directories.
and you can control this via a mount option,
so there is almost zero cost for a fs that doesn't support pdirops.

JB> 2) In my opinion, a superblock-wide semaphore array which allows 1024
JB> different (different names and different operations) accesses to ONE single
JB> inode (which is the data it should protect) is not a good idea.

yes, it has some weakness; i'm reworking the vfs patch to avoid inter-inode
collisions.

thanks, Alex
Re: [RFC] pdirops: vfs patch
Quoting Alex Tomas <[EMAIL PROTECTED]>:

> > 1) i_sem protects the dcache too

Where? i_sem is the per-inode lock, and shouldn't be used for anything else.

> 2) tmpfs has no "own" data, so we can use it this way (see 2nd patch)
> 3) I have a pdirops patch for ext3, but it needs some cleaning ...

I think you didn't get my point.

1) Your approach is duplicating the locking effort for a regular filesystem
(like ext2):
   a) locking with s_pdirops_sems
   b) locking the low-level filesystem data
It's cool that it speeds up tmpfs, but I don't think that this legitimates the
doubled locking for every other filesystem.
I'm not sure that it also increases performance for regular filesystems, if you
do the locking right.

2) In my opinion, a superblock-wide semaphore array which allows 1024
different (different names and different operations) accesses to ONE single
inode (which is the data it should protect) is not a good idea.

Jan
Re: [RFC] pdirops: vfs patch
> Jan Blunck (JB) writes:

JB> With luck you have s_pdirops_size (or 1024) different renames altering
JB> concurrently one directory inode. Therefore you need a lock protecting
JB> your filesystem data. This is basically the job done by i_sem. So in
JB> my opinion you only move "The Problem" from the VFS to the lowlevel
JB> filesystems. But then there is no need for i_sem or your
JB> s_pdirops_sems anymore.

1) i_sem protects the dcache too
2) tmpfs has no "own" data, so we can use it this way (see 2nd patch)
3) I have a pdirops patch for ext3, but it needs some cleaning ...

thanks, Alex
Re: [RFC] pdirops: vfs patch
Alex Tomas wrote:

+static inline struct semaphore * lock_sem(struct inode *dir, struct qstr *name)
+{
+	if (IS_PDIROPS(dir)) {
+		struct super_block *sb;
+
+		/* name->hash expected to be already calculated */
+		sb = dir->i_sb;
+		BUG_ON(sb->s_pdirops_sems == NULL);
+		return sb->s_pdirops_sems + name->hash % sb->s_pdirops_size;
+	}
+	return &dir->i_sem;
+}
+
+static inline void lock_dir(struct inode *dir, struct qstr *name)
+{
+	down(lock_sem(dir, name));
+}
+
@@ -1182,12 +1204,26 @@
 /*
  * p1 and p2 should be directories on the same fs.
  */
-struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
+struct dentry *lock_rename(struct dentry *p1, struct qstr *n1,
+			   struct dentry *p2, struct qstr *n2)
 {
 	struct dentry *p;
 
 	if (p1 == p2) {
-		down(&p1->d_inode->i_sem);
+		if (IS_PDIROPS(p1->d_inode)) {
+			unsigned int h1, h2;
+			h1 = n1->hash % p1->d_inode->i_sb->s_pdirops_size;
+			h2 = n2->hash % p2->d_inode->i_sb->s_pdirops_size;
+			if (h1 < h2) {
+				lock_dir(p1->d_inode, n1);
+				lock_dir(p2->d_inode, n2);
+			} else if (h1 > h2) {
+				lock_dir(p2->d_inode, n2);
+				lock_dir(p1->d_inode, n1);
+			} else
+				lock_dir(p1->d_inode, n1);
+		} else
+			down(&p1->d_inode->i_sem);
 		return NULL;
 	}
 
@@ -1195,31 +1231,35 @@
 	for (p = p1; p->d_parent != p; p = p->d_parent) {
 		if (p->d_parent == p2) {
-			down(&p2->d_inode->i_sem);
-			down(&p1->d_inode->i_sem);
+			lock_dir(p2->d_inode, n2);
+			lock_dir(p1->d_inode, n1);
 			return p;
 		}
 	}
 
 	for (p = p2; p->d_parent != p; p = p->d_parent) {
 		if (p->d_parent == p1) {
-			down(&p1->d_inode->i_sem);
-			down(&p2->d_inode->i_sem);
+			lock_dir(p1->d_inode, n1);
+			lock_dir(p2->d_inode, n2);
 			return p;
 		}
 	}
 
-	down(&p1->d_inode->i_sem);
-	down(&p2->d_inode->i_sem);
+	lock_dir(p1->d_inode, n1);
+	lock_dir(p2->d_inode, n2);
 	return NULL;
 }

With luck you have s_pdirops_size (or 1024) different renames altering
concurrently one directory inode. Therefore you need a lock protecting
your filesystem data. This is basically the job done by i_sem.

So in my opinion you only move "The Problem" from the VFS to the lowlevel
filesystems. But then there is no need for i_sem or your s_pdirops_sems
anymore.

Regards, Jan
Re: [RFC] pdirops: vfs patch
 fs/inode.c         |    1
 fs/namei.c         |   66 ++---
 include/linux/fs.h |   11
 3 files changed, 54 insertions(+), 24 deletions(-)

Index: linux-2.6.10/fs/namei.c
===
--- linux-2.6.10.orig/fs/namei.c	2005-01-28 19:32:13.0 +0300
+++ linux-2.6.10/fs/namei.c	2005-02-19 20:40:05.763437304 +0300
@@ -104,6 +104,28 @@
  * any extra contention...
  */
 
+static inline struct semaphore * lock_sem(struct inode *dir, struct qstr *name)
+{
+	if (IS_PDIROPS(dir)) {
+		struct super_block *sb;
+
+		/* name->hash expected to be already calculated */
+		sb = dir->i_sb;
+		BUG_ON(sb->s_pdirops_sems == NULL);
+		return sb->s_pdirops_sems + name->hash % sb->s_pdirops_size;
+	}
+	return &dir->i_sem;
+}
+
+static inline void lock_dir(struct inode *dir, struct qstr *name)
+{
+	down(lock_sem(dir, name));
+}
+
+static inline void unlock_dir(struct inode *dir, struct qstr *name)
+{
+	up(lock_sem(dir, name));
+}
+
 /* In order to reduce some races, while at the same time doing additional
  * checking and hopefully speeding things up, we copy filenames to the
  * kernel data space before using them..
@@ -380,7 +402,7 @@
 	struct dentry * result;
 	struct inode *dir = parent->d_inode;
 
-	down(&dir->i_sem);
+	lock_dir(dir, name);
 	/*
 	 * First re-do the cached lookup just in case it was created
 	 * while we waited for the directory semaphore..
@@ -406,7 +428,7 @@
 		else
 			result = dentry;
 	}
-	up(&dir->i_sem);
+	unlock_dir(dir, name);
 	return result;
 }
 
@@ -414,7 +436,7 @@
 	 * Uhhuh! Nasty case: the cache was re-populated while
 	 * we waited on the semaphore. Need to revalidate.
 	 */
-	up(&dir->i_sem);
+	unlock_dir(dir, name);
 	if (result->d_op && result->d_op->d_revalidate) {
 		if (!result->d_op->d_revalidate(result, nd) && !d_invalidate(result)) {
 			dput(result);
@@ -1182,12 +1204,26 @@
 /*
  * p1 and p2 should be directories on the same fs.
  */
-struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
+struct dentry *lock_rename(struct dentry *p1, struct qstr *n1,
+			   struct dentry *p2, struct qstr *n2)
 {
 	struct dentry *p;
 
 	if (p1 == p2) {
-		down(&p1->d_inode->i_sem);
+		if (IS_PDIROPS(p1->d_inode)) {
+			unsigned int h1, h2;
+			h1 = n1->hash % p1->d_inode->i_sb->s_pdirops_size;
+			h2 = n2->hash % p2->d_inode->i_sb->s_pdirops_size;
+			if (h1 < h2) {
+				lock_dir(p1->d_inode, n1);
+				lock_dir(p2->d_inode, n2);
+			} else if (h1 > h2) {
+				lock_dir(p2->d_inode, n2);
+				lock_dir(p1->d_inode, n1);
+			} else
+				lock_dir(p1->d_inode, n1);
+		} else
+			down(&p1->d_inode->i_sem);
 		return NULL;
 	}
 
@@ -1195,31 +1231,35 @@
 	for (p = p1; p->d_parent != p; p = p->d_parent) {
 		if (p->d_parent == p2) {
-			down(&p2->d_inode->i_sem);
-			down(&p1->d_inode->i_sem);
+			lock_dir(p2->d_inode, n2);
+			lock_dir(p1->d_inode, n1);
 			return p;
 		}
 	}
 
 	for (p = p2; p->d_parent != p; p = p->d_parent) {
 		if (p->d_parent == p1) {
-			down(&p1->d_inode->i_sem);
-			down(&p2->d_inode->i_sem);
+			lock_dir(p1->d_inode, n1);
+			lock_dir(p2->d_inode, n2);
 			return p;
 		}
 	}
 
-	down(&p1->d_inode->i_sem);
-	down(&p2->d_inode->i_sem);
+	lock_dir(p1->d_inode, n1);
+	lock_dir(p2->d_inode, n2);
 	return NULL;
 }
 
-void unlock_rename(struct dentry *p1, struct dentry *p2)
+void unlock_rename(struct dentry *p1, struct qstr *n1,
+		   struct dentry *p2, struct qstr *n2)
 {
-	up(&p1->d_inode->i_sem);
+	unlock_dir(p1->d_inode, n1);
 	if (p1 != p2) {
-		up(&p2->d_inode->i_sem);
+		unlock_dir(p2->d_inode, n2);
 		up(&p1->d_inode->i_sb->s_vfs_rename_sem);
+	} else if (IS_PDIROPS(p1->d_inode) &&
+		   n1->hash != n2->hash) {
+		unlock_dir(p2->d_inode, n2);
 	}
 }
 
@@ -1386,13 +1426,13 @@
 	dir = nd->dentry;
 	nd->flags &= ~LOOKUP_PARENT;
-	down(