Re: Massive slowdown when re-querying large nfs dir
On Tuesday November 6, [EMAIL PROTECTED] wrote:
> > On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi <[EMAIL PROTECTED]> wrote:
> > Al Boldi wrote:
> > > There is a massive (3-18x) slowdown when re-querying a large nfs dir
> > > (2k+ entries) using a simple ls -l.
> > >
> > > On 2.6.23 client and server running userland rpc.nfs.V2:
> > > first try:  time -p ls -l <2k+ entry dir> in ~2.5sec
> > > more tries: time -p ls -l <2k+ entry dir> in ~8sec
> > >
> > > first try:  time -p ls -l <5k+ entry dir> in ~9sec
> > > more tries: time -p ls -l <5k+ entry dir> in ~180sec
> > >
> > > On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> > > first try:  time -p ls -l <2k+ entry dir> in ~2.5sec
> > > more tries: time -p ls -l <2k+ entry dir> in ~7sec
> > >
> > > first try:  time -p ls -l <5k+ entry dir> in ~8sec
> > > more tries: time -p ls -l <5k+ entry dir> in ~43sec
> > >
> > > Remounting the nfs-dir on the client resets the problem.
> > >
> > > Any ideas?
> >
> > Ok, I played some more with this, and it turns out that nfsV3 is a lot
> > faster. But this does not explain why the 2.4.31 kernel is still over
> > 4 times faster than 2.6.23.
> >
> > Can anybody explain what's going on?
>
> Sure, Neil can! ;)

Nuh. He said "userland rpc.nfs.Vx". I only do "kernel-land NFS". In these
days of high specialisation, each line of code is owned by a different
person, and finding the right person is hard.

I would suggest getting a 'tcpdump -s0' trace and seeing (with wireshark)
what is different between the various cases.

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: writeout stalls in current -git
On 11/7/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's not synchronous writes that we are doing - we're just
> submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> The "synchronous" nature appears to be coming from higher level
> locking when reclaiming inodes (on the flush lock). It appears that
> inode write clustering is failing completely so we are writing the
> same block multiple times i.e. once for each inode in the cluster we
> have to write.

Works for me. The only remaining stalls are sub-second and look completely
valid, considering the number of files being removed.

iostat 10 from this test:
 3 0 0 3500192332 20495600 105 8512 1809 6473 6 10 83 1
 0 0 0 3500200332 20457600 0 4367 1355 3712 2 6 92 0
 2 0 0 3504264332 20352800 0 6805 1912 4967 4 8 88 0
 0 0 0 3511632332 20352800 0 2843 805 1791 2 4 94 0
 0 0 0 3516852332 20351600 0 3375 879 2712 3 5 93 0
 0 0 0 3530544332 20266800 186 776 488 1152 4 2 89 4
 0 0 0 3574788332 20496000 226 326 358 787 0 1 98 0
 0 0 0 3576820332 20496000 0 376 332 737 0 0 99 0
 0 0 0 3578432332 20496000 0 356 293 606 1 1 99 0
 0 0 0 3580192332 20496000 0 101 104 384 0 0 99 0

I'm pleased to note that this is now much faster again. Thanks!

Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

CCs please note: It looks like this was really a different problem than
the 100% iowait that was seen with reiserfs.

Also the one complete stall I have seen is probably something else. But I
have not been able to reproduce this again with -mm and have never seen
this on mainline, so I will just ignore that single event until I see it
again.
Torsten

> ---
>  fs/xfs/xfs_iget.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
> ===
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c    2007-11-02 13:44:46.0 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
> @@ -248,7 +248,7 @@ finish_inode:
>         icl = NULL;
>         if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
>                                 first_index, 1)) {
> -               if ((iq->i_ino & mask) == first_index)
> +               if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
>                         icl = iq->i_cluster;
>         }
Re: Massive slowdown when re-querying large nfs dir
> On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi <[EMAIL PROTECTED]> wrote:
> Al Boldi wrote:
> > There is a massive (3-18x) slowdown when re-querying a large nfs dir
> > (2k+ entries) using a simple ls -l.
> >
> > On 2.6.23 client and server running userland rpc.nfs.V2:
> > first try:  time -p ls -l <2k+ entry dir> in ~2.5sec
> > more tries: time -p ls -l <2k+ entry dir> in ~8sec
> >
> > first try:  time -p ls -l <5k+ entry dir> in ~9sec
> > more tries: time -p ls -l <5k+ entry dir> in ~180sec
> >
> > On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> > first try:  time -p ls -l <2k+ entry dir> in ~2.5sec
> > more tries: time -p ls -l <2k+ entry dir> in ~7sec
> >
> > first try:  time -p ls -l <5k+ entry dir> in ~8sec
> > more tries: time -p ls -l <5k+ entry dir> in ~43sec
> >
> > Remounting the nfs-dir on the client resets the problem.
> >
> > Any ideas?
>
> Ok, I played some more with this, and it turns out that nfsV3 is a lot
> faster. But, this does not explain why the 2.4.31 kernel is still over
> 4-times faster than 2.6.23.
>
> Can anybody explain what's going on?

Sure, Neil can! ;)
Re: writeout stalls in current -git
On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
> On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> > On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > > Rather than vmstat, can you use something like iostat to show how
> > > busy your disks are? i.e. are we seeing RMW cycles in the raid5 or
> > > some such issue.
> >
> > Both "vmstat 10" and "iostat -x 10" output from this test:
> > procs ---memory-- ---swap-- -io -system-- cpu
> >  r b swpd free buff cache si so bi bo in cs us sy id wa
> >  2 0 0 3700592 0 85424003183 108 244 2 1 95 1
> > -> emerge reads something, don't know for sure what...
> >  1 0 0 3665352 0 8794000 239 2 343 585 2 1 97 0
> >
> > The last 20% of the btrace look more or less completely like this, no
> > other programs do any IO...
> >
> > 253,03 104626 526.293450729 974 C WS 79344288 + 8 [0]
> > 253,03 104627 526.293455078 974 C WS 79344296 + 8 [0]
> > 253,0136469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
> > 253,0136470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
>                                      ^^
> Apparently we are doing synchronous writes. That would explain why
> it is slow. We shouldn't be doing synchronous writes here. I'll see if
> I can reproduce this.
>
> Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
> look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bios tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher-level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely, so we are writing the
same block multiple times, i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change, as we haven't changed
anything in the reclaim code recently.

/me scurries off to run some tests

Indeed it is.
The patch below should fix the problem - the inode clusters weren't
getting set up properly when inodes were being read in or allocated.
This is a regression, introduced by this mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_iget.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c    2007-11-02 13:44:46.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
        icl = NULL;
        if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
                                first_index, 1)) {
-               if ((iq->i_ino & mask) == first_index)
+               if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
                        icl = iq->i_cluster;
        }
[PATCH 3/4] add rwmem type backed by pages
This lets callers specify a region of memory to read from or write to
with an array of page/offset/len tuples. There have been specific
requests to do this from servers which want to do O_DIRECT from the
kernel. (knfsd?)

This could also be used by places which currently hold a kmap() and call
fop->write. ecryptfs_write_lower_page_segment() is one such caller.

---
 fs/rwmem.c            | 66 +
 include/linux/rwmem.h | 16
 2 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/fs/rwmem.c b/fs/rwmem.c
index 0433ba4..c87e8a4 100644
--- a/fs/rwmem.c
+++ b/fs/rwmem.c
@@ -90,3 +90,69 @@ struct rwmem_ops rwmem_iovec_ops = {
        .seg_bytes      = rwmem_iovec_seg_bytes,
        .get_seg_pages  = rwmem_iovec_get_seg_pages,
 };
+
+void rwmem_pages_init(struct rwmem *rwm)
+{
+       struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+       struct pgol *pgol;
+       unsigned long i;
+
+       rwm->total_bytes = 0;
+       rwm->nr_pages = rwm->nr_segs;
+       rwm->boundary_bits = 0;
+
+       for (i = 0; i < rwm->nr_segs; i++) {
+               pgol = &rwp->pgol[i];
+
+               rwm->total_bytes += pgol->len;
+               rwm->boundary_bits |= pgol->offset | pgol->len;
+       }
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_pages_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+       struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+       BUG_ON(i >= rwm->nr_segs);
+       return rwp->pgol[i].offset;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_pages_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+       struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+       BUG_ON(i >= rwm->nr_segs);
+       return rwp->pgol[i].len;
+}
+
+/*
+ * For now each page is its own seg.
+ */
+int rwmem_pages_get_seg_pages(struct rwmem *rwm, unsigned long i,
+                             unsigned long *cursor, struct page **pages,
+                             unsigned long max_pages, int write)
+{
+       struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+       int ret = 0;
+
+       BUG_ON(i >= rwm->nr_segs);
+       BUG_ON(*cursor != 0);
+
+       if (max_pages) {
+               pages[0] = rwp->pgol[i].page;
+               get_page(pages[0]);
+               ret = 1;
+       }
+       return ret;
+}
+
+struct rwmem_ops rwmem_pages_ops = {
+       .init           = rwmem_pages_init,
+       .seg_page_offset= rwmem_pages_seg_page_offset,
+       .seg_bytes      = rwmem_pages_seg_bytes,
+       .get_seg_pages  = rwmem_pages_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
index 666f9f4..47019f0 100644
--- a/include/linux/rwmem.h
+++ b/include/linux/rwmem.h
@@ -26,4 +26,20 @@ struct rwmem_iovec {
 };
 struct rwmem_ops rwmem_iovec_ops;
 
+/*
+ * How many times do we need this in subsystems before we make a universal
+ * struct? (bio_vec, skb_frag_struct, pipe_buffer)
+ */
+struct pgol {
+       struct page     *page;
+       unsigned int    offset;
+       unsigned int    len;
+};
+
+struct rwmem_pages {
+       struct rwmem    rwmem;
+       struct pgol     *pgol;
+};
+struct rwmem_ops rwmem_pages_ops;
+
 #endif
--
1.5.2.2
[PATCH 1/4] struct rwmem: an abstraction of the memory argument to read/write
This adds a structure and interface to represent the segments of memory
which are acting as the source or destination for a read or write
operation. Callers would fill this structure and then pass it down the
rw path. The intent is to let stages in the rw path make specific calls
against this API and structure instead of working with, say, struct
iovec natively.

The main intent of this is to enable kernel calls into the rw path which
specify memory with page/offset/len tuples.

Another potential benefit of this is the reduction in iterations over
iovecs at various points in the kernel. Each iov_length(iov) call, for
example, could be translated into rwm->total_bytes. O_DIRECT's check of
memory alignment is changed into a single test against
rwm->boundary_bits.

I imagine this might integrate well with the iov_iter interface, though
I haven't examined that in any depth.

---
 fs/Makefile           | 2 +-
 fs/rwmem.c            | 92 +
 include/linux/rwmem.h | 29 +++
 3 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 fs/rwmem.c
 create mode 100644 include/linux/rwmem.h

diff --git a/fs/Makefile b/fs/Makefile
index 500cf15..c342365 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
        attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
        seq_file.o xattr.o libfs.o fs-writeback.o \
        pnode.o drop_caches.o splice.o sync.o utimes.o \
-       stack.o
+       stack.o rwmem.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/rwmem.c b/fs/rwmem.c
new file mode 100644
index 000..0433ba4
--- /dev/null
+++ b/fs/rwmem.c
@@ -0,0 +1,92 @@
+#include
+#include
+#include
+#include
+#include
+
+static inline unsigned long pages_spanned(unsigned long addr,
+                                         unsigned long bytes)
+{
+       return ((addr + bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) -
+               (addr >> PAGE_SHIFT);
+}
+
+void rwmem_iovec_init(struct rwmem *rwm)
+{
+       struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+       struct iovec *iov;
+       unsigned long i;
+
+       rwm->total_bytes = 0;
+       rwm->nr_pages = 0;
+       rwm->boundary_bits = 0;
+
+       for (i = 0; i < rwm->nr_segs; i++) {
+               iov = &rwi->iov[i];
+
+               rwm->total_bytes += iov->iov_len;
+               rwm->nr_pages += pages_spanned((unsigned long)iov->iov_base,
+                                              iov->iov_len);
+               rwm->boundary_bits |= (unsigned long)iov->iov_base |
+                                     (unsigned long)iov->iov_len;
+       }
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_iovec_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+       struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+       BUG_ON(i >= rwm->nr_segs);
+       return (unsigned long)rwi->iov[i].iov_base & ~PAGE_MASK;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_iovec_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+       struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+       BUG_ON(i >= rwm->nr_segs);
+       return rwi->iov[i].iov_len;
+}
+
+int rwmem_iovec_get_seg_pages(struct rwmem *rwm, unsigned long i,
+                             unsigned long *cursor, struct page **pages,
+                             unsigned long max_pages, int write)
+{
+       struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+       struct iovec *iov;
+       int ret;
+
+       BUG_ON(i >= rwm->nr_segs);
+       iov = &rwi->iov[i];
+
+       if (*cursor == 0)
+               *cursor = (unsigned long)iov->iov_base;
+
+       max_pages = min(pages_spanned(*cursor, iov->iov_len -
+                               (*cursor - (unsigned long)iov->iov_base)),
+                       max_pages);
+
+       down_read(&current->mm->mmap_sem);
+       ret = get_user_pages(current, current->mm, *cursor, max_pages, write,
+                            0, pages, NULL);
+       up_read(&current->mm->mmap_sem);
+
+       if (ret > 0) {
+               *cursor += ret * PAGE_SIZE;
+               if (*cursor >= (unsigned long)iov->iov_base + iov->iov_len)
+                       *cursor = ~0;
+       }
+
+       return ret;
+}
+
+struct rwmem_ops rwmem_iovec_ops = {
+       .init           = rwmem_iovec_init,
+       .seg_page_offset= rwmem_iovec_seg_page_offset,
+       .seg_bytes      = rwmem_iovec_seg_bytes,
+       .get_seg_pages  = rwmem_iovec_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
new file mode 100644
index 000..666f9f4
--- /dev/null
+++ b/include/linux/rwmem.h
@@ -0,0 +1,29 @@
+#ifndef
[PATCH 2/4] dio: use rwmem to work with r/w memory arguments
This switches dio to work with the rwmem api to get memory pages for the
IO instead of working with iovecs directly. It can use direct rwm struct
accesses for some static universal properties of a set of memory
segments that make up the buffer argument. It uses helper functions to
work with the underlying data structures directly.

---
 fs/direct-io.c | 123 +---
 1 files changed, 55 insertions(+), 68 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index acf0da1..0d5ed41 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -34,7 +34,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 
 /*
@@ -105,11 +105,12 @@ struct dio {
        sector_t cur_page_block;        /* Where it starts */
 
        /*
-        * Page fetching state. These variables belong to dio_refill_pages().
+        * Page fetching state. direct_io_worker() sets these for
+        * dio_refill_pages() who modifies them as it fetches.
         */
-       int curr_page;                  /* changes */
-       int total_pages;                /* doesn't change */
-       unsigned long curr_user_address;/* changes */
+       struct rwmem *rwm;
+       unsigned long cur_seg;
+       unsigned long cur_seg_cursor;
 
        /*
         * Page queue. These variables belong to dio_refill_pages() and
@@ -146,21 +147,11 @@ static inline unsigned dio_pages_present(struct dio *dio)
  */
 static int dio_refill_pages(struct dio *dio)
 {
+       struct rwmem *rwm = dio->rwm;
        int ret;
-       int nr_pages;
-
-       nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-       down_read(&current->mm->mmap_sem);
-       ret = get_user_pages(
-               current,                /* Task for fault acounting */
-               current->mm,            /* whose pages? */
-               dio->curr_user_address, /* Where from? */
-               nr_pages,               /* How many pages? */
-               dio->rw == READ,        /* Write to memory? */
-               0,                      /* force (?) */
-               &dio->pages[0],
-               NULL);                  /* vmas */
-       up_read(&current->mm->mmap_sem);
+
+       ret = rwm->ops->get_seg_pages(rwm, dio->cur_seg, &dio->cur_seg_cursor,
+                                     dio->pages, DIO_PAGES, dio->rw == READ);
 
        if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
                struct page *page = ZERO_PAGE(0);
@@ -180,8 +171,6 @@ static int dio_refill_pages(struct dio *dio)
        }
 
        if (ret >= 0) {
-               dio->curr_user_address += ret * PAGE_SIZE;
-               dio->curr_page += ret;
                dio->head = 0;
                dio->tail = ret;
                ret = 0;
@@ -938,11 +927,9 @@ out:
  */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
-       const struct iovec *iov, loff_t offset, unsigned long nr_segs,
-       unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
-       struct dio *dio)
+       struct rwmem *rwm, loff_t offset, unsigned blkbits,
+       get_block_t get_block, dio_iodone_t end_io, struct dio *dio)
 {
-       unsigned long user_addr;
        unsigned long flags;
        int seg;
        ssize_t ret = 0;
@@ -966,44 +953,33 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
        spin_lock_init(&dio->bio_lock);
        dio->refcount = 1;
 
+       dio->rwm = rwm;
+
        /*
         * In case of non-aligned buffers, we may need 2 more
         * pages since we need to zero out first and last block.
         */
+       dio->pages_in_io = rwm->nr_pages;
        if (unlikely(dio->blkfactor))
-               dio->pages_in_io = 2;
-
-       for (seg = 0; seg < nr_segs; seg++) {
-               user_addr = (unsigned long)iov[seg].iov_base;
-               dio->pages_in_io +=
-                       ((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE
-                               - user_addr/PAGE_SIZE);
-       }
+               dio->pages_in_io += 2;
 
-       for (seg = 0; seg < nr_segs; seg++) {
-               user_addr = (unsigned long)iov[seg].iov_base;
-               dio->size += bytes = iov[seg].iov_len;
+       for (seg = 0; seg < rwm->nr_segs; seg++) {
+               dio->size += bytes = rwm->ops->seg_bytes(rwm, seg);
 
                /* Index into the first page of the first block */
-               dio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
+               dio->first_block_in_page =
+                       rwm->ops->seg_page_offset(rwm, seg) >> blkbits;
                dio->final_block_in_request = dio->block_in_file +
                                                (bytes >> blkbits);
                /* Page fetching state */
+               dio->cur_seg = seg;
+               dio->cur_seg_cursor = 0;
                dio->head = 0;
                dio->tail = 0;
-
[PATCH 4/4] add dio interface for page/offset/len tuples
This is what it might look like to feed pgol in to some part of the fs
stack instead of iovecs. I imagine we'd want to do it at a much higher
level, perhaps something like vfs_write_pages().

---
 fs/direct-io.c | 21 +
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0d5ed41..e86bcbc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1213,3 +1213,24 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 }
 
 EXPORT_SYMBOL(__blockdev_direct_IO);
+
+ssize_t
+__blockdev_direct_IO_pages(int rw, struct kiocb *iocb, struct inode *inode,
+       struct block_device *bdev, struct pgol *pgol, loff_t offset,
+       unsigned long nr_pages, get_block_t get_block, dio_iodone_t end_io,
+       int dio_lock_type)
+{
+       struct rwmem_pages rwp = {
+               .rwmem.ops = &rwmem_pages_ops,
+               .rwmem.nr_segs = nr_pages,
+               .pgol = pgol,
+       };
+       struct rwmem *rwm = &rwp.rwmem;
+
+       rwm->ops->init(rwm);
+
+       return blockdev_direct_IO_rwmem(rw, iocb, inode, bdev, rwm, offset,
+                                       get_block, end_io, dio_lock_type);
+}
+
+EXPORT_SYMBOL(__blockdev_direct_IO_pages);
--
1.5.2.2
[RFC] fs io with struct page instead of iovecs
At the FS meeting at LCE there was some talk of doing O_DIRECT writes
from the kernel with pages instead of with iovecs. This patch series
explores one direction we could head in to achieve this.

We obviously can't just translate user iovecs (which might represent
more memory than the machine has) to pinned pages and pass page struct
pointers all the way down the stack. And we don't want to duplicate a
lot of the generic FS machinery down a duplicate path which works with
pages instead of iovecs.

So this patch series heads in the direction of abstracting out the
memory argument to the read and write calls. It's still based on
segments, but we hide whether it's iovecs or arrays of page pointers
from the callers in the rw stack.

I didn't go too nuts with the syntactic sugar. This is just intended to
show the basic mechanics. We can obviously pretty it up if we think this
is a sane thing to be doing. The series builds but has never been run.

What do people (that's you, Christoph) think? I'm flexible.
Re: writeout stalls in current -git
On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Rather than vmstat, can you use something like iostat to show how busy
> > your disks are? i.e. are we seeing RMW cycles in the raid5 or some
> > such issue.
>
> Both "vmstat 10" and "iostat -x 10" output from this test:
> procs ---memory-- ---swap-- -io -system-- cpu
>  r b swpd free buff cache si so bi bo in cs us sy id wa
>  2 0 0 3700592 0 85424003183 108 244 2 1 95 1
> -> emerge reads something, don't know for sure what...
>  1 0 0 3665352 0 8794000 239 2 343 585 2 1 97 0
>
> The last 20% of the btrace look more or less completely like this, no
> other programs do any IO...
>
> 253,03 104626 526.293450729 974 C WS 79344288 + 8 [0]
> 253,03 104627 526.293455078 974 C WS 79344296 + 8 [0]
> 253,0136469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
> 253,0136470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
                                     ^^
Apparently we are doing synchronous writes. That would explain why it is
slow. We shouldn't be doing synchronous writes here. I'll see if I can
reproduce this.

Yes, I can reproduce the sync writes coming out of xfssyncd. I'll look
into this further and send a patch when I have something concrete.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: writeout stalls in current -git
On 11/6/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> --
> Subject: writeback: remove pages_skipped accounting in __block_write_full_page()
> From: Fengguang Wu <[EMAIL PROTECTED]>
>
> Miklos Szeredi <[EMAIL PROTECTED]> and me identified a writeback bug:
[sni]
> fs/buffer.c |1 -
> fs/xfs/linux-2.6/xfs_aops.c |5 ++---
> 2 files changed, 2 insertions(+), 4 deletions(-)

I have now tested v2.6.24-rc1-748-g2655e2c with the above patch
reverted. This does still stall.

On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Rather than vmstat, can you use something like iostat to show how busy
> your disks are? i.e. are we seeing RMW cycles in the raid5 or some such
> issue.

Both "vmstat 10" and "iostat -x 10" output from this test:

procs ---memory-- ---swap-- -io -system-- cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
 2 0 0 3700592 0 85424003183 108 244 2 1 95 1
-> emerge reads something, don't know for sure what...
 1 0 0 3665352 0 8794000 239 2 343 585 2 1 97 0
 0 0 0 3657728 0 9122800 32235 445 833 0 0 99 0
 1 0 0 3653136 0 9469200 33033 455 844 1 1 98 0
 0 0 0 3646836 0 9772000 289 3 422 751 1 1 98 0
 0 0 0 3616468 0 9969200 18533 399 614 9 3 87 1
-> starts to remove the kernel tree
 0 0 0 3610452 0 10259200 138 3598 1398 3945 3 6 90 1
 0 0 0 3607136 0 10454800 2 5962 1919 6070 4 9 87 0
 0 0 0 3606636 0 10508000 0 1539 810 2200 1 2 97 0
-> first stall 28 sec.
 0 0 0 3606592 0 10529200 0 698 679 1390 0 1 99 0
 0 0 0 3606440 0 10553200 0 658 690 1457 0 0 99 0
 0 0 0 3606068 0 10612800 1 1780 947 1982 1 3 96 0
-> second stall 24 sec.
 0 0 0 3606036 0 10646400 4 858 758 1457 0 1 98 0
 0 0 0 3605380 0 10687200 0 1173 807 1880 1 2 97 0
 0 0 0 3605000 0 10774800 1 2413 1103 2996 2 4 94 0
-> third stall 38 sec.
 0 0 0 3604488 0 1084720045 897 748 1577 0 1 98 0
 0 0 0 3604176 0 10876400 0 824 752 1700 0 1 98 0
 0 0 0 3604012 0 10898800 0 660 643 1237 0 1 99 0
 0 0 0 3608936 0 11012000 1 3490 1232 3455 3 5 91 0
-> fourth stall 64 sec.
 1 0 0 3609060 0 11029600 0 568 669 1222 0 1 99 0
 0 0 0 3609464 0 11049600 0 604 638 1366 0 1 99 0
 0 0 0 3609244 0 11074000 0 844 714 1282 0 1 99 0
 0 0 0 3609508 0 11091200 0 552 584 1185 1 1 99 0
 2 0 0 3609436 0 3200 0 658 643 1442 0 1 99 0
 0 0 0 3609212 0 11134800 0 714 637 1382 0 0 99 0
 0 0 0 3619132 0 11049200 130 1086 736 1870 4 3 91 2
 0 0 0 3657016 0 11549600 466 589 718 1367 1 1 98 0
-> emerge finishes, dirty data was the whole time <1Mb, stays now below
   300kb (btrace running...)
 0 0 0 3657844 0 11566000 0 564 635 1226 1 1 99 0
 0 0 0 3658236 0 11584000 0 582 600 1248 1 0 99 0
 0 0 0 3658296 0 11601200 0 566 606 1232 1 1 99 0
 0 0 0 3657924 0 11621200 0 688 596 1321 1 0 99 0
 0 0 0 3658252 0 11641600 0 631 642 1356 1 0 98 0
 0 0 0 3658184 0 11659200 0 566 575 1273 0 0 99 0
procs ---memory-- ---swap-- -io -system-- cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
 2 0 0 3658344 0 11677200 0 649 606 1301 0 0 99 0
 0 0 0 3658548 0 11697600 0 617 624 1345 0 0 99 0
 0 0 0 3659204 0 11716000 0 550 576 1223 1 1 99 0
 0 0 0 3659944 0 11734400 0 620 583 1272 0 0 99 0
 0 0 0 3660548 0 11754000 0 605 611 1338 0 0 99 0
 0 0 0 3661236 0 11773200 0 582 569 1275 0 0 99 0
 0 0 0 3662420 0 11788800 0 590 571 1157 0 0 99 0
 0 0 0 3664324 0 11806800 0 566 553 1222 1 1 99 0
 0 0 0 3665240 0 11816800 0 401 574 862 0 0 99 0
 0 0 0 3666984 0 11828000 0 454 574 958 1 1 99 0
 0 0 0 3668664 0 11840000 0 396 559 946 0 0 99 0
 0 0 0 3670628 0 11849600 0 296 495 784 0 0 99 0
 0 0 0 3671316 0 118496
cramfs in big endian
Hi!

I'm currently trying to enable cramfs to mount filesystems with a
different endianness. All I have is an Intel-compatible machine that
produces cramfs images in little-endian format.

I found a modified version of mkcramfs that is able to produce images in
big-endian mode. Unfortunately, it seems to use a different compression
algorithm (LZO) than the standard implementation (zlib).

I would be very glad if somebody could send me an image in big-endian
mode using the zlib deflate compression algorithm. I have uploaded a bz2
archive containing some testfiles. You can find it at [1].

Regards,

Andi

[1] http://drebesium.org/~hackbert/cramfs-testfiles.tar.bz2
Re: writeout stalls in current -git
On 11/6/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:
>
> > I'm struggling to understand what possibly changed in XFS or writeback
> > that would lead to stalls like this, esp. as you appear to be removing
> > files when the stalls occur.
>
> Just a crazy idea...
>
> Could there be a set_page_dirty() that doesn't have a
> balance_dirty_pages() call near? For example, modifying metadata in
> unlink?
>
> Such a situation could lead to an excess of dirty pages, and the next
> call to balance_dirty_pages() would appear to stall, as it would
> desperately try to get below the limit again.

Only if accounting of the dirty pages is also broken. In the unmerge
testcase I see most of the time only <200kb of dirty data in
/proc/meminfo. The system has 4Gb of RAM, so I'm not sure it should ever
be valid to stall even in the emerge/install testcase.

Torsten

Now building a kernel with the skipped-pages-accounting patch reverted...
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.pa tch added to -mm tree
> On Tue, 6 Nov 2007 12:30:38 +0100 Jörn Engel <[EMAIL PROTECTED]> wrote:
> On Tue, 6 November 2007 10:11:49 +0100, Jan Blunck wrote:
> > On Mon, Nov 05, Jörn Engel wrote:
> > >
> > > This patch changes some 400 lines, most if not all of which get
> > > longer and more complicated to read. 23 get sufficiently longer to
> > > require an additional linebreak. I can't remember complexity being
> > > invited into the kernel without good reasoning, yet the patch
> > > description is surprisingly low on reasoning:
> > >
> > > Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
> >
> > I don't measure complexity by lines of code or length of lines. Maybe
> > I was not verbose enough in the description, fair.
>
> If you have a better metric, please share it. In the paragraph you
> deleted I explicitly asked for _any_ metric that shows favorable
> numbers. Lacking numbers, we could only argue about our respective
> personal taste.
>
> > This is a cleanup series. In mostly no case is there a reason why
> > someone would want to use a dentry by itself. This series reflects
> > that fact in nameidata, where there is absolutely no reason at all.
>
> 400+ lines changed in this patch, some 10 in a followup patch that
> combines dentry/vfsmount assignments into a single path assignment. If
> your argument above was valid, I would expect more simplifications and
> fewer complications. Call me a sceptic until further patches show up to
> support your point.
>
> > It enforced the correct order of getting/releasing refcount on pairs.
>
> This argument I buy.
>
> > It enables us to do some more cleanups wrt lookup (which are coming
> > later).
>
> Please send those patches. I invite cleanups that do clean things up
> and won't argue against them. ;)
>
> > For stacking support in VFS it is essential to have the pair in every
> > place where you want to traverse the stack.
>
> True, but unrelated to this patch.
>
> > > > If churn is the only effect of this, please consider it NAKed
> > > > again.
> >
> > I wonder why you didn't speak up when this series was posted to LKML.
> > It was posted at least three times before.
>
> I did speak up. Once. If you missed that thread, please forgive me
> missing those in which the same patch I disapproved of was resent
> without me on Cc.
>
> I'm not categorically against this struct path business. It does have
> some advantages at first glance. But the patch we're arguing about
> clearly makes code more complicated and harder to read. We should have
> more than superficial benefits if we decide to pay such a cost.

It sounds like we at least need a better overall changelog, please..
Re: writeout stalls in current -git
On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:
> I'm struggling to understand what possibly changed in XFS or writeback
> that would lead to stalls like this, esp. as you appear to be removing
> files when the stalls occur.

Just a crazy idea: could there be a set_page_dirty() that doesn't have a
balance_dirty_pages() call nearby? For example, metadata being modified
during unlink?

Such a situation could lead to an excess of dirty pages, and the next call
to balance_dirty_pages() would appear to stall, as it would desperately try
to get below the limit again.
Re: qstr abuse in git-cifs
> I suspect that bad things are happening in there

I doubt that it is too bad, since various filesystems (including JFS and
CIFS) have always written over qstr->name in these dentries in order to
better handle case-insensitive compares. Other than the recent attempt to
remove the compiler warning (mainline throws the same warning), I didn't
see what changed.

I agree that it looks strange, but I also don't see a trivial way to change
it. Any ideas on how to prevent the case of an existing negative dentry
from taking precedence? Creating more dentries for these duplicates would
seem like a worse idea.

> fs/cifs/dir.c: In function 'cifs_ci_compare':
> fs/cifs/dir.c:596: warning: passing argument 1 of '__constant_memcpy'
> discards qualifiers from pointer target type
> fs/cifs/dir.c:596: warning: passing argument 1 of '__memcpy' discards
> qualifiers from pointer target type

--
Thanks,

Steve
Re: Problem with accessing namespace_sem from LSM.
Hello.

Christoph Hellwig wrote:
> Any code except VFS internals has no business using it at all and doesn't
> do that in mainline either.  I'd start looking for design bugs in whatever
> code you have using it first.

Isn't security_inode_create() a part of VFS internals? I think it is,
because it is called from vfs_create().

Regards.
[PATCH] IGET: Use ino_t consistently for vxfs_iget() and its declaration
Use ino_t consistently for vxfs_iget() and its declaration, rather than
using ino_t for one and unsigned long for the other.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/freevxfs/vxfs_extern.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/freevxfs/vxfs_extern.h b/fs/freevxfs/vxfs_extern.h
index f307694..2b46064 100644
--- a/fs/freevxfs/vxfs_extern.h
+++ b/fs/freevxfs/vxfs_extern.h
@@ -58,7 +58,7 @@ extern struct inode *		vxfs_get_fake_inode(struct super_block *,
 extern void			vxfs_put_fake_inode(struct inode *);
 extern struct vxfs_inode_info *	vxfs_blkiget(struct super_block *, u_long, ino_t);
 extern struct vxfs_inode_info *	vxfs_stiget(struct super_block *, ino_t);
-extern struct inode *		vxfs_iget(struct super_block *, unsigned long);
+extern struct inode *		vxfs_iget(struct super_block *, ino_t);
 extern void			vxfs_clear_inode(struct inode *);

 /* vxfs_lookup.c */
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.pa tch added to -mm tree
Hello Jan Blunck,

Jan Blunck:
> I start sending out the patches in multiple chunks because nobody reviewed
> the union mount series except for coding style violations. So this is the
> prework for the changes that come with my union mount series. So they are
> related but not a part of the union mount patch series. It seems that
> people tend to like the patch series with small changes for itself instead
> of a big fat series.

I've read your union mount code which was posted at the end of last July.
Here is a comment which I remember now.

Whiteouts in your code can cause serious memory pressure, since they are
kept in the dcache. I know the inode for whiteouts exists only once and is
shared, but dentries for whiteouts are not: they are created for each name
and are resident in the dcache. I am afraid this can easily become a
problem when you create and unlink a temporary file many times, since such
filenames are generally unique.

Regarding struct path in nameidata, I have no objection basically. But I
think it is better to create macros for backward compatibility, as struct
file did.

Junjiro Okajima
Re: Problem with accessing namespace_sem from LSM.
On Tue, Nov 06, 2007 at 01:00:41PM +0900, Tetsuo Handa wrote: > Hello. > > I found that accessing namespace_sem from security_inode_create() > causes lockdep warning when compiled with CONFIG_PROVE_LOCKING=y . Any code except VFS internals has no business using it at all and doesn't do that in mainline either. I'd start looking for design bugs in whatever code you have using it first. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.pa tch added to -mm tree
On Tue, 6 November 2007 13:59:03 +0100, Jan Blunck wrote: > > Coming back to the mental stuff: the savings of the first bunch of patches > that already hit -mm: > > Textsize without patches: 0x20e572 > Textsize with patches:0x20e042 > -- > 0x530 = 1328 bytes Ok, that is more substantial than mental masturbation. Thank you for the numbers. All my criticism becomes void with this. Jörn -- Victory in war is not repetitious. -- Sun Tzu - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.pa tch added to -mm tree
On Tue, Nov 06, Jörn Engel wrote:
> > This is a cleanup series. In mostly no case is there a reason why someone
> > would want to use a dentry by itself. This series reflects that fact in
> > nameidata, where there is absolutely no reason at all.
>
> 400+ lines changed in this patch, some 10 in a followup patch that
> combines dentry/vfsmount assignments into a single path assignment.  If
> your argument above was valid, I would expect more simplifications and
> fewer complications.  Call me a sceptic until further patches show up to
> support your point.

These are the patches currently in -mm related to the struct path cleanup:

move-struct-path-into-its-own-header.patch
embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch
introduce-path_put.patch
use-path_put-in-a-few-places-instead-of-mntdput.patch
introduce-path_get.patch
use-struct-path-in-fs_struct.patch
make-set_fs_rootpwd-take-a-struct-path.patch
introduce-path_get-unionfs.patch
embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs.patch
introduce-path_put-unionfs.patch
one-less-parameter-to-__d_path.patch
d_path-kerneldoc-cleanup.patch
d_path-use-struct-path-in-struct-avc_audit_data.patch
d_path-make-proc_get_link-use-a-struct-path-argument.patch
d_path-make-get_dcookie-use-a-struct-path-argument.patch
use-struct-path-in-struct-svc_export.patch
use-struct-path-in-struct-svc_expkey.patch
d_path-make-seq_path-use-a-struct-path-argument.patch
d_path-make-d_path-use-a-struct-path.patch

> > It enables us
> > to do some more cleanups wrt lookup (which are coming later).
>
> Please send those patches.  I invite cleanups that do clean things up
> and won't argue against them. ;)

I'll send them in a later series.

> > For stacking
> > support in VFS it is essential to have the <dentry,vfsmount> pair in
> > every place where you want to traverse the stack.
>
> True, but unrelated to this patch.

I started sending out the patches in multiple chunks because nobody
reviewed the union mount series except for coding style violations. So this
is the prework for the changes that come with my union mount series. They
are related, but not a part of the union mount patch series itself. It
seems that people prefer a series of small self-contained changes to one
big fat series.

> > I wonder why you didn't speak up when this series was posted to LKML. It
> > was at least posted three times before.
>
> I did speak up.  Once.  If you missed that thread, please forgive me
> missing those in which the same patch I disapproved of were resent
> without me on Cc.

Sorry for missing your feedback, but now I found your mail ("mental
masturbation that complicates the source"). I guess this is what happens
when multiple people start posting the same patch series.

Coming back to the mental stuff: the savings of the first bunch of patches
that already hit -mm:

Text size without patches: 0x20e572
Text size with patches:    0x20e042
-----------------------------------
                   0x530 = 1328 bytes

Cheers,
Jan
[PATCH] smbfs: fix calculation of kernel_recvmsg size parameter in smb_receive
smb_receive calls kernel_recvmsg with a size that's the minimum of the
amount of buffer space in the kvec passed in or req->rq_rlen (which
represents the length of the response). This does not take into account any
response data that was read in an earlier pass through smb_receive.

If the first pass through smb_receive receives some but not all of the
response, then the next pass can call kernel_recvmsg with a size field
that's too big. kernel_recvmsg can overrun into the next response, throwing
off the alignment and making it unrecognizable. This causes messages like
this to pop up in the ring buffer:

smb_get_length: Invalid NBT packet, code=69

as well as other errors indicating that the response is unrecognizable.
Typically this is seen on a smbfs mount under heavy I/O.

This patch changes the code to use (req->rq_rlen - req->rq_bytes_recvd)
instead of just req->rq_rlen, since that should represent the amount of
unread data in the response.

Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>
Acked-by: Guenter Kukkukk <[EMAIL PROTECTED]>
---
 fs/smbfs/sock.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/smbfs/sock.c b/fs/smbfs/sock.c
index e48bd82..e37fe4d 100644
--- a/fs/smbfs/sock.c
+++ b/fs/smbfs/sock.c
@@ -329,9 +329,8 @@ smb_receive(struct smb_sb_info *server, struct smb_request *req)
 	msg.msg_control = NULL;

 	/* Dont repeat bytes and count available bufferspace */
-	rlen = smb_move_iov(&p, &num, iov, req->rq_bytes_recvd);
-	if (req->rq_rlen < rlen)
-		rlen = req->rq_rlen;
+	rlen = min_t(int, smb_move_iov(&p, &num, iov, req->rq_bytes_recvd),
+			(req->rq_rlen - req->rq_bytes_recvd));

 	result = kernel_recvmsg(sock, &msg, p, num, rlen, flags);
--
1.5.2.1
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.pa tch added to -mm tree
On Tue, 6 November 2007 10:11:49 +0100, Jan Blunck wrote:
> On Mon, Nov 05, Jörn Engel wrote:
> >
> > This patch changes some 400 lines, most if not all of which get longer
> > and more complicated to read.  23 get sufficiently longer to require an
> > additional linebreak.  I can't remember complexity being invited into
> > the kernel without good reasoning, yet the patch description is
> > surprisingly low on reasoning:
> >
> >     Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
>
> I don't measure complexity by lines of code or length of lines. Maybe I
> was not verbose enough in the description, fair.

If you have a better metric, please share it.  In the paragraph you
deleted I explicitly asked for _any_ metric that shows favorable
numbers.  Lacking numbers, we could only argue about our respective
personal taste.

> This is a cleanup series. In mostly no case is there a reason why someone
> would want to use a dentry by itself. This series reflects that fact in
> nameidata, where there is absolutely no reason at all.

400+ lines changed in this patch, some 10 in a followup patch that
combines dentry/vfsmount assignments into a single path assignment.  If
your argument above was valid, I would expect more simplifications and
fewer complications.  Call me a sceptic until further patches show up to
support your point.

> It enforces the correct
> order of getting/releasing refcounts on <dentry,vfsmount> pairs.

This argument I buy.

> It enables us
> to do some more cleanups wrt lookup (which are coming later).

Please send those patches.  I invite cleanups that do clean things up
and won't argue against them. ;)

> For stacking
> support in VFS it is essential to have the <dentry,vfsmount> pair in
> every place where you want to traverse the stack.

True, but unrelated to this patch.

> > If churn is the only effect of this, please consider it NAKed again.
>
> I wonder why you didn't speak up when this series was posted to LKML. It
> was at least posted three times before.

I did speak up.  Once.  If you missed that thread, please forgive me
missing those in which the same patch I disapproved of were resent
without me on Cc.

I'm not categorically against this struct path business.  It does have
some advantages at first glance.  But the patch we're arguing about
clearly makes code more complicated and harder to read.  We should have
more than superficial benefits if we decide to pay such a cost.

> Did I break your COW link patches? ;)

Nope.  No bovine maladies involved.

Jörn

--
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra
Re: Massive slowdown when re-querying large nfs dir
Al Boldi wrote:
> There is a massive (3-18x) slowdown when re-querying a large nfs dir (2k+
> entries) using a simple ls -l.
>
> On 2.6.23 client and server running userland rpc.nfs.V2:
> first try:   time -p ls -l <2k+ entry dir>  in ~2.5sec
> more tries:  time -p ls -l <2k+ entry dir>  in ~8sec
>
> first try:   time -p ls -l <5k+ entry dir>  in ~9sec
> more tries:  time -p ls -l <5k+ entry dir>  in ~180sec
>
> On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> first try:   time -p ls -l <2k+ entry dir>  in ~2.5sec
> more tries:  time -p ls -l <2k+ entry dir>  in ~7sec
>
> first try:   time -p ls -l <5k+ entry dir>  in ~8sec
> more tries:  time -p ls -l <5k+ entry dir>  in ~43sec
>
> Remounting the nfs-dir on the client resets the problem.
>
> Any ideas?

Ok, I played some more with this, and it turns out that nfsV3 is a lot
faster.  But, this does not explain why the 2.4.31 kernel is still over
4-times faster than 2.6.23.

Can anybody explain what's going on?

Thanks!

--
Al
Re: [PATCH 16/31] IGET: Stop FreeVXFS from using iget() and read_inode() [try #5]
David Howells <[EMAIL PROTECTED]> wrote: > > Not completely trivial - the ino_t -> unsigned long conversion needs to be > > propagated fairly widely in there. > > I'll just make it unsigned long. That's what iget() took as the inode number > parameter type, so it can't behave any differently as a first guess. On the other hand, __vxfs_iget() takes ino_t, so that's the right thing to do in this case. David - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 16/31] IGET: Stop FreeVXFS from using iget() and read_inode() [try #5]
Andrew Morton <[EMAIL PROTECTED]> wrote: > Not completely trivial - the ino_t -> unsigned long conversion needs to be > propagated fairly widely in there. I'll just make it unsigned long. That's what iget() took as the inode number parameter type, so it can't behave any differently as a first guess. David - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 16/31] IGET: Stop FreeVXFS from using iget() and read_inode() [try #5]
On Thu, 25 Oct 2007 17:35:14 +0100 David Howells <[EMAIL PROTECTED]> wrote:

> Stop the FreeVXFS filesystem from using iget() and read_inode().  Replace
> vxfs_read_inode() with vxfs_iget(), and call that instead of iget().
> vxfs_iget() then uses iget_locked() directly and returns a proper error
> code instead of an inode in the event of an error.
>
> vxfs_fill_super() returns any error incurred when getting the root inode
> instead of EINVAL.

alpha:

fs/freevxfs/vxfs_inode.c:298: error: conflicting types for 'vxfs_iget'
fs/freevxfs/vxfs_extern.h:61: error: previous declaration of 'vxfs_iget' was here

Not completely trivial - the ino_t -> unsigned long conversion needs to be
propagated fairly widely in there.

(when should we be using ino_t and when should we be using unsigned long,
anyway?)
Re: writeout stalls in current -git
On Mon, 2007-11-05 at 15:57 -0800, Andrew Morton wrote:
> > > Subject: mm: speed up writeback ramp-up on clean systems
> > >
> > > We allow violation of bdi limits if there is a lot of room on the
> > > system.  Once we hit half the total limit we start enforcing bdi
> > > limits and bdi ramp-up should happen.  Doing it this way avoids many
> > > small writeouts on an otherwise idle system and should also speed up
> > > the ramp-up.
>
> Given the problems we're having in there I'm a bit reluctant to go tossing
> hastily put together and inadequately tested stuff onto the fire.  And
> that's what this patch looks like to me.

Not really hastily; I think it was written before the stuff hit mainline.
Inadequately tested, perhaps - it's been in my and probably Wu's kernels
for a while. Granted, that's not a lot of testing in the face of those who
have problems at the moment.

> Wanna convince me otherwise?

I'm perfectly happy with this patch earning its credits in -mm for a while
and maybe going in around -rc4 or something like that (hoping that by then
we've fixed these nagging issues).

Another patch I did come up with yesterday - not driven by any problems in
that area - could perhaps join this one on that path:

---
Subject: mm: bdi: tweak task dirty penalty

Penalizing heavy dirtiers with 1/8-th the total dirty limit might be rather
excessive on large memory machines. Use sqrt to scale it sub-linearly.

Update the comment while we're there.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6-2/mm/page-writeback.c
===================================================================
--- linux-2.6-2.orig/mm/page-writeback.c
+++ linux-2.6-2/mm/page-writeback.c
@@ -213,17 +213,21 @@ static inline void task_dirties_fraction
 }

 /*
- * scale the dirty limit
+ * Task specific dirty limit:
  *
- * task specific dirty limit:
+ *	dirty -= 8 * sqrt(dirty) * p_{t}
  *
- *	dirty -= (dirty/8) * p_{t}
+ * Penalize tasks that dirty a lot of pages by lowering their dirty limit.
+ * This avoids infrequent dirtiers from getting stuck in this other guys
+ * dirty pages.
+ *
+ * Use a sub-linear function to scale the penalty, we only need a little
+ * room.
  */
 void task_dirty_limit(struct task_struct *tsk, long *pdirty)
 {
 	long numerator, denominator;
 	long dirty = *pdirty;
-	u64 inv = dirty >> 3;
+	u64 inv = 8 * int_sqrt(dirty);

 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
Re: writeout stalls in current -git
On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote: > [ 547.20] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858 > > global 12829 72 0 wc __ tw 0 sk 0 > [ 550.48] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834 > > global 12017 62 0 wc __ tw 0 sk 0 > [ 552.71] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810 > > global 11133 83 0 wc __ tw 0 sk 0 > [ 558.66] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786 > > global 10470 33 0 wc _M tw 0 sk 0 4s > [ 562.75] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762 > > global 10555 69 0 wc _M tw 0 sk 0 3s > [ 565.15] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738 > > global 9562 498 0 wc _M tw -2 sk 0 4s > [ 569.49] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712 > > global 8960 2 0 wc _M tw 0 sk 0 3s > [ 572.91] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688 > > global 8088 205 0 wc _M tw -13 sk 0 2s > [ 574.61] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651 > > global 7114 188 0 wc _M tw -1 sk 0 10s > [ 584.27] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626 > > global 14544 0 0 wc _M tw -1 sk 0 9s > [ 593.05] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601 > > global 24583 736 0 wc _M tw -1 sk 0 7s > [ 600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576 > > global 27004 6 0 wc _M tw 587 sk 0 > [ 600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139 > > global 27004 6 0 wc __ tw 1014 sk 0 The above messages and the below 'D' state pdflush indicate that one single writeback_inodes(4MB) call takes a long time(up to 10s!) to complete. Let's try reverting the below patch with `patch -R`? It looks like the most relevant change - if it's not a low level bug. 
> [note] first stall, the output from emerge stops, so it seems it can > not start processing the next file until the stall ends > [ 630.00] SysRq : Emergency Sync > [ 630.12] Emergency Sync complete > [ 632.85] SysRq : Show Blocked State > [ 632.85] taskPC stack pid father > [ 632.85] pdflush D 81000f091788 0 285 2 > [ 632.85] 810005d4da80 0046 0800 > 0071 > [ 632.85] 81000fd52400 8022d61c 80819b00 > 80819b00 > [ 632.85] 80815f40 80819b00 810100316f98 > > [ 632.85] Call Trace: > [ 632.85] [] task_rq_lock+0x4c/0x90 > [ 632.85] [] __wake_up_common+0x5a/0x90 > [ 632.85] [] __down+0xa7/0x11e > [ 632.85] [] default_wake_function+0x0/0x10 > [ 632.85] [] __down_failed+0x35/0x3a > [ 632.85] [] xfs_buf_lock+0x3e/0x40 > [ 632.85] [] _xfs_buf_find+0x13e/0x240 > [ 632.85] [] xfs_buf_get_flags+0x6f/0x190 > [ 632.85] [] xfs_buf_read_flags+0x12/0xa0 > [ 632.85] [] xfs_trans_read_buf+0x64/0x340 > [ 632.85] [] xfs_itobp+0x81/0x1e0 > [ 632.85] [] write_cache_pages+0x123/0x330 > [ 632.85] [] xfs_iflush+0xfe/0x520 > [ 632.85] [] __down_read_trylock+0x42/0x60 > [ 632.85] [] xfs_inode_flush+0x179/0x1b0 > [ 632.85] [] xfs_fs_write_inode+0x2f/0x90 > [ 632.85] [] __writeback_single_inode+0x2ac/0x380 > [ 632.85] [] dm_table_any_congested+0x2e/0x80 > [ 632.85] [] generic_sync_sb_inodes+0x20d/0x330 > [ 632.85] [] writeback_inodes+0xa2/0xe0 > [ 632.85] [] wb_kupdate+0xa6/0x140 > [ 632.85] [] pdflush+0x0/0x1e0 > [ 632.85] [] pdflush+0x110/0x1e0 > [ 632.85] [] wb_kupdate+0x0/0x140 > [ 632.85] [] kthread+0x4b/0x80 > [ 632.85] [] child_rip+0xa/0x12 > [ 632.85] [] kthread+0x0/0x80 > [ 632.85] [] child_rip+0x0/0x12 > [ 632.85] > [ 632.85] emergeD 0 6220 6129 > [ 632.85] 810103ced9f8 0086 > 0071 > [ 632.85] 81000fd52cf8 80819b00 > 80819b00 > [ 632.85] 80815f40 80819b00 810103ced9b8 > 810103ced9a8 > [ 632.85] Call Trace: > [ 632.85] [] __down+0xa7/0x11e > [ 632.85] [] default_wake_function+0x0/0x10 > [ 632.85] [] __down_failed+0x35/0x3a > [ 632.85] [] xfs_buf_lock+0x3e/0x40 > [ 632.85] [] 
_xfs_buf_find+0x13e/0x240 > [ 632.85] [] xfs_buf_get_flags+0x6f/0x190 > [ 632.85] [] xfs_buf_read_flags+0x12/0xa0 > [ 632.85] [] xfs_trans_read_buf+0x64/0x340 > [ 632.85] [] xfs_itobp+0x81/0x1e0 > [ 632.85] [] xfs_buf_rele+0x2e/0xd0 > [ 632.85] [] xfs_iflush+0xfe/0x520 > [ 632.85] [] __down_read_trylock+0x42/0x60 > [ 632.85] [] xfs_inode_item_push+0x12/0x20 > [ 632.85] [] xfs_trans_push_ail+0x267/0x2b0 > [ 632.85] [] xfs_log_reserve+0x72/0x120 > [ 632.85] [] xfs_trans_reserve+0xa8/0x210 > [ 63
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.pa tch added to -mm tree
On Mon, Nov 05, Jörn Engel wrote:
> > Subject: Embed a struct path into struct nameidata instead of
> > nd->{dentry,mnt}
> > From: Jan Blunck <[EMAIL PROTECTED]>
> >
> > Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
> >
> > Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> > Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
> > Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
> > Cc: Al Viro <[EMAIL PROTECTED]>
> > CC:
> > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
>
> Frowned-upon-by: Joern Engel <[EMAIL PROTECTED]>
>
> This patch changes some 400 lines, most if not all of which get longer
> and more complicated to read.  23 get sufficiently longer to require an
> additional linebreak.  I can't remember complexity being invited into
> the kernel without good reasoning, yet the patch description is
> surprisingly low on reasoning:
>
>     Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

I don't measure complexity by lines of code or length of lines. Maybe I was
not verbose enough in the description, fair.

This is a cleanup series. In mostly no case is there a reason why someone
would want to use a dentry by itself. This series reflects that fact in
nameidata, where there is absolutely no reason at all. It enforces the
correct order of getting/releasing refcounts on <dentry,vfsmount> pairs. It
enables us to do some more cleanups wrt lookup (which are coming later).
For stacking support in VFS it is essential to have the <dentry,vfsmount>
pair in every place where you want to traverse the stack.

> If churn is the only effect of this, please consider it NAKed again.

I wonder why you didn't speak up when this series was posted to LKML. It
was at least posted three times before.

Did I break your COW link patches? ;)
Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing
On Nov. 06, 2007, 7:06 +0200, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Sat, 03 Nov 2007 07:09:25 -0400 Steve Dickson <[EMAIL PROTECTED]> wrote:
>
>> The following patch stops NFS sillyname renames and umounts from racing.
>
> (appropriate cc's added)
>
>> I have a test script that does the following:
>> 1) start nfs server
>> 2) mount loopback
>> 3) open file in background
>> 4) remove file
>> 5) stop nfs server
>> 6) kill -9 process which has file open
>> 7) restart nfs server
>> 8) umount loopback mount.
>>
>> After umount I got the "VFS: Busy inodes after unmount" message
>> because the processing of the rename has not finished.
>>
>> Below is a patch that uses the new silly_count mechanism to
>> synchronize sillyname processing and umounts. The patch introduces a
>> nfs_put_super() routine that waits until the nfsi->silly_count count
>> is zero.
>>
>> A side-effect of finding and waiting for all the inodes to
>> finish the sillyname processing is that I need to traverse
>> the sb->s_inodes list in the super block. To do that
>> safely the inode_lock spin lock has to be held. So for
>> modules to be able to "see" that lock I needed to
>> EXPORT_SYMBOL_GPL() it.
>>
>> Any objections to exporting the inode_lock spin lock?
>> If so, how should modules _safely_ access the s_inode list?
>>
>> steved.
>>
>> Author: Steve Dickson <[EMAIL PROTECTED]>
>> Date:   Wed Oct 31 12:19:26 2007 -0400
>>
>>     Close a unlink/sillyname rename and umount race by adding a
>>     nfs_put_super routine that will run through all the inodes
>>     currently on the super block, waiting for those that are
>>     in the middle of a sillyname rename or removal.
>>
>>     This patch stops the infamous "VFS: Busy inodes after unmount..."
>>     warning during umounts.
>>
>>     Signed-off-by: Steve Dickson <[EMAIL PROTECTED]>
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index ed35383..da9034a 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
>>   * the i_state of an inode while it is in use..
>>   */
>> DEFINE_SPINLOCK(inode_lock);
>> +EXPORT_SYMBOL_GPL(inode_lock);
>
> That's going to make hch unhappy.
>
> Your email client is performing space-stuffing.
> See http://mbligh.org/linuxdocs/Email/Clients/Thunderbird
>
>> static struct file_system_type nfs_fs_type = {
>> 	.owner		= THIS_MODULE,
>> @@ -223,6 +225,7 @@ static const struct super_operations nfs_sops = {
>> 	.alloc_inode	= nfs_alloc_inode,
>> 	.destroy_inode	= nfs_destroy_inode,
>> 	.write_inode	= nfs_write_inode,
>> +	.put_super	= nfs_put_super,
>> 	.statfs		= nfs_statfs,
>> 	.clear_inode	= nfs_clear_inode,
>> 	.umount_begin	= nfs_umount_begin,
>> @@ -1767,6 +1770,30 @@ static void nfs4_kill_super(struct super_block *sb)
>> 	nfs_free_server(server);
>> }
>>
>> +void nfs_put_super(struct super_block *sb)
>
> This was (correctly) declared to be static.  We should define it that way
> too (I didn't know you could do this, actually).
>
>> +{
>> +	struct inode *inode;
>> +	struct nfs_inode *nfsi;
>> +	/*
>> +	 * Make sure there are no outstanding renames
>> +	 */
>> +relock:
>> +	spin_lock(&inode_lock);
>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>> +		nfsi = NFS_I(inode);
>> +		if (atomic_read(&nfsi->silly_count) > 0) {
>> +			/* Keep this inode around during the wait */
>> +			atomic_inc(&inode->i_count);
>> +			spin_unlock(&inode_lock);
>> +			wait_event(nfsi->waitqueue,
>> +				atomic_read(&nfsi->silly_count) == 1);
>> +			iput(inode);
>> +			goto relock;
>> +		}
>> +	}
>> +	spin_unlock(&inode_lock);
>> +}
>
> That's an O(n^2) search.  If it is at all possible to hit a catastrophic
> slowdown in here, you can bet that someone out there will indeed hit it in
> real life.
>
> I'm too lazy to look, but we might need to check things like I_FREEING
> and I_CLEAR before taking a ref on this inode.

It'd be very nice if the silly renamed inodes (with silly_count > 1) were
moved to a different list in the first pass, under the inode_lock, and then
waited on until silly_count <= 1 in a second pass only on the filtered
list. This will provide you with O(n).
Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing
On Tue, Nov 06, 2007 at 10:24:50AM +0200, Benny Halevy wrote: > It'd be very nice if the silly renamed inodes (with silly_count > 1) were > moved > to a different list in the first pass, under the inode_lock, and then waited > on > until silly_count <= 1 in a second pass only on the filtered list. This will > provide you with O(1). It's absolutely pointless, starting with any kind of searching for inodes, etc. If you want fs shutdown _not_ to happen until async activity of that kind is over, don't reinvent the sodding wheels, just tell VFS that you are holding an active reference to superblock. End of story. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with accessing namespace_sem from LSM.
On 11/6/2007 1:11 PM, Arjan van de Ven wrote:
> On Tue, 06 Nov 2007 13:00:41 +0900 Tetsuo Handa <[EMAIL PROTECTED]> wrote:
>> Hello.
>>
>> I found that accessing namespace_sem from security_inode_create()
>> causes lockdep warning when compiled with CONFIG_PROVE_LOCKING=y .
>
> sounds like you have an AB-BA deadlock...

sed /you/AppArmor shipped with OpenSuSE 10.1 and 10.2/ :)

Although I don't think this deadlock occurs very often, it does occur when
it occurs, so care should be taken promptly. There is no way around this
problem, given its nature. Passing a vfsmount parameter to the VFS helper
functions and LSM hooks seems to be a good choice to me.

Cheers,
Toshiharu Harada