Re: Massive slowdown when re-querying large nfs dir

2007-11-06 Thread Neil Brown
On Tuesday November 6, [EMAIL PROTECTED] wrote:
> > On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi <[EMAIL PROTECTED]> wrote:
> > Al Boldi wrote:
> > > There is a massive (3-18x) slowdown when re-querying a large nfs dir (2k+
> > > entries) using a simple ls -l.
> > >
> > > On 2.6.23 client and server running userland rpc.nfs.V2:
> > > first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> > > more tries: time -p ls -l <2k+ entry dir>  in ~8sec
> > >
> > > first  try: time -p ls -l <5k+ entry dir>  in ~9sec
> > > more tries: time -p ls -l <5k+ entry dir>  in ~180sec
> > >
> > > On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> > > first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> > > more tries: time -p ls -l <2k+ entry dir>  in ~7sec
> > >
> > > first  try: time -p ls -l <5k+ entry dir>  in ~8sec
> > > more tries: time -p ls -l <5k+ entry dir>  in ~43sec
> > >
> > > Remounting the nfs-dir on the client resets the problem.
> > >
> > > Any ideas?
> > 
> > Ok, I played some more with this, and it turns out that nfsV3 is a lot 
> > faster.  But, this does not explain why the 2.4.31 kernel is still over 
> > 4-times faster than 2.6.23.
> > 
> > Can anybody explain what's going on?
> > 
> 
> Sure, Neil can! ;)

Nuh.
He said "userland rpc.nfs.Vx".  I only do "kernel-land NFS".  In these
days of high specialisation, each line of code is owned by a different
person, and finding the right person is hard.

I would suggest getting a 'tcpdump -s0' trace and seeing (with
wireshark) what is different between the various cases.

NeilBrown


Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/7/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's not synchronous writes that we are doing - we're just
> submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> The "synchronous" nature appears to be coming from higher level
> locking when reclaiming inodes (on the flush lock). It appears that
> inode write clustering is failing completely so we are writing the
> same block multiple times i.e. once for each inode in the cluster we
> have to write.

Works for me. The only remaining stalls are sub-second and look
completely valid, considering the number of files being removed.

iostat 10 from this test:
 3  0  0 3500192332 20495600   105  8512 1809 6473  6 10 83  1
 0  0  0 3500200332 20457600 0  4367 1355 3712  2  6 92  0
 2  0  0 3504264332 20352800 0  6805 1912 4967  4  8 88  0
 0  0  0 3511632332 20352800 0  2843  805 1791  2  4 94  0
 0  0  0 3516852332 20351600 0  3375  879 2712  3  5 93  0
 0  0  0 3530544332 20266800   186   776  488 1152  4  2 89  4
 0  0  0 3574788332 20496000   226   326  358  787  0  1 98  0
 0  0  0 3576820332 20496000 0   376  332  737  0  0 99  0
 0  0  0 3578432332 20496000 0   356  293  606  1  1 99  0
 0  0  0 3580192332 20496000 0   101  104  384  0  0 99  0

I'm pleased to note that this is now much faster again.
Thanks!

Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

CC's, please note: it looks like this was really a different problem
than the 100% iowait that was seen with reiserfs.
Also the one complete stall I have seen is probably something else.
But I have not been able to reproduce this again with -mm and have
never seen this on mainline, so I will just ignore that single event
until I see it again.

Torsten

> ---
>  fs/xfs/xfs_iget.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
> ===
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c   2007-11-02 13:44:46.0 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
> @@ -248,7 +248,7 @@ finish_inode:
> icl = NULL;
> if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
> first_index, 1)) {
> -   if ((iq->i_ino & mask) == first_index)
> +   if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
> icl = iq->i_cluster;
> }
>
>


Re: Massive slowdown when re-querying large nfs dir

2007-11-06 Thread Andrew Morton
> On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi <[EMAIL PROTECTED]> wrote:
> Al Boldi wrote:
> > There is a massive (3-18x) slowdown when re-querying a large nfs dir (2k+
> > entries) using a simple ls -l.
> >
> > On 2.6.23 client and server running userland rpc.nfs.V2:
> > first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> > more tries: time -p ls -l <2k+ entry dir>  in ~8sec
> >
> > first  try: time -p ls -l <5k+ entry dir>  in ~9sec
> > more tries: time -p ls -l <5k+ entry dir>  in ~180sec
> >
> > On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> > first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> > more tries: time -p ls -l <2k+ entry dir>  in ~7sec
> >
> > first  try: time -p ls -l <5k+ entry dir>  in ~8sec
> > more tries: time -p ls -l <5k+ entry dir>  in ~43sec
> >
> > Remounting the nfs-dir on the client resets the problem.
> >
> > Any ideas?
> 
> Ok, I played some more with this, and it turns out that nfsV3 is a lot 
> faster.  But, this does not explain why the 2.4.31 kernel is still over 
> 4-times faster than 2.6.23.
> 
> Can anybody explain what's going on?
> 

Sure, Neil can! ;)


Re: writeout stalls in current -git

2007-11-06 Thread David Chinner
On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
> On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> > On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > > Rather than vmstat, can you use something like iostat to show how busy 
> > > your
> > > disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
> > 
> > Both "vmstat 10" and "iostat -x 10" output from this test:
> > procs ---memory-- ---swap-- -io -system-- cpu
> >  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
> >  2  0  0 3700592  0  85424003183  108  244  2  1 95  1
> > -> emerge reads something, don't know for sure what...
> >  1  0  0 3665352  0  8794000   239 2  343  585  2  1 97  0
> 
> > 
> > The last 20% of the btrace look more or less completely like this, no
> > other programs do any IO...
> > 
> > 253,03   104626   526.293450729   974  C  WS 79344288 + 8 [0]
> > 253,03   104627   526.293455078   974  C  WS 79344296 + 8 [0]
> > 253,0136469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
> > 253,0136470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
> ^^
> Apparently we are doing synchronous writes. That would explain why
> it is slow. We shouldn't be doing synchronous writes here. I'll see if
> I can reproduce this.
> 
> 
> 
> Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
> look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely so we are writing the
same block multiple times i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change as we haven't
changed anything in the reclaim code recently.

/me scurries off to run some tests 

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being
read in or allocated. This is a regression, introduced by this
mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb
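For readers not steeped in XFS internals: the per-AG in-core inode radix
tree (pag_ici_root) is indexed by AG-relative inode numbers, so first_index
in the hunk below is AG-relative while iq->i_ino is the full filesystem
inode number. A minimal, self-contained sketch of the mismatch (illustrative
layout constants and helpers, not the real XFS macros):

#include <stdio.h>

#define AGINO_BITS		16	/* hypothetical: low 16 bits are the AG-relative inode number */
#define INO_TO_AGINO(ino)	((ino) & ((1UL << AGINO_BITS) - 1))
#define CLUSTER_MASK		(~(8UL - 1))	/* hypothetical: 8 inodes per cluster */

int main(void)
{
	unsigned long ino = (3UL << AGINO_BITS) | 130;	/* agino 130 in allocation group 3 */
	unsigned long first_index = INO_TO_AGINO(ino) & CLUSTER_MASK;	/* 128 */

	/* old check: the AG bits survive the mask, so this fails for any AG but 0 */
	printf("old check matches: %d\n", (ino & CLUSTER_MASK) == first_index);

	/* fixed check: compare AG-relative inode numbers on both sides */
	printf("new check matches: %d\n", (INO_TO_AGINO(ino) & CLUSTER_MASK) == first_index);
	return 0;
}

With the old test the comparison only ever succeeds in allocation group 0,
so inode clustering silently stops working everywhere else; the one-liner
below compares AG-relative numbers on both sides.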

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_iget.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c   2007-11-02 13:44:46.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
icl = NULL;
if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
first_index, 1)) {
-   if ((iq->i_ino & mask) == first_index)
+   if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
icl = iq->i_cluster;
}
 


[PATCH 3/4] add rwmem type backed by pages

2007-11-06 Thread Zach Brown
This lets callers specify a region of memory to read from or write to with
an array of page/offset/len tuples.  There have been specific requests to
do this from servers which want to do O_DIRECT from the kernel.  (knfsd?)

This could also be used by places which currently hold a kmap() and call
fop->write.  ecryptfs_write_lower_page_segment() is one such caller.
---
 fs/rwmem.c|   66 +
 include/linux/rwmem.h |   16 
 2 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/fs/rwmem.c b/fs/rwmem.c
index 0433ba4..c87e8a4 100644
--- a/fs/rwmem.c
+++ b/fs/rwmem.c
@@ -90,3 +90,69 @@ struct rwmem_ops rwmem_iovec_ops = {
.seg_bytes  = rwmem_iovec_seg_bytes,
.get_seg_pages  = rwmem_iovec_get_seg_pages,
 };
+
+void rwmem_pages_init(struct rwmem *rwm)
+{
+   struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+   struct pgol *pgol;
+   unsigned long i;
+
+   rwm->total_bytes = 0;
+   rwm->nr_pages = rwm->nr_segs;
+   rwm->boundary_bits = 0;
+
+   for (i = 0; i < rwm->nr_segs; i++) {
+   pgol = &rwp->pgol[i];
+
+   rwm->total_bytes += pgol->len;
+   rwm->boundary_bits |= pgol->offset | pgol->len;
+   }
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_pages_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+   struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+   BUG_ON(i >= rwm->nr_segs);
+   return rwp->pgol[i].offset;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_pages_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+   struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+   BUG_ON(i >= rwm->nr_segs);
+   return rwp->pgol[i].len;
+}
+
+/*
+ * For now each page is its own seg.
+ */
+int rwmem_pages_get_seg_pages(struct rwmem *rwm, unsigned long i,
+ unsigned long *cursor, struct page **pages,
+ unsigned long max_pages, int write)
+{
+   struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+   int ret = 0;
+
+   BUG_ON(i >= rwm->nr_segs);
+   BUG_ON(*cursor != 0);
+
+   if (max_pages) {
+   pages[0] = rwp->pgol[i].page;
+   get_page(pages[0]);
+   ret = 1;
+   }
+   return ret;
+}
+
+struct rwmem_ops rwmem_pages_ops = {
+   .init   = rwmem_pages_init,
+   .seg_page_offset= rwmem_pages_seg_page_offset,
+   .seg_bytes  = rwmem_pages_seg_bytes,
+   .get_seg_pages  = rwmem_pages_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
index 666f9f4..47019f0 100644
--- a/include/linux/rwmem.h
+++ b/include/linux/rwmem.h
@@ -26,4 +26,20 @@ struct rwmem_iovec {
 };
 struct rwmem_ops rwmem_iovec_ops;
 
+/*
+ * How many times do we need this in subsystems before we make a universal
+ * struct?  (bio_vec, skb_frag_struct, pipe_buffer)
+ */
+struct pgol {
+   struct page *page;
+   unsigned int offset;
+   unsigned int len;
+};
+
+struct rwmem_pages {
+   struct rwmemrwmem;
+   struct pgol *pgol;
+};
+struct rwmem_ops rwmem_pages_ops;
+
 #endif
-- 
1.5.2.2



[PATCH 1/4] struct rwmem: an abstraction of the memory argument to read/write

2007-11-06 Thread Zach Brown
This adds a structure and interface to represent the segments of memory
which are acting as the source or destination for a read or write operation.

Callers would fill this structure and then pass it down the rw path.

The intent is to let stages in the rw path make specific calls against this
API and structure instead of working with, say, struct iovec natively.

The main intent of this is to enable kernel calls into the rw path which
specify memory with page/offset/len tuples.

Another potential benefit of this is the reduction in iterations over iovecs at
various points in the kernel.  Each iov_length(iov) call, for example, could be
translated into rwm->total_bytes.  O_DIRECTs check of memory alignment is
changed into a single test against rwm->boundary_bits.

I imagine this might integrate well with the iov_iter interface, though I
haven't examined that in any depth.
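As a hedged illustration of the boundary_bits point above (a sketch, not a
hunk from this series): since init ORs every segment's base address and
length into rwm->boundary_bits, the O_DIRECT alignment check can collapse
from a per-iovec loop into a single mask test, roughly:

	/* sketch only -- assumes the rwm->boundary_bits field set up in this patch */
	unsigned long blocksize_mask = (1UL << blkbits) - 1;

	if (rwm->boundary_bits & blocksize_mask)
		return -EINVAL;	/* some segment base or length is not block aligned */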
---
 fs/Makefile   |2 +-
 fs/rwmem.c|   92 +
 include/linux/rwmem.h |   29 +++
 3 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 fs/rwmem.c
 create mode 100644 include/linux/rwmem.h

diff --git a/fs/Makefile b/fs/Makefile
index 500cf15..c342365 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o
+   stack.o rwmem.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/rwmem.c b/fs/rwmem.c
new file mode 100644
index 000..0433ba4
--- /dev/null
+++ b/fs/rwmem.c
@@ -0,0 +1,92 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static inline unsigned long pages_spanned(unsigned long addr,
+ unsigned long bytes)
+{
+   return ((addr + bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) -
+   (addr >> PAGE_SHIFT);
+}
+
+void rwmem_iovec_init(struct rwmem *rwm)
+{
+   struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+   struct iovec *iov;
+   unsigned long i;
+
+   rwm->total_bytes = 0;
+   rwm->nr_pages = 0;
+   rwm->boundary_bits = 0;
+
+   for (i = 0; i < rwm->nr_segs; i++) {
+   iov = &rwi->iov[i];
+
+   rwm->total_bytes += iov->iov_len;
+   rwm->nr_pages += pages_spanned((unsigned long)iov->iov_base,
+   iov->iov_len);
+   rwm->boundary_bits |= (unsigned long)iov->iov_base |
+ (unsigned long)iov->iov_len;
+   }
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_iovec_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+   struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+   BUG_ON(i >= rwm->nr_segs);
+   return (unsigned long)rwi->iov[i].iov_base & ~PAGE_MASK;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_iovec_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+   struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+   BUG_ON(i >= rwm->nr_segs);
+   return rwi->iov[i].iov_len;
+}
+
+int rwmem_iovec_get_seg_pages(struct rwmem *rwm, unsigned long i,
+ unsigned long *cursor, struct page **pages,
+ unsigned long max_pages, int write)
+{
+   struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+   struct iovec *iov;
+   int ret;
+
+   BUG_ON(i >= rwm->nr_segs);
+   iov = &rwi->iov[i];
+
+   if (*cursor == 0)
+   *cursor = (unsigned long)iov->iov_base;
+
+   max_pages = min(pages_spanned(*cursor, iov->iov_len -
+ (*cursor - (unsigned long)iov->iov_base)),
+   max_pages);
+
+   down_read(&current->mm->mmap_sem);
+   ret = get_user_pages(current, current->mm, *cursor, max_pages, write,
+0, pages, NULL);
+   up_read(&current->mm->mmap_sem);
+
+   if (ret > 0) { 
+   *cursor += ret * PAGE_SIZE;
+   if (*cursor >= (unsigned long)iov->iov_base + iov->iov_len)
+   *cursor = ~0;
+   }
+
+   return ret;
+}
+
+struct rwmem_ops rwmem_iovec_ops = {
+   .init   = rwmem_iovec_init,
+   .seg_page_offset= rwmem_iovec_seg_page_offset,
+   .seg_bytes  = rwmem_iovec_seg_bytes,
+   .get_seg_pages  = rwmem_iovec_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
new file mode 100644
index 000..666f9f4
--- /dev/null
+++ b/include/linux/rwmem.h
@@ -0,0 +1,29 @@
+#ifndef

[PATCH 2/4] dio: use rwmem to work with r/w memory arguments

2007-11-06 Thread Zach Brown
This switches dio to work with the rwmem api to get memory pages for the IO
instead of working with iovecs directly.

It can use direct rwm struct accesses for some static universal properties of a
set of memory segments that make up the buffer argument.

It uses helper functions to work with the underlying data structures directly.
---
 fs/direct-io.c |  123 +---
 1 files changed, 55 insertions(+), 68 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index acf0da1..0d5ed41 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -34,7 +34,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 
 /*
@@ -105,11 +105,12 @@ struct dio {
sector_t cur_page_block;/* Where it starts */
 
/*
-* Page fetching state. These variables belong to dio_refill_pages().
+* Page fetching state. direct_io_worker() sets these for
+* dio_refill_pages() who modifies them as it fetches.
 */
-   int curr_page;  /* changes */
-   int total_pages;/* doesn't change */
-   unsigned long curr_user_address;/* changes */
+   struct rwmem *rwm;
+   unsigned long cur_seg;
+   unsigned long cur_seg_cursor;
 
/*
 * Page queue.  These variables belong to dio_refill_pages() and
@@ -146,21 +147,11 @@ static inline unsigned dio_pages_present(struct dio *dio)
  */
 static int dio_refill_pages(struct dio *dio)
 {
+   struct rwmem *rwm = dio->rwm;
int ret;
-   int nr_pages;
-
-   nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-   down_read(&current->mm->mmap_sem);
-   ret = get_user_pages(
-   current,/* Task for fault acounting */
-   current->mm,/* whose pages? */
-   dio->curr_user_address, /* Where from? */
-   nr_pages,   /* How many pages? */
-   dio->rw == READ,/* Write to memory? */
-   0,  /* force (?) */
-   &dio->pages[0],
-   NULL);  /* vmas */
-   up_read(&current->mm->mmap_sem);
+
+   ret = rwm->ops->get_seg_pages(rwm, dio->cur_seg, &dio->cur_seg_cursor,
+ dio->pages, DIO_PAGES, dio->rw == READ);
 
if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
struct page *page = ZERO_PAGE(0);
@@ -180,8 +171,6 @@ static int dio_refill_pages(struct dio *dio)
}
 
if (ret >= 0) {
-   dio->curr_user_address += ret * PAGE_SIZE;
-   dio->curr_page += ret;
dio->head = 0;
dio->tail = ret;
ret = 0;
@@ -938,11 +927,9 @@ out:
  */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
-   const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
-   unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
-   struct dio *dio)
+   struct rwmem *rwm, loff_t offset, unsigned blkbits,
+   get_block_t get_block, dio_iodone_t end_io, struct dio *dio)
 {
-   unsigned long user_addr; 
unsigned long flags;
int seg;
ssize_t ret = 0;
@@ -966,44 +953,33 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
spin_lock_init(&dio->bio_lock);
dio->refcount = 1;
 
+   dio->rwm = rwm;
+
/*
 * In case of non-aligned buffers, we may need 2 more
 * pages since we need to zero out first and last block.
 */
+   dio->pages_in_io = rwm->nr_pages;
if (unlikely(dio->blkfactor))
-   dio->pages_in_io = 2;
-
-   for (seg = 0; seg < nr_segs; seg++) {
-   user_addr = (unsigned long)iov[seg].iov_base;
-   dio->pages_in_io +=
-   ((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE
-   - user_addr/PAGE_SIZE);
-   }
+   dio->pages_in_io += 2;
 
-   for (seg = 0; seg < nr_segs; seg++) {
-   user_addr = (unsigned long)iov[seg].iov_base;
-   dio->size += bytes = iov[seg].iov_len;
+   for (seg = 0; seg < rwm->nr_segs; seg++) {
+   dio->size += bytes = rwm->ops->seg_bytes(rwm, seg);
 
/* Index into the first page of the first block */
-   dio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
+   dio->first_block_in_page = 
+   rwm->ops->seg_page_offset(rwm, seg) >> blkbits;
dio->final_block_in_request = dio->block_in_file +
(bytes >> blkbits);
/* Page fetching state */
+   dio->cur_seg = seg;
+   dio->cur_seg_cursor = 0;
dio->head = 0;
dio->tail = 0;
- 

[PATCH 4/4] add dio interface for page/offset/len tuples

2007-11-06 Thread Zach Brown
This is what it might look like to feed pgol in to some part of the fs stack
instead of iovecs.  I imagine we'd want to do it at a much higher level, perhaps
something like vfs_write_pages().
---
 fs/direct-io.c |   21 +
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0d5ed41..e86bcbc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1213,3 +1213,24 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 }
 
 EXPORT_SYMBOL(__blockdev_direct_IO);
+
+ssize_t
+__blockdev_direct_IO_pages(int rw, struct kiocb *iocb, struct inode *inode,
+   struct block_device *bdev, struct pgol *pgol, loff_t offset, 
+   unsigned long nr_pages, get_block_t get_block, dio_iodone_t end_io,
+   int dio_lock_type)
+{
+   struct rwmem_pages rwp = {
+   .rwmem.ops = &rwmem_pages_ops,
+   .rwmem.nr_segs = nr_pages,
+   .pgol = pgol,
+   };
+   struct rwmem *rwm = &rwp.rwmem;
+
+   rwm->ops->init(rwm);
+
+   return blockdev_direct_IO_rwmem(rw, iocb, inode, bdev, rwm, offset,
+   get_block, end_io, dio_lock_type);
+}
+
+EXPORT_SYMBOL(__blockdev_direct_IO_pages);
-- 
1.5.2.2
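As an editor's sketch (not part of the series) of how a kernel caller such
as knfsd might push a single locked page through this interface -- here
example_get_block stands in for whatever get_block_t the filesystem
provides:

static ssize_t example_direct_write_page(struct kiocb *iocb, struct inode *inode,
					 struct page *page, unsigned offset,
					 unsigned len, loff_t pos)
{
	/* one page/offset/len tuple describing the memory to write from */
	struct pgol pgol = {
		.page	= page,
		.offset	= offset,
		.len	= len,
	};

	/* example_get_block is hypothetical; DIO_LOCKING is the usual lock type */
	return __blockdev_direct_IO_pages(WRITE, iocb, inode, inode->i_sb->s_bdev,
					  &pgol, pos, 1, example_get_block,
					  NULL, DIO_LOCKING);
}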



[RFC] fs io with struct page instead of iovecs

2007-11-06 Thread Zach Brown
At the FS meeting at LCE there was some talk of doing O_DIRECT writes from the
kernel with pages instead of with iovecs.  This patch series explores one
direction we could head in to achieve this.

We obviously can't just translate user iovecs (which might represent more
memory than the machine has) to pinned pages and pass page struct pointers all
the way down the stack.

And we don't want to duplicate a lot of the generic FS machinery down a
duplicate path which works with pages instead of iovecs.

So this patch series heads in the direction of abstracting out the memory
argument to the read and write calls.  It's still based on segments, but we
hide that it's either iovecs or arrays of page pointers from the callers in the
rw stack.

I didn't go too nuts with the syntactic sugar.  This is just intended to show
the basic mechanics.  We can obviously pretty it up if we think this is a sane
thing to be doing.

The series builds but has never been run.

What do people (that's you, Christoph) think?  I'm flexible.


Re: writeout stalls in current -git

2007-11-06 Thread David Chinner
On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Rather than vmstat, can you use something like iostat to show how busy your
> > disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
> 
> Both "vmstat 10" and "iostat -x 10" output from this test:
> procs ---memory-- ---swap-- -io -system-- cpu
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
>  2  0  0 3700592  0  85424003183  108  244  2  1 95  1
> -> emerge reads something, don't know for sure what...
>  1  0  0 3665352  0  8794000   239 2  343  585  2  1 97  0

> 
> The last 20% of the btrace look more or less completely like this, no
> other programs do any IO...
> 
> 253,03   104626   526.293450729   974  C  WS 79344288 + 8 [0]
> 253,03   104627   526.293455078   974  C  WS 79344296 + 8 [0]
> 253,0136469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
> 253,0136470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
^^
Apparently we are doing synchronous writes. That would explain why
it is slow. We shouldn't be doing synchronous writes here. I'll see if
I can reproduce this.



Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
look into this further and send a patch when I have something concrete.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/6/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> --
> Subject: writeback: remove pages_skipped accounting in 
> __block_write_full_page()
> From: Fengguang Wu <[EMAIL PROTECTED]>
>
> Miklos Szeredi <[EMAIL PROTECTED]> and me identified a writeback bug:
[snip]
>  fs/buffer.c |1 -
>  fs/xfs/linux-2.6/xfs_aops.c |5 ++---
>  2 files changed, 2 insertions(+), 4 deletions(-)

I have now tested v2.6.24-rc1-748-g2655e2c with the above patch reverted.
This does still stall.

On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Rather than vmstat, can you use something like iostat to show how busy your
> disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.

Both "vmstat 10" and "iostat -x 10" output from this test:
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 3700592  0  85424003183  108  244  2  1 95  1
-> emerge reads something, don't know for sure what...
 1  0  0 3665352  0  8794000   239 2  343  585  2  1 97  0
 0  0  0 3657728  0  9122800   32235  445  833  0  0 99  0
 1  0  0 3653136  0  9469200   33033  455  844  1  1 98  0
 0  0  0 3646836  0  9772000   289 3  422  751  1  1 98  0
 0  0  0 3616468  0  9969200   18533  399  614  9  3 87  1
-> starts to remove the kernel tree
 0  0  0 3610452  0 10259200   138  3598 1398 3945  3  6 90  1
 0  0  0 3607136  0 10454800 2  5962 1919 6070  4  9 87  0
 0  0  0 3606636  0 10508000 0  1539  810 2200  1  2 97  0
-> first stall 28 sec.
 0  0  0 3606592  0 10529200 0   698  679 1390  0  1 99  0
 0  0  0 3606440  0 10553200 0   658  690 1457  0  0 99  0
 0  0  0 3606068  0 10612800 1  1780  947 1982  1  3 96  0
-> second stall 24 sec.
 0  0  0 3606036  0 10646400 4   858  758 1457  0  1 98  0
 0  0  0 3605380  0 10687200 0  1173  807 1880  1  2 97  0
 0  0  0 3605000  0 10774800 1  2413 1103 2996  2  4 94  0
-> third stall 38 sec.
 0  0  0 3604488  0 1084720045   897  748 1577  0  1 98  0
 0  0  0 3604176  0 10876400 0   824  752 1700  0  1 98  0
 0  0  0 3604012  0 10898800 0   660  643 1237  0  1 99  0
 0  0  0 3608936  0 11012000 1  3490 1232 3455  3  5 91  0
-> fourth stall 64 sec.
 1  0  0 3609060  0 11029600 0   568  669 1222  0  1 99  0
 0  0  0 3609464  0 11049600 0   604  638 1366  0  1 99  0
 0  0  0 3609244  0 11074000 0   844  714 1282  0  1 99  0
 0  0  0 3609508  0 11091200 0   552  584 1185  1  1 99  0
 2  0  0 3609436  0 3200 0   658  643 1442  0  1 99  0
 0  0  0 3609212  0 11134800 0   714  637 1382  0  0 99  0
 0  0  0 3619132  0 11049200   130  1086  736 1870  4  3 91  2
 0  0  0 3657016  0 11549600   466   589  718 1367  1  1 98  0
-> emerge finishes, dirty data was <1Mb the whole time, now stays below 300kb
(btrace running...)
 0  0  0 3657844  0 11566000 0   564  635 1226  1  1 99  0
 0  0  0 3658236  0 11584000 0   582  600 1248  1  0 99  0
 0  0  0 3658296  0 11601200 0   566  606 1232  1  1 99  0
 0  0  0 3657924  0 11621200 0   688  596 1321  1  0 99  0
 0  0  0 3658252  0 11641600 0   631  642 1356  1  0 98  0
 0  0  0 3658184  0 11659200 0   566  575 1273  0  0 99  0
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  0  0 3658344  0 11677200 0   649  606 1301  0  0 99  0
 0  0  0 3658548  0 11697600 0   617  624 1345  0  0 99  0
 0  0  0 3659204  0 11716000 0   550  576 1223  1  1 99  0
 0  0  0 3659944  0 11734400 0   620  583 1272  0  0 99  0
 0  0  0 3660548  0 11754000 0   605  611 1338  0  0 99  0
 0  0  0 3661236  0 11773200 0   582  569 1275  0  0 99  0
 0  0  0 3662420  0 11788800 0   590  571 1157  0  0 99  0
 0  0  0 3664324  0 11806800 0   566  553 1222  1  1 99  0
 0  0  0 3665240  0 11816800 0   401  574  862  0  0 99  0
 0  0  0 3666984  0 11828000 0   454  574  958  1  1 99  0
 0  0  0 3668664  0 11840000 0   396  559  946  0  0 99  0
 0  0  0 3670628  0 11849600 0   296  495  784  0  0 99  0
 0  0  0 3671316  0 118496  

cramfs in big endian

2007-11-06 Thread Andi Drebes
Hi!
I'm currently trying to enable the cramfs to mount filesystems with a 
different endianness. All I have is an intel compatible machine that produces 
cramfs images in little endian format. I found a modified version of mkcramfs 
that is able to produce images in big endian mode. Unfortunately it seems to 
use a different compression algorithm (LZO) than the standard implementation 
(zlib).

I would be very glad if somebody could send me an image in big endian mode 
using the zlib deflate compression algorithm. I have uploaded a bz2 archive 
containing some testfiles. You can find it at [1].


Regards,
Andi

[1] http://drebesium.org/~hackbert/cramfs-testfiles.tar.bz2


Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser
On 11/6/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:
>
> > I'm struggling to understand what possible changed in XFS or writeback that
> > would lead to stalls like this, esp. as you appear to be removing files when
> > the stalls occur.
>
> Just a crazy idea,..
>
> Could there be a set_page_dirty() that doesn't have
> balance_dirty_pages() call near? For example modifying meta data in
> unlink?
>
> Such a situation could lead to an excess of dirty pages and the next
> call to balance_dirty_pages() would appear to stall, as it would
> desperately try to get below the limit again.

Only if accounting of the dirty pages is also broken.
In the unmerge testcase I see most of the time only <200kb of dirty
data in /proc/meminfo.

The system has 4Gb of RAM so I'm not sure if it should ever be valid
to stall even the emerge/install testcase.

Torsten

Now building a kernel with the skipped-pages-accounting-patch reverted...


Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree

2007-11-06 Thread Andrew Morton
> On Tue, 6 Nov 2007 12:30:38 +0100 Jörn Engel <[EMAIL PROTECTED]> wrote:
> On Tue, 6 November 2007 10:11:49 +0100, Jan Blunck wrote:
> > On Mon, Nov 05, Jörn Engel wrote:
> > 
> > > This patch changes some 400 lines, most if not all of which get longer
> > > and more complicated to read.  23 get sufficiently longer to require an
> > > additional linebreak.  I can't remember complexity being invited into
> > > the kernel without good reasoning, yet the patch description is
> > > surprisingly low on reasoning:
> > > > Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
> > 
> > I don't measure complexity by lines of code or length of lines. Maybe I was
> > not verbose enough in the description, fair.
> 
> If you have a better metric, please share it.  In the paragraph you
> deleted I explicitly asked for _any_ metric that shows favorable
> numbers.  Lacking numbers, we could only argue about our respective
> personal taste.
> 
> > This is a cleanup series. In almost no case is there a reason why someone
> > would want to use a dentry by itself. This series reflects that fact in
> > nameidata, where there is absolutely no reason at all.
> 
> 400+ lines changed in this patch, some 10 in a followup patch that
> combines dentry/vfsmount assignments into a single path assignment.  If
> your argument above was valid, I would expect more simplifications and
> fewer complications.  Call me a sceptic until further patches show up to
> support your point.
> 
> > It enforced the correct
> > order of getting/releasing refcount on <dentry, vfsmount> pairs.
> 
> This argument I buy.
> 
> > It enables us
> > to do some more cleanups wrt lookup (which are coming later).
> 
> Please send those patches.  I invite cleanups that do clean things up
> and won't argue against then. ;)
> 
> > For stacking
> > support in VFS it is essential to have the <dentry, vfsmount> pair in every
> > place where you want to traverse the stack.
> 
> True, but unrelated to this patch.
> 
> > > If churn is the only effect of this, please consider it NAKed again.
> > 
> > I wonder why you didn't speak up when this series was posted to LKML. It was
> > at least posted three times before.
> 
> I did speak up.  Once.  If you missed that thread, please forgive me
> missing those in which the same patch I disapproved of were resent
> without me on Cc.
> 
> I'm not categorically against this struct path business.  It does have
> some advantages at first glance.  But the patch we're arguing about
> clearly makes code more complicated and harder to read.  We should have
> more than superficial benefits if we decide to pay such a cost.

It sounds like we at least need a better overall changelog, please.


Re: writeout stalls in current -git

2007-11-06 Thread Peter Zijlstra
On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:

> I'm struggling to understand what possible changed in XFS or writeback that
> would lead to stalls like this, esp. as you appear to be removing files when
> the stalls occur. 

Just a crazy idea,..

Could there be a set_page_dirty() that doesn't have
balance_dirty_pages() call near? For example modifying meta data in
unlink?

Such a situation could lead to an excess of dirty pages and the next
call to balance_dirty_pages() would appear to stall, as it would
desperately try to get below the limit again.
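(Editor's aside, for readers following the thread: the usual pairing Peter
is referring to looks roughly like the sketch below -- this is the generic
buffered-write pattern, not code from this thread. A path that only does
the first step, e.g. while dirtying metadata during unlink, would escape
throttling until somebody else calls balance_dirty_pages().)

	set_page_dirty(page);
	/* ... possibly many more pages dirtied ... */
	balance_dirty_pages_ratelimited(mapping);	/* may block until back under the dirty limits */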



Re: qstr abuse in git-cifs

2007-11-06 Thread Steve French
> I suspect that bad things are happening in there
I doubt that it is too bad since various filesystems (including JFS
and CIFS) have always written over qstr->name in these dentries in
order to better handle case insensitive compares.   Other than trying
to remove the compiler warning recently (mainline throws the same
warning), I didn't see what changed other than now displaying a
warning.   I agree that it looks strange, but I also don't see a
trivial way to change it.  Any ideas on how to prevent the case of an
existing negative dentry from taking precedence? Creating more
dentries for these duplicates would seem like a worse idea.

> fs/cifs/dir.c: In function 'cifs_ci_compare':
> fs/cifs/dir.c:596: warning: passing argument 1 of '__constant_memcpy' discards qualifiers from pointer target type
> fs/cifs/dir.c:596: warning: passing argument 1 of '__memcpy' discards qualifiers from pointer target type

-- 
Thanks,

Steve


Re: Problem with accessing namespace_sem from LSM.

2007-11-06 Thread Tetsuo Handa
Hello.

Christoph Hellwig wrote:
> Any code except VFS internals has no business using it at all and doesn't
> do that in mainline either.  I'd start looking for design bugs in whatever
> code you have using it first.
Isn't security_inode_create() a part of VFS internals?
I think security_inode_create() is a part of VFS internals
because it is called from vfs_create().

Regards.



[PATCH] IGET: Use ino_t consistently for vxfs_iget() and its declaration

2007-11-06 Thread David Howells
Use ino_t consistently for vxfs_iget() and its declaration, rather than using
ino_t for one and unsigned long for the other.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/freevxfs/vxfs_extern.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/freevxfs/vxfs_extern.h b/fs/freevxfs/vxfs_extern.h
index f307694..2b46064 100644
--- a/fs/freevxfs/vxfs_extern.h
+++ b/fs/freevxfs/vxfs_extern.h
@@ -58,7 +58,7 @@ extern struct inode * vxfs_get_fake_inode(struct super_block *,
 extern void vxfs_put_fake_inode(struct inode *);
 extern struct vxfs_inode_info *vxfs_blkiget(struct super_block *, u_long, ino_t);
 extern struct vxfs_inode_info *vxfs_stiget(struct super_block *, ino_t);
-extern struct inode *  vxfs_iget(struct super_block *, unsigned long);
+extern struct inode *  vxfs_iget(struct super_block *, ino_t);
 extern voidvxfs_clear_inode(struct inode *);
 
 /* vxfs_lookup.c */



Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree

2007-11-06 Thread hooanon05

Hello Jan Blunck,

Jan Blunck:
> I started sending out the patches in multiple chunks because nobody reviewed the
> union mount series except for coding style violations. So this is the prework
> for the changes that come with my union mount series. So they are related
> but not a part of the union mount patch series. It seems that people tend to
> prefer patch series with small changes over one big fat
> series.

I've read your union mount code, which was posted at the end of last
July. Here is a comment that I remember now.

Whiteouts in your code can cause serious memory pressure, since they are
kept in the dcache. I know there is only one inode for whiteouts and it is
shared, but the dentries for whiteouts are not. They are created for each
name and stay resident in the dcache.
I am afraid this can easily become a problem when you create and unlink a
temporary file many times. Generally such filenames are unique.

Regarding struct path in nameidata, I have no objection
basically. But I think it would be better to create macros for backward
compatibility, as struct file did.
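(For reference, the struct file precedent being pointed to is the pair of
field-name macros added when struct file gained an embedded struct path --
quoted from memory of include/linux/fs.h of that era, so treat as
approximate:)

	#define f_dentry	f_path.dentry
	#define f_vfsmnt	f_path.mnt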


Junjiro Okajima


Re: Problem with accessing namespace_sem from LSM.

2007-11-06 Thread Christoph Hellwig
On Tue, Nov 06, 2007 at 01:00:41PM +0900, Tetsuo Handa wrote:
> Hello.
> 
> I found that accessing namespace_sem from security_inode_create()
> causes lockdep warning when compiled with CONFIG_PROVE_LOCKING=y .

Any code except VFS internals has no business using it at all and doesn't
do that in mainline either.  I'd start looking for design bugs in whatever
code you have using it first.



Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree

2007-11-06 Thread Jörn Engel
On Tue, 6 November 2007 13:59:03 +0100, Jan Blunck wrote:
> 
> Coming back to the mental stuff: the savings of the first bunch of patches
> that already hit -mm:
> 
> Textsize without patches: 0x20e572
> Textsize with patches:0x20e042
> --
>  0x530 = 1328 bytes

Ok, that is more substantial than mental masturbation.  Thank you for
the numbers.

All my criticism becomes void with this.

Jörn

-- 
Victory in war is not repetitious.
-- Sun Tzu


Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree

2007-11-06 Thread Jan Blunck
On Tue, Nov 06, Jörn Engel wrote:

> > This is a cleanup series. In mostly no case there is a reason why someone
> > would want to use a dentry for itself. This series reflects that fact in
> > nameidata where there is absolutly no reason at all.
> 
> 400+ lines changed in this patch, some 10 in a followup patch that
> combines dentry/vfsmount assignments into a single path assignment.  If
> your argument above was valid, I would expect more simplifications and
> fewer complications.  Call me a sceptic until further patches show up to
> support your point.

These are the patches currently in -mm related to the struct path cleanup (a short illustrative sketch of the resulting call-site pattern follows the list):

move-struct-path-into-its-own-header.patch
embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch
introduce-path_put.patch
use-path_put-in-a-few-places-instead-of-mntdput.patch
introduce-path_get.patch
use-struct-path-in-fs_struct.patch
make-set_fs_rootpwd-take-a-struct-path.patch
introduce-path_get-unionfs.patch
embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs.patch
introduce-path_put-unionfs.patch
one-less-parameter-to-__d_path.patch
d_path-kerneldoc-cleanup.patch
d_path-use-struct-path-in-struct-avc_audit_data.patch
d_path-make-proc_get_link-use-a-struct-path-argument.patch
d_path-make-get_dcookie-use-a-struct-path-argument.patch
use-struct-path-in-struct-svc_export.patch
use-struct-path-in-struct-svc_expkey.patch
d_path-make-seq_path-use-a-struct-path-argument.patch
d_path-make-d_path-use-a-struct-path.patch
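As mentioned above, a small illustration of the call-site pattern that
introduce-path_put.patch and friends move the tree towards (an editor's
sketch, not an actual hunk from the series):

	/* before: callers juggle both objects and must get the release order right */
	dput(nd->dentry);
	mntput(nd->mnt);

	/* after: one embedded struct path; path_put() encapsulates the correct order */
	path_put(&nd->path);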

> > It enables us
> > to do some more cleanups wrt lookup (which are coming later).
> 
> Please send those patches.  I invite cleanups that do clean things up
> and won't argue against then. ;)

I'll send them in a later series.

> > For stacking
> > support in VFS it is essential to have the  pair in every
> > place where you want to traverse the stack.
> 
> True, but unrelated to this patch.

I started sending out the patches in multiple chunks because nobody reviewed the
union mount series except for coding style violations. So this is the prework
for the changes that come with my union mount series. So they are related
but not a part of the union mount patch series. It seems that people tend to
prefer patch series with small changes over one big fat
series.

> > I wonder why you didn't speak up when this series was posted to LKML. It was
> > at least posted three times before.
> 
> I did speak up.  Once.  If you missed that thread, please forgive me
> missing those in which the same patch I disapproved of were resent
> without me on Cc.

Sorry for missing your feedback but now I found your mail ("mental
masturbation that complicates the source"). I guess this is what happens when
multiple people start posting the same patch series.

Coming back to the mental stuff: the savings of the first bunch of patches
that already hit -mm:

Textsize without patches: 0x20e572
Textsize with patches:0x20e042
--
 0x530 = 1328 bytes

Cheers,
 Jan


[PATCH] smbfs: fix calculation of kernel_recvmsg size parameter in smb_receive

2007-11-06 Thread Jeff Layton
smb_receive calls kernel_recvmsg with a size that's the minimum of the
amount of buffer space in the kvec passed in or req->rq_rlen (which
represents the length of the response). This does not take into account
any response data that was read in an earlier pass through smb_receive.

If the first pass through smb_receive receives some but not all of the
response, then the next pass can call kernel_recvmsg with a size field
that's too big. kernel_recvmsg can overrun into the next response,
throwing off the alignment and making it unrecognizable.

This causes messages like this to pop up in the ring buffer:

smb_get_length: Invalid NBT packet, code=69

as well as other errors indicating that the response is unrecognizable.
Typically this is seen on a smbfs mount under heavy I/O.

This patch changes the code to use (req->rq_rlen - req->rq_bytes_recvd)
instead instead of just req->rq_rlen, since that should represent the
amount of unread data in the response.
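To put illustrative numbers on it (an example, not taken from a real
capture): with an 8k receive buffer, a 4k response, and 1k of it already
received, the old code could pass min(7k of remaining buffer, 4k) = 4k to
kernel_recvmsg and pull in 1k of the following response; the new code
passes min(7k, 4k - 1k) = 3k and stops exactly at the response boundary.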

Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>
Acked-by: Guenter Kukkukk <[EMAIL PROTECTED]>
---
 fs/smbfs/sock.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/smbfs/sock.c b/fs/smbfs/sock.c
index e48bd82..e37fe4d 100644
--- a/fs/smbfs/sock.c
+++ b/fs/smbfs/sock.c
@@ -329,9 +329,8 @@ smb_receive(struct smb_sb_info *server, struct smb_request *req)
msg.msg_control = NULL;
 
/* Dont repeat bytes and count available bufferspace */
-   rlen = smb_move_iov(&p, &num, iov, req->rq_bytes_recvd);
-   if (req->rq_rlen < rlen)
-   rlen = req->rq_rlen;
+   rlen = min_t(int, smb_move_iov(&p, &num, iov, req->rq_bytes_recvd),
+   (req->rq_rlen - req->rq_bytes_recvd));
 
result = kernel_recvmsg(sock, &msg, p, num, rlen, flags);
 
-- 
1.5.2.1



Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree

2007-11-06 Thread Jörn Engel
On Tue, 6 November 2007 10:11:49 +0100, Jan Blunck wrote:
> On Mon, Nov 05, Jörn Engel wrote:
> 
> > This patch changes some 400 lines, most if not all of which get longer
> > and more complicated to read.  23 get sufficiently longer to require an
> > additional linebreak.  I can't remember complexity being invited into
> > the kernel without good reasoning, yet the patch description is
> > surprisingly low on reasoning:
> > > Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
> 
> I don't measure complexity by lines of code or length of lines. Maybe I was
> not verbose enough in the description, fair.

If you have a better metric, please share it.  In the paragraph you
deleted I explicitly asked for _any_ metric that shows favorable
numbers.  Lacking numbers, we could only argue about our respective
personal taste.

> This is a cleanup series. In almost no case is there a reason why someone
> would want to use a dentry by itself. This series reflects that fact in
> nameidata, where there is absolutely no reason at all.

400+ lines changed in this patch, some 10 in a followup patch that
combines dentry/vfsmount assignments into a single path assignment.  If
your argument above was valid, I would expect more simplifications and
fewer complications.  Call me a sceptic until further patches show up to
support your point.

> It enforced the correct
> order of getting/releasing refcount on <dentry, vfsmount> pairs.

This argument I buy.

> It enables us
> to do some more cleanups wrt lookup (which are coming later).

Please send those patches.  I invite cleanups that do clean things up
and won't argue against then. ;)

> For stacking
> support in VFS it is essential to have the <dentry, vfsmount> pair in every
> place where you want to traverse the stack.

True, but unrelated to this patch.

> > If churn is the only effect of this, please consider it NAKed again.
> 
> I wonder why you didn't speak up when this series was posted to LKML. It was
> at least posted three times before.

I did speak up.  Once.  If you missed that thread, please forgive me
missing those in which the same patch I disapproved of were resent
without me on Cc.

I'm not categorically against this struct path business.  It does have
some advantages at first glance.  But the patch we're arguing about
clearly makes code more complicated and harder to read.  We should have
more than superficial benefits if we decide to pay such a cost.

> Did I break your COW link patches? ;)

Nope.  No bovine maladies involved.

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra


Re: Massive slowdown when re-querying large nfs dir

2007-11-06 Thread Al Boldi
Al Boldi wrote:
> There is a massive (3-18x) slowdown when re-querying a large nfs dir (2k+
> entries) using a simple ls -l.
>
> On 2.6.23 client and server running userland rpc.nfs.V2:
> first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> more tries: time -p ls -l <2k+ entry dir>  in ~8sec
>
> first  try: time -p ls -l <5k+ entry dir>  in ~9sec
> more tries: time -p ls -l <5k+ entry dir>  in ~180sec
>
> On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> more tries: time -p ls -l <2k+ entry dir>  in ~7sec
>
> first  try: time -p ls -l <5k+ entry dir>  in ~8sec
> more tries: time -p ls -l <5k+ entry dir>  in ~43sec
>
> Remounting the nfs-dir on the client resets the problem.
>
> Any ideas?

Ok, I played some more with this, and it turns out that nfsV3 is a lot 
faster.  But, this does not explain why the 2.4.31 kernel is still over 
4-times faster than 2.6.23.

Can anybody explain what's going on?


Thanks!

--
Al



Re: [PATCH 16/31] IGET: Stop FreeVXFS from using iget() and read_inode() [try #5]

2007-11-06 Thread David Howells
David Howells <[EMAIL PROTECTED]> wrote:

> > Not completely trivial - the ino_t -> unsigned long conversion needs to be
> > propagated fairly widely in there.
> 
> I'll just make it unsigned long.  That's what iget() took as the inode number
> parameter type, so it can't behave any differently as a first guess.

On the other hand, __vxfs_iget() takes ino_t, so that's the right thing to do
in this case.

David


Re: [PATCH 16/31] IGET: Stop FreeVXFS from using iget() and read_inode() [try #5]

2007-11-06 Thread David Howells
Andrew Morton <[EMAIL PROTECTED]> wrote:

> Not completely trivial - the ino_t -> unsigned long conversion needs to be
> propagated fairly widely in there.

I'll just make it unsigned long.  That's what iget() took as the inode number
parameter type, so it can't behave any differently as a first guess.

David


Re: [PATCH 16/31] IGET: Stop FreeVXFS from using iget() and read_inode() [try #5]

2007-11-06 Thread Andrew Morton
On Thu, 25 Oct 2007 17:35:14 +0100 David Howells <[EMAIL PROTECTED]> wrote:

> Stop the FreeVXFS filesystem from using iget() and read_inode().  Replace
> vxfs_read_inode() with vxfs_iget(), and call that instead of iget().
> vxfs_iget() then uses iget_locked() directly and returns a proper error code
> instead of an inode in the event of an error.
> 
> vxfs_fill_super() returns any error incurred when getting the root inode
> instead of EINVAL.

alpha:

fs/freevxfs/vxfs_inode.c:298: error: conflicting types for 'vxfs_iget'
fs/freevxfs/vxfs_extern.h:61: error: previous declaration of 'vxfs_iget' was here

Not completely trivial - the ino_t -> unsigned long conversion needs to be
propagated fairly widely in there.

(when should we be using ino_t and when should we be using unsigned long,
anyway?)


Re: writeout stalls in current -git

2007-11-06 Thread Peter Zijlstra
On Mon, 2007-11-05 at 15:57 -0800, Andrew Morton wrote:

> > > Subject: mm: speed up writeback ramp-up on clean systems
> > > 
> > > We allow violation of bdi limits if there is a lot of room on the
> > > system. Once we hit half the total limit we start enforcing bdi limits
> > > and bdi ramp-up should happen. Doing it this way avoids many small
> > > writeouts on an otherwise idle system and should also speed up the
> > > ramp-up.
> 
> Given the problems we're having in there I'm a bit reluctant to go tossing
> hastily put together and inadequately tested stuff onto the fire.  And
> that's what this patch looks like to me.

Not really hastily, I think it was written before the stuff hit
mainline. Inadequately tested, perhaps; it's been in my and probably Wu's
kernels for a while. Granted that's not a lot of testing in the face of
those who have problems atm.

> Wanna convince me otherwise?

I'm perfectly happy with this patch earning its credits in -mm for a
while and maybe going in around -rc4 or something like that (hoping that
by then we've fixed these nagging issues).

Another patch I did come up with yesterday - not driven by any problems
in that area - could perhaps join this one on that path:

---
Subject: mm: bdi: tweak task dirty penalty

Penalizing heavy dirtiers with 1/8-th the total dirty limit might be rather
excessive on large memory machines. Use sqrt to scale it sub-linearly.
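(An illustrative calculation, not from the patch: with a dirty limit of
100,000 pages, the old dirty/8 term penalizes a task that does all the
dirtying by up to 12,500 pages, whereas 8*sqrt(100000) caps the penalty at
roughly 2,500 pages and grows only with the square root of the limit.)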

Update the comment while we're there.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6-2/mm/page-writeback.c
===
--- linux-2.6-2.orig/mm/page-writeback.c
+++ linux-2.6-2/mm/page-writeback.c
@@ -213,17 +213,21 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * Task specific dirty limit:
  *
- * task specific dirty limit:
+ *   dirty -= 8 * sqrt(dirty) * p_{t}
  *
- *   dirty -= (dirty/8) * p_{t}
+ * Penalize tasks that dirty a lot of pages by lowering their dirty limit. This
+ * avoids infrequent dirtiers from getting stuck in this other guys dirty
+ * pages.
+ *
+ * Use a sub-linear function to scale the penalty, we only need a little room.
  */
 void task_dirty_limit(struct task_struct *tsk, long *pdirty)
 {
long numerator, denominator;
long dirty = *pdirty;
-   u64 inv = dirty >> 3;
+   u64 inv = 8*int_sqrt(dirty);
 
task_dirties_fraction(tsk, &numerator, &denominator);
inv *= numerator;




Re: writeout stalls in current -git

2007-11-06 Thread Fengguang Wu
On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [  547.20] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858 > global 12829 72 0 wc __ tw 0 sk 0
> [  550.48] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834 > global 12017 62 0 wc __ tw 0 sk 0
> [  552.71] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810 > global 11133 83 0 wc __ tw 0 sk 0
> [  558.66] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786 > global 10470 33 0 wc _M tw 0 sk 0
4s
> [  562.75] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762 > global 10555 69 0 wc _M tw 0 sk 0
3s
> [  565.15] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738 > global 9562 498 0 wc _M tw -2 sk 0
4s
> [  569.49] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712 > global 8960 2 0 wc _M tw 0 sk 0
3s
> [  572.91] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688 > global 8088 205 0 wc _M tw -13 sk 0
2s
> [  574.61] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651 > global 7114 188 0 wc _M tw -1 sk 0
10s
> [  584.27] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626 > global 14544 0 0 wc _M tw -1 sk 0
9s
> [  593.05] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601 > global 24583 736 0 wc _M tw -1 sk 0
7s
> [  600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576 > global 27004 6 0 wc _M tw 587 sk 0
> [  600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139 > global 27004 6 0 wc __ tw 1014 sk 0

The above messages and the 'D'-state pdflush trace below indicate that a
single writeback_inodes(4MB) call can take a long time (up to 10s!) to complete.

Let's try reverting the patch below with `patch -R`. It looks like
the most relevant change - if it's not a low-level bug.

> [note] first stall, the output from emerge stops, so it seems it can
> not start processing the next file until the stall ends
> [  630.00] SysRq : Emergency Sync
> [  630.12] Emergency Sync complete
> [  632.85] SysRq : Show Blocked State
> [  632.85]   taskPC stack   pid father
> [  632.85] pdflush   D 81000f091788 0   285  2
> [  632.85]  810005d4da80 0046 0800
> 0071
> [  632.85]  81000fd52400 8022d61c 80819b00
> 80819b00
> [  632.85]  80815f40 80819b00 810100316f98
> 
> [  632.85] Call Trace:
> [  632.85]  [] task_rq_lock+0x4c/0x90
> [  632.85]  [] __wake_up_common+0x5a/0x90
> [  632.85]  [] __down+0xa7/0x11e
> [  632.85]  [] default_wake_function+0x0/0x10
> [  632.85]  [] __down_failed+0x35/0x3a
> [  632.85]  [] xfs_buf_lock+0x3e/0x40
> [  632.85]  [] _xfs_buf_find+0x13e/0x240
> [  632.85]  [] xfs_buf_get_flags+0x6f/0x190
> [  632.85]  [] xfs_buf_read_flags+0x12/0xa0
> [  632.85]  [] xfs_trans_read_buf+0x64/0x340
> [  632.85]  [] xfs_itobp+0x81/0x1e0
> [  632.85]  [] write_cache_pages+0x123/0x330
> [  632.85]  [] xfs_iflush+0xfe/0x520
> [  632.85]  [] __down_read_trylock+0x42/0x60
> [  632.85]  [] xfs_inode_flush+0x179/0x1b0
> [  632.85]  [] xfs_fs_write_inode+0x2f/0x90
> [  632.85]  [] __writeback_single_inode+0x2ac/0x380
> [  632.85]  [] dm_table_any_congested+0x2e/0x80
> [  632.85]  [] generic_sync_sb_inodes+0x20d/0x330
> [  632.85]  [] writeback_inodes+0xa2/0xe0
> [  632.85]  [] wb_kupdate+0xa6/0x140
> [  632.85]  [] pdflush+0x0/0x1e0
> [  632.85]  [] pdflush+0x110/0x1e0
> [  632.85]  [] wb_kupdate+0x0/0x140
> [  632.85]  [] kthread+0x4b/0x80
> [  632.85]  [] child_rip+0xa/0x12
> [  632.85]  [] kthread+0x0/0x80
> [  632.85]  [] child_rip+0x0/0x12
> [  632.85]
> [  632.85] emergeD  0  6220   6129
> [  632.85]  810103ced9f8 0086 
> 0071
> [  632.85]  81000fd52cf8  80819b00
> 80819b00
> [  632.85]  80815f40 80819b00 810103ced9b8
> 810103ced9a8
> [  632.85] Call Trace:
> [  632.85]  [] __down+0xa7/0x11e
> [  632.85]  [] default_wake_function+0x0/0x10
> [  632.85]  [] __down_failed+0x35/0x3a
> [  632.85]  [] xfs_buf_lock+0x3e/0x40
> [  632.85]  [] _xfs_buf_find+0x13e/0x240
> [  632.85]  [] xfs_buf_get_flags+0x6f/0x190
> [  632.85]  [] xfs_buf_read_flags+0x12/0xa0
> [  632.85]  [] xfs_trans_read_buf+0x64/0x340
> [  632.85]  [] xfs_itobp+0x81/0x1e0
> [  632.85]  [] xfs_buf_rele+0x2e/0xd0
> [  632.85]  [] xfs_iflush+0xfe/0x520
> [  632.85]  [] __down_read_trylock+0x42/0x60
> [  632.85]  [] xfs_inode_item_push+0x12/0x20
> [  632.85]  [] xfs_trans_push_ail+0x267/0x2b0
> [  632.85]  [] xfs_log_reserve+0x72/0x120
> [  632.85]  [] xfs_trans_reserve+0xa8/0x210
> [  63

Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree

2007-11-06 Thread Jan Blunck
On Mon, Nov 05, Jörn Engel wrote:

> > Subject: Embed a struct path into struct nameidata instead of 
> > nd->{dentry,mnt}
> > From: Jan Blunck <[EMAIL PROTECTED]>
> > 
> > Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
> > 
> > Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> > Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
> > Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
> > Cc: Al Viro <[EMAIL PROTECTED]>
> > CC: 
> > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> Frowned-upon-by: Joern Engel <[EMAIL PROTECTED]>
> 
> This patch changes some 400 lines, most if not all of which get longer
> and more complicated to read.  23 get sufficiently longer to require an
> additional linebreak.  I can't remember complexity being invited into
> the kernel without good reasoning, yet the patch description is
> surprisingly low on reasoning:
> > Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

I don't measure complexity by lines of code or length of lines. Maybe I was
not verbose enough in the description, fair.

This is a cleanup series. In almost no case is there a reason why someone
would want to use a dentry by itself; this series reflects that fact in
nameidata, where there is absolutely no reason at all. It enforces the
correct order of getting/releasing the refcounts on <dentry, vfsmount>
pairs. It also enables us to do some more cleanups wrt lookup (which are
coming later). For stacking support in the VFS it is essential to have the
<dentry, vfsmount> pair in every place where you want to traverse the stack.
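
For readers who have not followed the series, the shape of the change is
roughly this (a sketch; most nameidata members are elided and the exact
layout is whatever the series actually adds):

/* A <dentry, vfsmount> pair, carried around as one unit. */
struct path {
        struct vfsmount *mnt;
        struct dentry   *dentry;
};

struct nameidata {
        struct path     path;   /* replaces the separate dentry and mnt members */
        struct qstr     last;
        unsigned int    flags;
        /* ... remaining members unchanged ... */
};

so every nd->dentry / nd->mnt in the tree becomes nd->path.dentry /
nd->path.mnt, which is the bulk of the churn being objected to.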

> If churn is the only effect of this, please considere it NAKed again.

I wonder why you didn't speak up when this series was posted to LKML. It was
posted at least three times before.

Did I break your COW link patches? ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing

2007-11-06 Thread Benny Halevy
On Nov. 06, 2007, 7:06 +0200, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Sat, 03 Nov 2007 07:09:25 -0400 Steve Dickson <[EMAIL PROTECTED]> wrote:
> 
>> The following patch stops NFS sillyname renames and umounts from racing.
> 
> (appropriate cc's added)
> 
>> I have a test script does the following:
>>  1) start nfs server
>>   2) mount loopback
>>   3) open file in background
>>   4) remove file
>>   5) stop nfs server
>>   6) kill -9 process which has file open
>>   7) restart nfs server
>>   8) umount looback mount.
>>
>> After umount I got the "VFS: Busy inodes after unmount" message
>> because the processing of the rename has not finished.
>>
>> Below is a patch that the uses the new silly_count mechanism to
>> synchronize sillyname processing and umounts. The patch introduces a
>> nfs_put_super() routine that waits until the nfsi->silly_count count
>> is zero.
>>
>> A side-effect of finding and waiting for all the inodes to
>> finish the sillyname processing is that I need to traverse
>> the sb->s_inodes list in the super block. To do that
>> safely the inode_lock spin lock has to be held, so for
>> modules to be able to "see" that lock I needed to
>> EXPORT_SYMBOL_GPL() it.
>>
>> Any objections to exporting the inode_lock spin lock?
>> If so, how should modules _safely_ access the s_inode list?
>>
>> steved.
>>
>>
>> Author: Steve Dickson <[EMAIL PROTECTED]>
>> Date:   Wed Oct 31 12:19:26 2007 -0400
>>
>>  Close an unlink/sillyname-rename vs. umount race by adding a
>>  nfs_put_super routine that will run through all the inodes
>>  currently on the super block, waiting for those that are
>>  in the middle of a sillyname rename or removal.
>>
>>  This patch stops the infamous "VFS: Busy inodes after unmount... "
>>  warning during umounts.
>>
>>  Signed-off-by: Steve Dickson <[EMAIL PROTECTED]>
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index ed35383..da9034a 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
>>* the i_state of an inode while it is in use..
>>*/
>>   DEFINE_SPINLOCK(inode_lock);
>> +EXPORT_SYMBOL_GPL(inode_lock);
> 
> That's going to make hch unhappy.
> 
> Your email client is performing space-stuffing.
> See http://mbligh.org/linuxdocs/Email/Clients/Thunderbird
> 
>>   static struct file_system_type nfs_fs_type = {
>>  .owner  = THIS_MODULE,
>> @@ -223,6 +225,7 @@ static const struct super_operations nfs_sops = {
>>  .alloc_inode    = nfs_alloc_inode,
>>  .destroy_inode  = nfs_destroy_inode,
>>  .write_inode    = nfs_write_inode,
>> +.put_super      = nfs_put_super,
>>  .statfs         = nfs_statfs,
>>  .clear_inode    = nfs_clear_inode,
>>  .umount_begin   = nfs_umount_begin,
>> @@ -1767,6 +1770,30 @@ static void nfs4_kill_super(struct super_block *sb)
>>  nfs_free_server(server);
>>   }
>>
>> +void nfs_put_super(struct super_block *sb)
> 
> This was (correctly) declared to be static.  We should define it that way
> too (I didn't know you could do this, actually).
> 
>> +{
>> +        struct inode *inode;
>> +        struct nfs_inode *nfsi;
>> +        /*
>> +         * Make sure there are no outstanding renames
>> +         */
>> +relock:
>> +        spin_lock(&inode_lock);
>> +        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>> +                nfsi = NFS_I(inode);
>> +                if (atomic_read(&nfsi->silly_count) > 0) {
>> +                        /* Keep this inode around during the wait */
>> +                        atomic_inc(&inode->i_count);
>> +                        spin_unlock(&inode_lock);
>> +                        wait_event(nfsi->waitqueue,
>> +                                        atomic_read(&nfsi->silly_count) == 1);
>> +                        iput(inode);
>> +                        goto relock;
>> +                }
>> +        }
>> +        spin_unlock(&inode_lock);
>> +}
> 
> That's an O(n^2) search.  If it is at all possible to hit a catastrophic
> slowdown in here, you can bet that someone out there will indeed hit it in
> real life.
> 
> I'm too lazy to look, but we might need to check things like I_FREEING
> and I_CLEAR before taking a ref on this inode.

It'd be very nice if the silly renamed inodes (with silly_count > 1) were moved
to a different list in the first pass, under the inode_lock, and then waited on
until silly_count <= 1 in a second pass only on the filtered list.  This will
make the scan linear in the number of inodes rather than quadratic.
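
Something along these lines (a sketch only; it assumes a spare list_head in
nfs_inode, here called silly_wait, and still omits the I_FREEING/I_CLEAR
checks Andrew mentions):

static void nfs_put_super(struct super_block *sb)
{
        struct inode *inode;
        struct nfs_inode *nfsi;
        LIST_HEAD(pending);

        /*
         * Pass 1: under inode_lock, pin every inode that still has
         * sillyrename work outstanding and remember it on a private list.
         */
        spin_lock(&inode_lock);
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                nfsi = NFS_I(inode);
                if (atomic_read(&nfsi->silly_count) > 1) {
                        atomic_inc(&inode->i_count);
                        list_add_tail(&nfsi->silly_wait, &pending);
                }
        }
        spin_unlock(&inode_lock);

        /*
         * Pass 2: wait on each pinned inode. No rescanning of s_inodes,
         * so the scan stays linear in the number of inodes.
         */
        while (!list_empty(&pending)) {
                nfsi = list_entry(pending.next, struct nfs_inode, silly_wait);
                list_del_init(&nfsi->silly_wait);
                wait_event(nfsi->waitqueue,
                           atomic_read(&nfsi->silly_count) <= 1);
                iput(&nfsi->vfs_inode);
        }
}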

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing

2007-11-06 Thread Alexander Viro
On Tue, Nov 06, 2007 at 10:24:50AM +0200, Benny Halevy wrote:

> It'd be very nice if the silly renamed inodes (with silly_count > 1) were
> moved to a different list in the first pass, under the inode_lock, and then
> waited on until silly_count <= 1 in a second pass only on the filtered list.
> This will make the scan linear in the number of inodes rather than quadratic.

It's absolutely pointless, starting with any kind of searching for inodes,
etc.  If you want fs shutdown _not_ to happen until async activity of
that kind is over, don't reinvent the sodding wheels, just tell VFS that
you are holding an active reference to superblock.  End of story.
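
For the archive, a minimal sketch of that suggestion (helper names are
hypothetical, the exact s_active bookkeeping would need checking against the
VFS of the target kernel, and where the final deactivate_super() can safely
run is its own question): take the reference when the async sillyrename is
queued, drop it when the RPC completes, and ->kill_sb() simply cannot run
while any sillyrename is in flight.

static void nfs_sb_active(struct super_block *sb)
{
        atomic_inc(&sb->s_active);      /* pin the superblock */
}

static void nfs_sb_deactive(struct super_block *sb)
{
        deactivate_super(sb);           /* drop it; may tear the fs down now */
}

With that, the nfs_put_super() scan above is not needed at all.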
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with accessing namespace_sem from LSM.

2007-11-06 Thread Toshiharu Harada

On 11/6/2007 1:11 PM, Arjan van de Ven wrote:

> On Tue, 06 Nov 2007 13:00:41 +0900
> Tetsuo Handa <[EMAIL PROTECTED]> wrote:
> 
> > Hello.
> > 
> > I found that accessing namespace_sem from security_inode_create()
> > causes lockdep warning when compiled with CONFIG_PROVE_LOCKING=y .
> 
> sounds like you have an AB-BA deadlock...


s/you/AppArmor shipped with OpenSuSE 10.1 and 10.2/ :)

Though I don't think this deadlock will occur very often,
it does occur when it occurs, so it should be dealt with promptly.
By its nature there is no real way to work around this problem.
Passing a vfsmount parameter to the VFS helper functions and
LSM hooks seems like a good choice to me.
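
For illustration only - this is not the current in-tree API - the kind of
signature change being discussed would look roughly like this, with the
caller handing down the vfsmount it already holds:

/*
 * Hypothetical prototype: security_inode_create() with a vfsmount
 * parameter. With the <mnt, dentry> pair in hand, a pathname-based LSM
 * can build the full path directly and never needs to take namespace_sem
 * inside the hook, which is where the reported lockdep splat comes from.
 */
int security_inode_create(struct inode *dir, struct dentry *dentry,
                          struct vfsmount *mnt, int mode);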

Cheers,
Toshiharu Harada

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html