Re: BH_Req question

2001-04-10 Thread Rajagopal Ananthanarayanan

Andrea Arcangeli wrote:
> 
[ ... ]
> 
> BH_Req is never unset until the buffer is destroyed (put back on the freelist).
> BH_Req only says if such a buffer ever did any I/O yet or not. It is basically
> only used to deal with I/O errors in sync_buffers().

Interesting. Thanks for the explanation. Since submit_bh was setting BH_Req,
I was misled into thinking that end_io would unset it ...


> 
> > PS: In case you're wondering why the question: I've got a system with tons of
> > pages with buffers marked BH_Req, so try_to_free_buffers() bails
> > out thinking that the buffer is busy ...
> 
> Either your debugging is wrong or you broke try_to_free_buffers because a
> buffer with BH_Req must still be perfectly freeable.


Okay, I got distracted by BH_Req, which I mistook to be in BUFFER_BUSY_BITS.
There was also BH_Lock set on the buffers, which would qualify for BUFFER_BUSY_BITS ...
so maybe it is a buffer-locking problem somewhere.
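
For reference, the busy test used by try_to_free_buffers() in 2.4-era
fs/buffer.c looks roughly like this (paraphrased, so the exact bits may
vary by version); BH_Req is not among the busy bits, but BH_Lock is:

#define BUFFER_BUSY_BITS	((1<<BH_Dirty) | (1<<BH_Lock) | (1<<BH_Protected))
#define buffer_busy(bh)		(atomic_read(&(bh)->b_count) | \
				 ((bh)->b_state & BUFFER_BUSY_BITS))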

cheers,

ananth.

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



BH_Req question

2001-04-10 Thread Rajagopal Ananthanarayanan


Hi,

It seems BH_Req is set on a buffer_head by submit_bh.
What part of the code unsets this flag during normal
operations? One path seems to be block_flushpage->unmap_buffer
->clear_bit(BH_Req), but IIRC block_flushpage is used only
for truncates. There must be another path to unset BH_Req
under normal memory pressure, or (more unambiguously) on IO completion.

So: in what ways can BH_Req be unset?

Thanks for any input; I've been staring at the code for a long time to no avail ...

cheers,

ananth.

PS: In case you're wondering why the question: I've got a system with tons of
pages with buffers marked BH_Req, so try_to_free_buffers() bails
out thinking that the buffer is busy ...

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



Re: kernel lock contention and scalability

2001-03-06 Thread Rajagopal Ananthanarayanan

Jeff Dike wrote:
[ ... ]
> 
> > Another synchronization method popular with database peeps is "post/
> > wait" for which SGI have a patch available for Linux. I understand
> > that this is relatively "light weight" and might be a better choice
> > for PG.
> 
> URL?
> 
> Jeff


Here it is:

http://oss.sgi.com/projects/postwait/

Check out the download section for a 2.4.0 patch.

cheers,

ananth.

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



ramfs & a_ops->truncatepage()

2001-03-05 Thread Rajagopal Ananthanarayanan


I'm looking at this part of 2.4.2-ac8:

diff -u --new-file --recursive --exclude-from /usr/src/exclude linux-2.4.0/mm/filemap.c linux.ac/mm/filemap.c
--- linux-2.4.0/mm/filemap.c	Wed Jan  3 02:59:45 2001
+++ linux.ac/mm/filemap.c	Thu Jan 11 17:26:55 2001
@@ -206,6 +206,9 @@
if (!page->buffers || block_flushpage(page, 0))
lru_cache_del(page);

+   if (page->mapping->a_ops->truncatepage)
+   page->mapping->a_ops->truncatepage(page);
+
/*
 * We remove the page from the page cache _after_ we have
 * destroyed all buffer-cache references to it. Otherwise some
--

Does anyone know who proposed these changes as part of
the ramfs enhancements? Basically, we have a very similar
operation in XFS, but would like the call to truncatepage
to be _before_ the call to block_flushpage(). As far as ramfs
is concerned, such a change would be a no-op since ramfs doesn't
have page->buffers. Is this correct?

thanks,

ananth.

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



IO Clustering & Delayed allocation

2001-03-01 Thread Rajagopal Ananthanarayanan


Below is a partial patch to provide hooks so
that IO clustering can be performed by the file-system.
As presented, the same code is used to perform delayed allocation.
There has also been a lot of talk about implementing delayed
allocation. To be clear, delayed allocation means not
allocating disk space immediately when the data is written,
say during a sys_write. Instead, disk blocks are assigned
to logical blocks at the time the logical block
needs to be written out.
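
As a rough illustration of the idea (made-up helper names, not code
from the patch below): the write path only flags the buffer as
delayed-allocate, and ->writepage() later does the real allocation:

/* at write time: no disk blocks assigned yet, just flag the buffer */
static void example_commit_write(struct buffer_head *bh)
{
	set_bit(BH_Delay, &bh->b_state);
	__mark_buffer_dirty(bh);
}

/* at writeout time: allocate for real, then start the I/O */
static int example_writepage(struct page *page)
{
	struct buffer_head *bh = page->buffers;

	if (buffer_delay(bh)) {
		/* fs_allocate_blocks() is a hypothetical placeholder;
		 * an extent-based fs would map the whole extent here */
		fs_allocate_blocks(page);
		clear_bit(BH_Delay, &bh->b_state);
	}
	/* ... then start I/O on the now-mapped buffers ... */
	return 0;
}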

In the end, it looks like taking care of delayed-allocation
at writepage() is the best way to go. Following is a patch
where buffer-based routines will employ writepage() to do
such conversions. In addition to allocating blocks for a single
buffer or page, extent-based filesystems would allocate blocks
for all delayed-allocate blocks in the entire extent. These
same hooks can be used for clustering (i.e., pushing out
blocks that are contiguous on disk) as well.

Comments, suggestions welcome.

ananth.

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--

--- ../../linux-2.4.2/linux/fs/buffer.c Fri Feb  9 11:29:44 2001
+++ fs/buffer.c Thu Mar  1 11:02:01 2001
@@ -161,6 +161,40 @@
	atomic_dec(&bh->b_count);
 }
 
+
+#define buffer_delay_busy(bh) \
+   (test_bit(BH_Delay, &bh->b_state) && bh->b_page && PageLocked(bh->b_page))
+   
+static void
+_write_buffer(struct buffer_head *bh)
+{
+   struct page *page = bh->b_page;
+
+   if (!page || TryLockPage(page)) {
+   if (current->need_resched)
+   schedule();
+   return;
+   }
+   /*
+* Raced with someone?
+*/
+   if (page->buffers != bh || !buffer_delay(bh) || !buffer_dirty(bh)) {
+   UnlockPage(page);
+   return;
+   }
+   page->mapping->a_ops->writepage(page);
+}
+
+static inline void
+write_buffer(struct buffer_head *bh)
+{
+   if (!buffer_delay(bh))
+   ll_rw_block(WRITE, 1, &bh);
+   else
+   _write_buffer(bh);
+}
+
+
 /* Call sync_buffers with wait!=0 to ensure that the call does not
  * return until all buffer writes have completed.  Sync() may return
  * before the writes have finished; fsync() may not.
@@ -232,7 +266,7 @@
 
atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
-   ll_rw_block(WRITE, 1, &bh);
+   write_buffer(bh);
atomic_dec(&bh->b_count);
retry = 1;
goto repeat;
@@ -507,6 +541,8 @@
struct bh_free_head *head = &free_list[BUFSIZE_INDEX(bh->b_size)];
struct buffer_head **bhp = &head->list;
 
+   if (test_bit(BH_Delay, &bh->b_state))
+   BUG();
bh->b_state = 0;
 
spin_lock(&head->lock);
@@ -879,7 +915,7 @@
if (buffer_dirty(bh)) {
atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
-   ll_rw_block(WRITE, 1, &bh);
+   write_buffer(bh);
brelse(bh);
spin_lock(&lru_list_lock);
}
@@ -1394,6 +1430,11 @@
 
head = page->buffers;
bh = head;
+
+   if (buffer_delay(bh)) {
+   page->mapping->a_ops->writepage_nounlock(page);
+   return 0; /* just started I/O ... likely didn't complete */
+   }
do {
unsigned int next_off = curr_off + bh->b_size;
next = bh->b_this_page;
@@ -2334,7 +2375,7 @@
if (wait > 1)
__wait_on_buffer(p);
} else if (buffer_dirty(p))
-   ll_rw_block(WRITE, 1, &p);
+   write_buffer(p);
} while (tmp != bh);
 }
 
@@ -2361,6 +2402,11 @@
int index = BUFSIZE_INDEX(bh->b_size);
int loop = 0;
 
+   if (buffer_delay(bh)) {
+   if (wait)
+   page->mapping->a_ops->writepage_nounlock(page);
+   return 0; /* just started I/O ... likely didn't complete */
+   }
 cleaned_buffers_try_again:
spin_lock(&lru_list_lock);
write_lock(&hash_table_lock);
@@ -2562,7 +2608,7 @@
__refile_buffer(bh);
continue;
}
-   if (buffer_locked(bh))
+   if (buffer_locked(bh) || buffer_delay_busy(bh))
continue;
 
if (check_flushtime) {
@@ -2580,7 +2626,7 @@
/* OK, now we are committed to write it out. */
atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
-   ll_rw_block(WRITE, 1, &bh);
+   write_buffer(bh);

Re: [patch][rfc][rft] vm throughput 2.4.2-ac4

2001-03-01 Thread Rajagopal Ananthanarayanan

Rik van Riel wrote:

[ ... ]

> Except that your code throws the random junk at the elevator all
> the time, while my code only bothers the elevator every once in
> a while. This should make it possible for the disk reads to
> continue with less interruptions.
> 

Couldn't agree with you more. The elevator does a decent job
these days, but higher level clustering could do more ...

[ ...]

> Indeed. IMHO we should fix this by putting explicit IO
> clustering in the ->writepage() functions.

Enhancing writepage() to perform clustering is the first step.
In addition you want entities (kupdated, kswapd, et al.)
that currently work only with buffers to invoke writepage()
at appropriate points. Just today I sent out a patch to Al Viro
for comments that does this and also folds in delayed allocation.
If anyone else is interested I can send it out to the list.

ananth.

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



Re: [PATCH] 2.4.1 find_page_nolock fixes

2001-02-28 Thread Rajagopal Ananthanarayanan

Rik van Riel wrote:

> 
> 3. add a __find_page_simple(), which is like __find_page_nolock()
>but only needs 2 arguments and doesn't touch the page ... this
>can be used by IO clustering and other things that really don't
>want to influence page aging, removing the 3rd argument also
>keeps things simple
> 

We've used an exported version of __find_page_simple in XFS to good effect.
Following is a patch against 2.4.2 which is an extension of Rik's patch
to export find_get_page_simple(). Alan, if you want a patch against
the ac series please let me know.
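
To show the intended use, here is a hypothetical caller (the
cluster_this_page() helper is a placeholder): look a page up for write
clustering without promoting it in page aging:

static void example_probe(struct address_space *mapping, unsigned long index)
{
	struct page *page = find_get_page_simple(mapping, index);

	if (page) {
		if (!TryLockPage(page)) {	/* don't block in a scan */
			cluster_this_page(page);	/* hypothetical */
			UnlockPage(page);
		}
		page_cache_release(page);	/* drop the reference we took */
	}
}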

thanks,

ananth.

diff -Naur ../../linux-2.4.2/linux/include/linux/pagemap.h ./include/linux/pagemap.h
--- ../../linux-2.4.2/linux/include/linux/pagemap.h Wed Feb 21 16:10:01 2001
+++ ./include/linux/pagemap.h   Wed Feb 28 14:10:48 2001
@@ -71,6 +71,8 @@
 unsigned long offset, struct page **hash);
 extern struct page * __find_lock_page (struct address_space * mapping,
unsigned long index, struct page **hash);
+extern struct page * find_get_page_simple (struct address_space * mapping,
+   unsigned long index);
 extern void lock_page(struct page *page);
 #define find_lock_page(mapping, index) \
__find_lock_page(mapping, index, page_hash(mapping, index))
diff -Naur ../../linux-2.4.2/linux/kernel/ksyms.c ./kernel/ksyms.c
--- ../../linux-2.4.2/linux/kernel/ksyms.c  Fri Feb  9 11:29:44 2001
+++ ./kernel/ksyms.c	Wed Feb 28 14:09:51 2001
@@ -241,6 +241,7 @@
 EXPORT_SYMBOL(poll_freewait);
 EXPORT_SYMBOL(ROOT_DEV);
 EXPORT_SYMBOL(__find_lock_page);
+EXPORT_SYMBOL(find_get_page_simple);
 EXPORT_SYMBOL(grab_cache_page);
 EXPORT_SYMBOL(read_cache_page);
 EXPORT_SYMBOL(vfs_readlink);
diff -Naur ../../linux-2.4.2/linux/mm/filemap.c ./mm/filemap.c
--- ../../linux-2.4.2/linux/mm/filemap.c	Fri Feb 16 16:06:17 2001
+++ ./mm/filemap.c  Wed Feb 28 14:22:09 2001
@@ -285,6 +285,34 @@
spin_unlock(&pagecache_lock);
 }
 
+/*
+ * This function is pretty much like __find_page_nolock(), but it only
+ * requires 2 arguments and doesn't mark the page as touched, making it
+ * ideal for ->writepage() clustering and other places where you don't
+ * want to mark the page referenced.
+ *
+ * The caller needs to hold the pagecache_lock.
+ */
+struct page * __find_page_simple(struct address_space *mapping, unsigned long index)
+{
+   struct page * page = *page_hash(mapping, index);
+   goto inside;
+
+   for (;;) {
+   page = page->next_hash;
+inside:
+   if (!page)
+   goto not_found;
+   if (page->mapping != mapping)
+   continue;
+   if (page->index == index)
+   break;
+   }
+
+not_found:
+   return page;
+}
+
 static inline struct page * __find_page_nolock(struct address_space *mapping, 
unsigned long offset, struct page *page)
 {
goto inside;
@@ -300,13 +328,14 @@
break;
}
/*
-* Touching the page may move it to the active list.
-* If we end up with too few inactive pages, we wake
-* up kswapd.
+* Mark the page referenced, moving inactive pages to the
+* active list.
 */
-   age_page_up(page);
-   if (inactive_shortage() > inactive_target / 2 && free_shortage())
-   wakeup_kswapd();
+   if (!PageActive(page))
+   activate_page(page);
+   else
+   SetPageReferenced(page);
+
 not_found:
return page;
 }
@@ -679,6 +708,22 @@
 }
 
 /*
+ * Similar to find_get_page but with no VM side-effects such as aging.
+ */
+struct page * find_get_page_simple(struct address_space *mapping,
+ unsigned long index)
+{
+   struct page *page;
+
+   spin_lock(&pagecache_lock);
+   page = __find_page_simple(mapping, index);
+   if (page)
+   page_cache_get(page);
+   spin_unlock(&pagecache_lock);
+   return page;
+}
+
+/*
  * Get the lock to a page atomically.
  */
 struct page * __find_lock_page (struct address_space *mapping,
@@ -734,7 +779,6 @@
 {
struct inode *inode = file->f_dentry->d_inode;
struct address_space *mapping = inode->i_mapping;
-   struct page **hash;
struct page *page;
unsigned long start;
 
@@ -755,8 +799,7 @@
 */
spin_lock(&pagecache_lock);
while (--index >= start) {
-   hash = page_hash(mapping, index);
-   page = __find_page_nolock(mapping, index, *hash);
+   page = __find_page_simple(mapping, index);
if (!page)
break;
deactivate_page(page);



[PATCH] bug in scsi debug code

2001-02-28 Thread Rajagopal Ananthanarayanan


A small fix in dump_stats() (scsi_merge.c), invoked when a (struct req)
has an inconsistent number of segments. The list formed
by b_reqnext is NULL-terminated, so the current code is
simply wrong: it can cause an oops if (req->bh) is NULL,
or it fails to print the last element in the b_reqnext chain.



--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--

--- ../../linux-2.4.2/linux/drivers/scsi/scsi_merge.c   Fri Feb  9 11:30:23 2001
+++ drivers/scsi/scsi_merge.c   Wed Feb 28 11:55:48 2001
@@ -90,7 +90,7 @@
printk("nr_segments is %x\n", req->nr_segments);
printk("counted segments is %x\n", segments);
printk("Flags %d %d\n", use_clustering, dma_host);
-   for (bh = req->bh; bh->b_reqnext != NULL; bh = bh->b_reqnext) 
+   for (bh = req->bh; bh != NULL; bh = bh->b_reqnext)
{
printk("Segment 0x%p, blocks %d, addr 0x%lx\n",
   bh,



Clustered IO (was: Re: [patch][rfc][rft] vm throughput 2.4.2-ac4)

2001-02-28 Thread Rajagopal Ananthanarayanan

Rik van Riel wrote:

> 
> Another solution would be to do some more explicit IO clustering and
> only flush _large_ clusters ... no need to invoke extra disk seeks
> just to free a single page, unless you only have single pages left.

Hi Rik,

Yes, clustering IO at the higher level can improve performance.
This improvement is on top of the excellent elevator changes that
Jens Axboe has done in 2.4.2. In XFS we are doing clustering
at writepage(). There are two paths:

1. page_launder() -> writepage() -> cluster
# this path under memory pressure.
2. try_to_free_buffers() -> writepage() -> cluster
# this path under background writing as in bdflush
# but can also be used by sync() type operations that
# work with buffers rather than pages.

Clustering by itself (in XFS) improves write performance by about 15-20%,
and we're seeing close to raw I/O performance. With clustering
the IO requests are pegged at 1024 sectors (512K bytes)
when performing large sequential writes ...
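
Roughly, the clustering loop in such a writepage() looks like this (a
simplified sketch, not the actual XFS code; fs_start_io() is a made-up
placeholder, and find_get_page_simple() is from the patch I posted
separately):

static int example_cluster_writepage(struct page *page)
{
	struct address_space *mapping = page->mapping;
	unsigned long index = page->index + 1;
	struct page *next;

	fs_start_io(page);	/* write the page itself */

	/* probe forward for contiguous dirty pages and push them too */
	while ((next = find_get_page_simple(mapping, index)) != NULL) {
		if (TryLockPage(next)) {	/* busy elsewhere: stop */
			page_cache_release(next);
			break;
		}
		if (!PageDirty(next)) {		/* clean: cluster ends */
			UnlockPage(next);
			page_cache_release(next);
			break;
		}
		fs_start_io(next);	/* I/O completion unlocks it */
		page_cache_release(next);
		index++;
	}
	return 0;
}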


ananth.


--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



sync on pages containing EOF

2001-02-20 Thread Rajagopal Ananthanarayanan


I was looking at some code to deal with sync (e.g. sys_fsync(fd)).
Generally, sync is performed by calling filemap_fdatasync(...),
which does writepage() on pages in the dirty list of the inode,
and then using filemap_fdatawait to wait on the I/Os started by
the writepage calls.

Consider writepage() on a (partial) page containing EOF. In this case,
prepare_write/commit_write is employed to write the page out.
However, commit_write will only mark the buffer dirty, and
not actually start the I/O. Subsequently, either memory pressure
(page_launder) or write pressure (flush_dirty_buffers) will
start the I/O on the EOF-page. So, it appears that filemap_fdatawait
will be delayed.

I'm just wondering if this argument is correct ...
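
To spell out the path I mean (2.4-style names, simplified; an
illustration, not actual kernel code):

static void fsync_example(struct inode *inode)
{
	/* starts ->writepage() on pages in the inode's dirty list */
	filemap_fdatasync(inode->i_mapping);

	/* waits only on I/O already in flight: a partial EOF page whose
	 * buffers were merely dirtied by commit_write() has no I/O
	 * started yet, so there is nothing here to wait on */
	filemap_fdatawait(inode->i_mapping);
}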

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



Re: [Kiobuf-io-devel] Re: 1st glance at kiobuf overhead in kernel aio vs pread vs user aio

2001-02-02 Thread Rajagopal Ananthanarayanan

Ingo Molnar wrote:
> 
> On Fri, 2 Feb 2001, Rajagopal Ananthanarayanan wrote:
> 
> > Do you really have worker threads? In my reading of the patch it seems
> > that the wtd is serviced by keventd. [...]
> 
> i think worker threads (or any 'helper' threads) should be avoided. It can
> be done without any extra process context, and it should be done that way.
> Why all the trouble with async IO requests if requests are going to end up
> in a worker thread's context anyway? (which will be a serializing point,
> otherwise why does it end up there?)
> 

Good point. Can you expand on how you plan to service pending
chunks of work (e.g. issuing readpage() on some pages) without
the use of threads?

thanks,


--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



Re: [Kiobuf-io-devel] Re: 1st glance at kiobuf overhead in kernel aio vs pread vs user aio

2001-02-02 Thread Rajagopal Ananthanarayanan

Benjamin LaHaise wrote:
> 
> Hey Ingo,
> 
> On Fri, 2 Feb 2001, Ingo Molnar wrote:
> 
> > - first of all, great patch! I've got a conceptual question: exactly how
> > does the AIO code prevent filesystem-related scheduling in the issuing
> > process' context? I'd like to use (and test) your AIO code for TUX, but i
> > do not see where it's guaranteed that the process that does the aio does
> > not block - from the patch this is not yet clear to me. (Right now TUX
> > uses separate 'async IO' kernel threads to avoid this problem.) Or if it's
> > not yet possible, what are the plans to handle this?
> 
> Thanks!  Right now the code does the page cache lookup allocations and
> lookups in the caller's thread, the write path then attempts to lock all
> pages sequentially during io using the async page locking function
> wtd_lock_page.  I've tried to get this close to some of the ideas proposed
> by Jeff Merkey, and have implemented async page and buffer locking
> mechanisms so far.  The down in the write path is still synchronous,
> mostly because I want some feedback before going much further down this
> path.  The read path verifies the up2date state of individual pages, and
> if it encounters one which is not, then it queues the request for the
> worker thread which calls readpage on all the pages that need updating.

[ Ben, good to see you have a patch to send, something which I've been
  requesting from you for some time now ;-) ]

Do you really have worker threads? In my reading of the patch it seems
that the wtd is serviced by keventd. And by using mapped kiobufs you've
avoided issues such as:

a. (not) requiring the requestor's process context to perform the copy
   (copy-out on read, for example)
b. avoiding the requestor's (user) page being unmapped while
   __iodesc_read_finish is executing.

These are two major improvements I'm glad to see over my earlier KAIO patch
(obURL: http://oss.sgi.com/projects/kaio/) ... of course, several abstractions,
including kiobufs & more generic task queues in 2.4 have made this easier,
which is a good thing.

I see several similarities to the KAIO patch too; stuff like splitting
the generic_read routine (which you have now expanded to include the
write routine as well), and the handling of RAW devices.

A nice addition in your patch is the introduction of kiobuf as a common
container of pages, which in the KAIO patch was handled with an ad-hoc
(page *) vector for the non-RAW case and kiobufs for the RAW case.

One point which is not clear is how one would implement aio_suspend(...),
which waits for any ONE of N aiocbs to complete. The aio_complete(...)
routine in your patch expects a particular idx to wait on, so I assume
that, as is, only one aiocb can be waited upon. Am I correct? This
particular case is solved in the KAIO patch ...
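
For reference, these are the POSIX semantics I have in mind; a
userspace caller must be able to do the following (standard <aio.h>,
not your patch's kernel API):

#include <aio.h>

void wait_for_any(struct aiocb *a, struct aiocb *b)
{
	const struct aiocb *list[2];

	list[0] = a;
	list[1] = b;
	/* must return when ANY one of the listed requests completes;
	 * a NULL timeout means wait indefinitely */
	aio_suspend(list, 2, NULL);
}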

Also, can you put out a library that goes with the kernel patch?
I can imagine what it would look like, but ...

Cheers,

ananth.

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

2000-12-08 Thread Rajagopal Ananthanarayanan

Linus Torvalds wrote:
> 
> On Fri, 8 Dec 2000, Daniel Phillips wrote:
> >
> > [ flush-buffers taking the page lock ]
> >
> > This is great when you have buffersize==pagesize.  When there are
> > multiple buffers per page it means that some of the buffers might have
> > to wait for flushing just because bdflush started IO on some other
> > buffer on the same page.  Oh well.  The common case improves in terms
> > being proveably correct and the uncommon case gets worse a tiny bit.
> > It sounds like a win.
> 
> Also, I think that we should strive for a setup where most of the dirty
> buffer flushing is done through "page_launder()" instead of using
> sync_buffers all that much at all.
> 
> I'm convinced that the page LRU list is as least as good as, if not better
> than, the dirty buffer timestamp stuff. And as we need to have the page
> LRU for other reasons anyway, I'd like the long-range plan to be to get
> rid of the buffer LRU completely. It wastes memory and increases
> complexity for very little gain, I think.
> 

I think flushing pages instead of buffers is a good direction to take.
Two things:

1. Currently bdflush is set up to use page_launder only
   under memory pressure (if (free_shortage()) ...).
   Do you think that it should call page_launder regardless?
   (See the sketch after this list.)

2. There are two operations here:
   a. starting a write-back, periodically.
   b. freeing a page, which may involve taking the page
      out of an inode mapping, etc. IOW, what page_launder does.
   bdflush primarily does (a). If we want to move to page-oriented
   flushing, we at least need extra information in the _page_ structure
   as to whether it is time to flush the page back.
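
A sketch of what (1) would mean inside bdflush (2.4-style, illustrative
only; the exact surrounding code differs):

	/* today, roughly: launder only under memory pressure */
	if (free_shortage())
		page_launder(GFP_KERNEL, 0);

	/* the question: call it on every pass instead, so dirty
	 * pages get flushed even without memory pressure */
	page_launder(GFP_KERNEL, 0);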


--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



Re: test10-pre4: deadlock in VM?

2000-10-25 Thread Rajagopal Ananthanarayanan

Tigran Aivazian wrote:
> 
> Hi guys,
> 
> When running SPEC SFS tests against 2.4.0-test10-pre4 on a 4-way SMP
> machine with 6G RAM (highmem+PAE enabled) I got
> 
> __alloc_pages: 0-order allocation failed.
> 
> (probably coming from nfsd, why don't we print eip of the caller there?)
> 
> and the machine locked up (but pingable). So I entered kdb and got stack
> traces of all running proceeses:


Hmm. It appears that some of the processes are stuck on this
part of page_launder:

/*
 * Re-take the spinlock. Note that we cannot
 * unlock the page yet since we're still
 * accessing the page_struct here...
 */
spin_lock(&pagemap_lru_lock);

It will be interesting to see what's going on in each of the cpus.
Use "cpu x", x=0,1,2,3, on your 4-cpu system to switch to cpu x,
and just type "bt" on each cpu. Also, it will be good to see what
kswapd (pid 2) is up to ...


--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.



kmap_high/flush_tlb_all/smp_call_function problem

2000-10-09 Thread Rajagopal Ananthanarayanan
ystem_call+0x33
   kernel .text 0xc010 0xc010a748 0xc010a780
[1]kdb> cpu 2

Entering kdb (current=0xe24dc000, pid 8180) on processor 2 due to cpu switch
[2]kdb> bt
EBP   EIP Function(args)
0xe24ddc74 0xc0112f48 smp_call_function+0x84 (0xc0112db8, 0x0, 0x1, 0x1)
   kernel .text 0xc010 0xc0112ec4 0xc0112f7c
0xe24ddc90 0xc0112e1c flush_tlb_all+0x14
   kernel .text 0xc010 0xc0112e08 0xc0112e68
0xe24ddca4 0xc01334f6 flush_all_zero_pkmaps+0x7a (0x0)
   kernel .text 0xc010 0xc013347c 0xc0133500
0xe24ddcd8 0xc01335c6 kmap_high+0xc6
   kernel .text 0xc010 0xc0133500 0xc0133674
0xe24ddd00 0xc0166248 _pagebuf_handle_iovecs+0x88 (0xe24ddee0, 0xc3118940, 0x0, 
0x39000, 0x0)
   kernel .text 0xc010 0xc01661c0 0xc016648c
0xe24ddd2c 0xc01664b2 _pagebuf_iomove_apply+0x26 (0xe24ddee0, 0xe0294a20, 0x39000, 0x0,
0xc3118940)
   kernel .text 0xc010 0xc016648c 0xc01664bc
0xe24ddd90 0xc01657c4 pagebuf_segment_apply+0x234 (0xc016648c, 0xe24ddee0, 0xe0294a20, 
0x0,
0x8000)
   kernel .text 0xc010 0xc0165590 0xc016581c
0xe24dddbc 0xc01667ad _pb_buffered_read+0xed (0xe0b98c60, 0x32000, 0x0, 0x8000, 
0xe24dde30)
   kernel .text 0xc010 0xc01666c0 0xc01667c4
0xe24dde4c 0xc0166b61 pagebuf_file_read+0x225 (0xe20ce140, 0xe24ddee0, 0xe1785ab4, 0x1,
0xe0b98c60)
   kernel .text 0xc010 0xc016693c 0xc0166c08
0xe24dde74 0xc01c0a12 linvfs_file_read+0x32 (0xe20ce140, 0xe24ddee0)
   kernel .text 0xc010 0xc01c09e0 0xc01c0a30
0xe24ddeac 0xc012a1fc do_generic_file_read+0xe8 (0xe20ce140, 0xe20ce160, 0xe24ddee0, 
0xc0166c08)
   kernel .text 0xc010 0xc012a114 0xc012a618
[2]more> 
0xe24ddf1c 0xc0166cfb pagebuf_generic_file_read+0xc3 (0xe20ce140, 0x41dc9200, 0x38000,
0xe20ce160)
   kernel .text 0xc010 0xc0166c38 0xc0166d4c
0xe24ddf44 0xc01c12a8 xfs_rdwr+0x48 (0xe1785ab4, 0xe20ce140, 0x41dc9200, 0x38000, 
0xe20ce160)
   kernel .text 0xc010 0xc01c1260 0xc01c12d4
0xe24ddf70 0xc01c1325 xfs_read+0x51 (0xe1785ab4, 0xe20ce140, 0x41dc9200, 0x38000, 
0xe20ce160)
   kernel .text 0xc010 0xc01c12d4 0xc01c1330
0xe24ddf98 0xc01be202 linvfs_read+0x62 (0xe20ce140, 0x41dc9200, 0x38000, 0xe20ce160, 
0xe24dc000)
   kernel .text 0xc010 0xc01be1a0 0xc01be20c
0xe24ddfbc 0xc01355a8 sys_read+0x94 (0x3a, 0x41dc9200, 0x38000, 0x38000, 0x41dc9200)
   kernel .text 0xc010 0xc0135514 0xc01355c0
   0xc010a77b system_call+0x33
   kernel .text 0xc010 0xc010a748 0xc010a780
[2]kdb> cpu 3

Entering kdb (current=0xe2668000, pid 8182) on processor 3 due to cpu switch
[3]kdb> bt
EBP   EIP Function(args)
   0xc0278abd stext_lock+0x1955
   kernel .text.lock 0xc0277168 0xc0277168 0xc027dac0
0xe2669e14 0xc013350b kmap_high+0xb
   kernel .text 0xc010 0xc0133500 0xc0133674
0xe2669e3c 0xc0166248 _pagebuf_handle_iovecs+0x88 (0xe2669ee0, 0xc3064d54, 0x0, 
0x8000, 0x0)
   kernel .text 0xc010 0xc01661c0 0xc016648c
0xe2669e6c 0xc0166c2d _pagebuf_read_helper+0x25 (0xe2669ee0, 0xc3064d54, 0x0, 0x1000)
   kernel .text 0xc010 0xc0166c08 0xc0166c38
0xe2669eac 0xc012a334 do_generic_file_read+0x220 (0xe34c74a0, 0xe34c74c0, 0xe2669ee0,
0xc0166c08)
   kernel .text 0xc010 0xc012a114 0xc012a618
0xe2669f1c 0xc0166cfb pagebuf_generic_file_read+0xc3 (0xe34c74a0, 0x41d88200, 0x38000,
0xe34c74c0)
   kernel .text 0xc010 0xc0166c38 0xc0166d4c
0xe2669f44 0xc01c12a8 xfs_rdwr+0x48 (0xe180d470, 0xe34c74a0, 0x41d88200, 0x38000, 
0xe34c74c0)
   kernel .text 0xc010 0xc01c1260 0xc01c12d4
0xe2669f70 0xc01c1325 xfs_read+0x51 (0xe180d470, 0xe34c74a0, 0x41d88200, 0x38000, 
0xe34c74c0)
   kernel .text 0xc010 0xc01c12d4 0xc01c1330
0xe2669f98 0xc01be202 linvfs_read+0x62 (0xe34c74a0, 0x41d88200, 0x38000, 0xe34c74c0, 
0xe2668000)
   kernel .text 0xc010 0xc01be1a0 0xc01be20c
0xe2669fbc 0xc01355a8 sys_read+0x94 (0x31, 0x41d88200, 0x38000, 0x38000, 0x41d88200)
   kernel .text 0xc010 0xc0135514 0xc01355c0
   0xc010a77b system_call+0x33
   kernel .text 0xc010 0xc010a748 0xc010a780


--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.