raw device and linux scheduling performance weirdness

2001-03-12 Thread Ying Chen

Hi,

I ran some fairly trivial raw disk performance tests on 2.4.0 using
its raw device support, and I am getting some strange results. My
program opens a raw device, then issues a sequence of sequential/random
reads/writes on it using pread/pwrite. I time both the whole sequence
and each individual request. In some runs the elapsed time for the
whole sequence of I/O requests is significantly longer than the sum of
the individual request response times (roughly 100 times longer), yet
my program does nothing between requests except a gettimeofday call to
record the next request's start time. Nothing else was running on the
system when the tests were run, so the process should not be contending
with anything else.
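
For reference, here is a minimal sketch of the timing harness described
above; the device path, block size, and request count are placeholders,
not the values from the original test, and raw I/O needs a sector-aligned
buffer:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK 4096          /* placeholder request size */
#define NREQ  1024          /* placeholder request count */

/* Wall-clock time in seconds. */
static double now(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
        double sum = 0.0, t0, start;
        char *buf;
        long i;
        int fd;

        /* Raw device I/O requires a sector-aligned buffer. */
        if (posix_memalign((void **)&buf, 512, BLOCK))
                return 1;

        fd = open("/dev/raw/raw1", O_RDONLY);   /* placeholder device */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        start = now();
        for (i = 0; i < NREQ; i++) {
                t0 = now();                     /* per-request start time */
                if (pread(fd, buf, BLOCK, (off_t)i * BLOCK) != BLOCK) {
                        perror("pread");
                        return 1;
                }
                sum += now() - t0;              /* sum of per-request times */
        }
        printf("sequence elapsed %.6f s, sum of requests %.6f s\n",
               now() - start, sum);
        close(fd);
        return 0;
}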

This suggests to me that the raw device I/O process is either getting
stuck or the Linux scheduler is delaying it somewhere. I tried renicing
the process to a higher priority; it didn't seem to help.
Any ideas?

Thanks,

Ying


pthreads related issues

2001-03-07 Thread Ying Chen

Hi,

I think I forgot to include the subject on the email I sent last time.
Not sure how many people saw it. I'm trying to send this message again...

I have two questions on Linux pthread-related issues. Would anyone be able
to help?

1. Does anyone have suggestions (pointers) for good kernel-level Linux thread
libraries?
2. We ran a multi-threaded application using the Linux pthread library on
2-way SMP and UP Intel platforms (with both 2.2 and 2.4 kernels). We see a
significant increase in context switching when moving from UP to SMP, and
high CPU usage with no performance gain in terms of actual work done, despite
the fact that the benchmark we are running is CPU-bound. The kernel profiler
indicates that a lot of kernel CPU ticks went to scheduling and signaling
overhead. Has anyone seen something like this before with pthread
applications running on SMP platforms? Any suggestions or pointers on this
subject? (A rough sketch of the kind of benchmark involved is shown below.)
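
For context, a minimal sketch of the kind of CPU-bound pthread benchmark
being described; the thread count and iteration count are placeholders, not
the actual benchmark (build with cc -O2 bench.c -lpthread, and run it under
something like vmstat 1 to compare the system-wide context-switch rate on UP
vs. SMP):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                       /* placeholder thread count */
#define ITERS    (200 * 1000 * 1000L)    /* placeholder work per thread */

/* Pure CPU work: no I/O, no locks, so ideally very few context switches. */
static void *worker(void *arg)
{
        volatile long x = 0;
        long i;

        (void)arg;
        for (i = 0; i < ITERS; i++)
                x += i;
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

        puts("done");
        return 0;
}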


Ying


No Subject

2001-03-06 Thread Ying Chen

Hi,

I have two questions on Linux pthread-related issues. Would anyone be able
to help?

1. Does anyone have suggestions (pointers) for good kernel-level Linux thread
libraries?
2. We ran a multi-threaded application using the Linux pthread library on
2-way SMP and UP Intel platforms (with both 2.2 and 2.4 kernels). We see a
significant increase in context switching when moving from UP to SMP, and
high CPU usage with no performance gain in terms of actual work done, despite
the fact that the benchmark we are running is CPU-bound. The kernel profiler
indicates that a lot of kernel CPU ticks went to scheduling and signaling
overhead. Has anyone seen something like this before with pthread
applications running on SMP platforms? Any suggestions or pointers on this
subject?

Thanks a lot!

Ying




Re: test11-pre6

2000-11-16 Thread Ying Chen/Almaden/IBM


Linus,

You forgot about the wakeup_bdflush(1) stuff.

Here is the patch again (against test10).
===
There are several places where schedule() is called after wakeup_bdflush(1)
is called. This is completely unnecessary: wakeup_bdflush(1) has already
given up control, and when control returns to the thread that called
wakeup_bdflush(1), it should just go on. Calling schedule() after
wakeup_bdflush(1) makes the calling thread give up control a second time.
This is a problem for latency-sensitive benchmarks (like SPEC SFS) and
applications.


diff -ruN mm.orig/highmem.c mm.opt/highmem.c
--- mm.orig/highmem.c   Wed Oct 18 14:25:46 2000
+++ mm.opt/highmem.c    Fri Nov 10 17:51:39 2000
@@ -310,8 +310,6 @@
bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER);
if (!bh) {
wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
-   current->policy |= SCHED_YIELD;
-   schedule();
goto repeat_bh;
}
/*
@@ -324,8 +322,6 @@
page = alloc_page(GFP_BUFFER);
if (!page) {
wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
-   current->policy |= SCHED_YIELD;
-   schedule();
goto repeat_page;
}
set_bh_page(bh, page, 0);
diff -ruN fs.orig/buffer.c fs.opt/buffer.c
--- fs.orig/buffer.c    Thu Oct 12 14:19:32 2000
+++ fs.opt/buffer.c Fri Nov 10 20:05:44 2000
@@ -707,11 +707,8 @@
  */
 static void refill_freelist(int size)
 {
-   if (!grow_buffers(size)) {
+   if (!grow_buffers(size))
wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
-   current->policy |= SCHED_YIELD;
-   schedule();
-   }
 }

 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void
*private)
======


Ying Chen


[patch] nfsd optimizations for test10 (yet another try)

2000-11-13 Thread Ying Chen/Almaden/IBM


Neil,

Here is a set of fixes and answers to your questions/points. The new patch
was tested in my own environment again and worked fine.


1/ Why did you change nfsd_busy into an atomic_t?  It is only ever
   used or updated inside the Big-Kernel-Lock, so it doesn't need
   to be atomic.

I think I described why this was there in the previous email.
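
(For what it's worth, the atomic counter pattern under discussion looks
roughly like the following; this is a generic illustration, not the actual
nfsd patch, and nfsd_handle_one_request is a made-up name:

static atomic_t nfsd_busy = ATOMIC_INIT(0);

static void nfsd_handle_one_request(struct svc_rqst *rqstp)
{
        atomic_inc(&nfsd_busy);         /* one more thread busy */
        /* process the request */
        atomic_dec(&nfsd_busy);         /* done */
}

A reader can then sample atomic_read(&nfsd_busy) without relying on the
Big Kernel Lock.)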

2/ Your new nfsd_racache_init always allocates a new cache, whereas
   the current one checks first to see if it has already been
   allocated.
   This is important because it is quite legal to run "rpc.nfsd"
   multiple times.  Subsequent invocations serve to change the number
   of nfsd threads running.

Fixed.


3/ You currently allocate a single slab of memory for all of the
   "struct raparms".  Admittedly this is what the old code did, but I
   don't think that it is really necessary, and calling kmalloc
   multiple times would work just as well and would (arguably) be
   clearer.

Changed to use kmalloc.


4/ small point:  you added a hash table as the comment suggests might
   be needed, but you didn't change the comment accordingly:-)

Fixed (but I didn't add much of a comment since it seems
straightforward).

5/ the calls to spin_lock/spin_unlock in nfsd_racache_init seem
   pointless. At this point, nothing else could possibly be accessing
   the racache, and if it was you would have even bigger problems.
   ditto for nfsd_racache_shutdown

Fixed.

6/ The lru list is now a list.h list, but the hash lists aren't.  Why
   is that?

Fixed.

7/ The old code kept a 'use' count for each cache entry to make sure
   that an entry was not reused while it was in use.  You have dropped
   this.  Now because of the lru ordering, and because each thread can
   use at most one entry, you won't have a problem if there are more
   cache entries than threads, and you currently have 2048 entries
   configured which is greater than NFSD_MAXSERVS.  However I think it
   would be best if this dependency were made explicit.
   Maybe the call to nfsd_racache_init should tell the racache how
   many threads are being started, and nfsd_racache_init should record
   how many cache entries have been alloced, and it could alloc some
   more if needed.

I'd disagree on creating a dependency between the number of NFSD threads and
the number of cache entries. The number of cache entries is more a function
of the number of open/read files than anything else. Of course, you can argue
that more NFSD threads could mean a larger number of files, but a sensible
number (like 2048) would suffice for a huge number of NFSD threads. In
practice, more than several hundred NFSD threads will probably never happen,
even on large SMPs. Also, since 2048 entries really do not take much memory
(a couple of hundred KB), it seems fine to simply go with that.

8/ I would like the stats collected to tell me a bit more about what
   was going on.  I find simple hit/miss numbers nearly useless, as
   you expect many lookups to be misses anyway (the first time a file is
   accessed) but you don't know what percentage.
   As a first approximation, I would like to only count a miss if the
   seek address was > 0.
   What would be really nice would be to get stats on how long entries
   stayed in the cache between last use and re-use.  If we stored a
   'last-use' time in each entry, and on reuse, kept count of which
   range the age was in:

 0-62 msec
 63-125 msec
 125-250 msec
 250-500 msec
 500-1000 msec
  1-2 sec
  2-4 sec
  4-8 sec
  8-16 sec
  16-32 sec

   This obviously isn't critical, but it would be nice to be able
   to see how the cache was working.

Sure. I haven't put such things into this patch. I'd be happy to roll them in
later on, since they're non-critical at the moment. (A rough sketch of the
age-bucket accounting is below.)
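
For illustration, a minimal sketch of the age-bucket accounting suggested in
point 8, assuming kernel context (HZ, jiffies); the field, array, and
function names here are made up for the example and are not from the actual
patch:

/* Histogram of entry ages at reuse time, in roughly the power-of-two
 * ranges listed above (0-62 ms, 63-125 ms, ..., 16-32 s). */
#define RA_AGE_BUCKETS 10

static unsigned long ra_age_hist[RA_AGE_BUCKETS];

/* 'age' is the idle time of the reused entry in jiffies, e.g.
 * jiffies - ra->p_last_use for a hypothetical p_last_use field. */
static void ra_account_age(unsigned long age)
{
        unsigned long ms = age * 1000 / HZ;
        int bucket = 0;

        /* Bucket thresholds double: 63 ms, 126 ms, 252 ms, ... */
        while (bucket < RA_AGE_BUCKETS - 1 && ms >= (63UL << bucket))
                bucket++;
        ra_age_hist[bucket]++;
}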

9/ Actually, you don't need the spinlock at all, and nfsd is currently
   all under the BigKernelLock, but it doesn't hurt to have it around
   the nfsd_get_raparms function because we hopefully will get rid of
   the BKL one day.

Again, I explained why I had it in the previous email.

Regards,

Ying

Here is the patch. The only files changed since the last patch were
racache.h and nfsracache.c.

diff -ruN nfsd.orig/nfsd.h nfsd.opt/nfsd.h
--- nfsd.orig/nfsd.h Fri Nov 10 15:27:37 2000
+++ nfsd.opt/nfsd.h Fri Nov 10 16:03:43 2000
@@ -76,7 +76,7 @@

 /* nfsd/vfs.c */
 int  fh_lock_parent(struct svc_fh *, struct dentry *);
-int  nfsd_racache_init(int);
+int  nfsd_racache_init(void);
 void  nfsd_racache_shutdown(void);
 int  nfsd_lookup(struct svc_rqst *, struct svc_fh *,
const char *, int, struct svc_fh *);
diff -ruN nfsd.orig/racache.h nfsd.opt/racache.h
--- nfsd.orig/racache.h  Fri Nov 10 16:10:23 2000
+++ nfsd.opt/racache.h   Fri Nov 10 15:50:49 2000
@@ -0,0 +1,41 @@
+/*
+ * include/linux/nfsd/racache.h
+ *
+ * Read 

[patch] nfsd optimizations for test10 (recoded to use list_head)

2000-11-12 Thread Ying Chen/Almaden/IBM
+static struct raparms *  raparm_cache = NULL;
+static struct list_head *hash_list;
+static struct list_head lru_head;
+static struct list_head free_head;
+
+int
+nfsd_racache_init(void)
+{
+        struct raparms          *rp;
+        struct list_head        *rahead;
+        size_t                  i;
+        unsigned long           order;
+
+        i = CACHESIZE * sizeof (struct raparms);
+        for (order = 0; (PAGE_SIZE << order) < i; order++)
+                ;
+        raparm_cache = (struct raparms *)
+                __get_free_pages(GFP_KERNEL, order);
+        if (!raparm_cache) {
+                printk (KERN_ERR "nfsd: cannot allocate %Zd bytes for racache\n", i);
+                return -1;
+        }
+        memset(raparm_cache, 0, i);
+
+        i = HASHSIZE * sizeof (struct list_head);
+        hash_list = kmalloc (i, GFP_KERNEL);
+        if (!hash_list) {
+                free_pages ((unsigned long)raparm_cache, order);
+                raparm_cache = NULL;
+                printk (KERN_ERR "nfsd: cannot allocate %Zd bytes for hash list in racache\n", i);
+                return -1;
+        }
+
+        spin_lock(&racache_lock);
+        for (i = 0, rahead = hash_list; i < HASHSIZE; i++, rahead++)
+                INIT_LIST_HEAD(rahead);
+
+        INIT_LIST_HEAD(&free_head);
+        for (i = 0, rp = raparm_cache; i < CACHESIZE; i++, rp++) {
+                rp->p_hash_next = rp->p_hash_prev = rp;
+                list_add(&rp->p_lru, &free_head);
+        }
+        INIT_LIST_HEAD(&lru_head);
+        spin_unlock(&racache_lock);
+
+        nfsdstats.ra_size = CACHESIZE;
+        return 0;
+}
+
+void
+nfsd_racache_shutdown(void)
+{
+        size_t                  i;
+        unsigned long           order;
+
+        i = CACHESIZE * sizeof (struct raparms);
+        for (order = 0; (PAGE_SIZE << order) < i; order++)
+                ;
+        spin_lock(&racache_lock);
+        free_pages ((unsigned long)raparm_cache, order);
+        raparm_cache = NULL;
+        kfree (hash_list);
+        hash_list = NULL;
+        spin_unlock(&racache_lock);
+}
+
+/* Insert a new entry into the hash table. */
+static inline struct raparms *
+nfsd_racache_insert(ino_t ino, dev_t dev)
+{
+        struct raparms *ra = NULL;
+        struct list_head *rap;
+
+        if (list_empty(&free_head)) {
+                /* Replace with LRU. */
+                struct raparms *prev, *next;
+                ra = list_entry(lru_head.prev, struct raparms, p_lru);
+                prev = ra->p_hash_prev,
+                next = ra->p_hash_next;
+                prev->p_hash_next = next;
+                next->p_hash_prev = prev;
+                ra->p_hash_next = NULL;
+                ra->p_hash_prev = NULL;
+                list_del(lru_head.prev);
+        } else {
+                ra = list_entry(free_head.next, struct raparms, p_lru);
+                list_del(free_head.next);
+        }
+
+        memset(ra, 0, sizeof(*ra));
+        ra->p_dev = dev;
+        ra->p_ino = ino;
+        rap = (struct list_head *) &hash_list[REQHASH(ino, dev)];
+        ra->p_hash_next = (struct raparms *)(rap->next);
+        ra->p_hash_prev = (struct raparms *) rap;
+        ((struct raparms *)(rap->next))->p_hash_prev = ra;
+        rap->next = (struct list_head *)ra;
+
+        list_add(&ra->p_lru, &lru_head);
+        return ra;
+}
+
+/*
+ * Try to find an entry matching the current call in the cache. When none
+ * is found, we grab the oldest unlocked entry off the LRU list.
+ * Note that no operation within the loop may sleep.
+ */
+struct raparms *
+nfsd_get_raparms(dev_t dev, ino_t ino)
+{
+        struct raparms *rahead;
+        struct raparms *ra = NULL;
+
+        spin_lock(&racache_lock);
+
+        ra = rahead = (struct raparms *) &hash_list[REQHASH(ino, dev)];
+        while ((ra = ra->p_hash_next) != rahead) {
+                if ((ra->p_ino == ino) && (ra->p_dev == dev)) {
+                        /* Do LRU reordering */
+                        list_del(&ra->p_lru);
+                        list_add(&ra->p_lru, &lru_head);
+                        nfsdstats.ra_hits++;
+                        goto found;
+                }
+        }
+
+        /* Did not find one. Get a new item and insert it into the hash table. */
+        ra = nfsd_racache_insert(ino, dev);
+        nfsdstats.ra_misses++;
+found:
+        spin_unlock(&racache_lock);
+        return ra;
+}

Ying Chen



problems with sync_all_inodes() in prune_icache() and kupdate()

2000-11-11 Thread Ying Chen/Almaden/IBM

Hi,

I'm wondering if someone can tell me why sync_all_inodes() is called in
prune_icache(). sync_all_inodes() can cause problems in some situations when
memory is short and shrink_icache_memory() is called.
For instance, when the system is really short of memory,
do_try_to_free_pages() is invoked (either by an application or by kswapd) and
shrink_icache_memory() is invoked as well, but the first thing prune_icache()
does is call sync_all_inodes(). If an inode block is not in memory, it may
have to bread the inode block in, so kswapd can block until the inode block
is brought into memory. Worse, since the system is short of memory, there may
not even be memory available for the inode block. Even if there is, given
that there is only a single kswapd thread doing sync_all_inodes(), if the
dirty inode list is relatively long (tens of thousands of inodes, as in
something like SPEC SFS), it takes practically forever for sync_all_inodes()
to finish. To the user, this looks like the system is hung (although it isn't
really); it's just taking a very long time to do shrink_icache_memory()!

One solution is simply not to call sync_all_inodes() at all in
prune_icache(), since other parts of the kernel, like kupdate(), sync inodes
periodically anyway, but I don't know whether this has other implications. I
don't see a problem with it myself. In fact, I have been using this fix in my
own test9 kernel, and I get much smoother kernel behavior when running a
high-load SPEC SFS than with the default prune_icache(). With
sync_all_inodes() in place, SPEC SFS sometimes simply fails due to the long
response times on the I/O requests.

A similar story applies to the kupdate() daemon. Since there is only a single
thread doing the inode and buffer flushing, under high load kupdate() does
not get a chance to call flush_dirty_buffers() until sync_inodes() has
completed. But sync_inodes() can take forever, since inodes are flushed to
disk serially; imagine how long it might take if flushing each inode causes
one read from disk! In my experience with SPEC SFS, if kupdate() is invoked
during the run, it sometimes cannot finish sync_inodes() until the entire
benchmark run is over. So all the dirty buffers that flush_dirty_buffers(1)
is supposed to flush never get flushed during the benchmark run, and the
system constantly runs in bdflush() mode, which is really only supposed to
kick in as a panic measure!

Again, the solution can be simple: one can create multiple
dirty-buffer-flushing daemon threads that call flush_dirty_buffers() without
the sync_super or sync_inode work. I have done so in my own test9 kernel, and
the results with SPEC SFS are much more pleasant. A rough sketch of such a
daemon is shown below.
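
For illustration only, a minimal sketch of one such extra flushing daemon.
This is not the actual test9 patch; it assumes 2.4-era kernel APIs, assumes a
flush_dirty_buffers()-style helper is callable from here, and uses placeholder
names and an arbitrary sleep interval:

/* Hypothetical extra dirty-buffer flushing daemon (illustrative sketch). */
static int extra_flush_daemon(void *unused)
{
        /* Detach from user resources, as kernel daemons normally do. */
        daemonize();
        sprintf(current->comm, "kflushd-extra");

        for (;;) {
                /* Flush buffers whose flush time has expired, but skip the
                 * sync_supers()/sync_inodes() work that kupdate also does. */
                flush_dirty_buffers(1);

                /* Sleep briefly before the next pass. */
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(HZ / 20);
        }
        return 0;
}

Such daemons would be started at boot, e.g. with something like
kernel_thread(extra_flush_daemon, NULL, 0), once per desired thread.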

Ying




[patch] wakeup_bdflush related fixes and nfsd optimizations for test10

2000-11-11 Thread Ying Chen/Almaden/IBM

Hi,

This patch includes two sets of things against test10:
First, there are several places where schedule() is called after
wakeup_bdflush(1). This is completely unnecessary: wakeup_bdflush(1) has
already given up control, and when control returns to the thread that called
wakeup_bdflush(1), it should just go on. Calling schedule() after
wakeup_bdflush(1) makes the calling thread give up control a second time.
This is a problem for latency-sensitive benchmarks (like SPEC SFS) and
applications.

Second (I posted this to the kernel mailing list, but forgot to cc Linus), I
made some optimizations to the racache in nfsd in test10. The idea is to
replace the existing fixed-length table for the readahead cache in NFSD with
a hash table. The old racache is essentially ineffective in dealing with a
large number of files, yet eats CPU cycles scanning the table (even though
the table is small); the hash-table-based version is much more effective and
fast. I have generated the patch for test10 and tested it.

(See attached file: a)

Ying Chen
[EMAIL PROTECTED]
IBM Almaden Research Center

 a



[patch] nfsd optimizations for test10

2000-11-10 Thread Ying Chen/Almaden/IBM

Hi,

I made some optimizations to the racache in nfsd in test10. The idea is to
replace the existing fixed-length table for the readahead cache in NFSD with
a hash table. The old racache is essentially ineffective in dealing with a
large number of files, yet eats CPU cycles scanning the table (even though
the table is small); the hash-table-based version is much more effective and
fast. I have generated the patch for test10 and tested it.

(See attached file: nfshdiff)(See attached file: nfsdiff)


Ying
 nfshdiff
 nfsdiff



Re: VM in v2.4.0test9

2000-10-04 Thread Ying Chen/Almaden/IBM


I'd second that this is most likely a VM-related problem. A few days ago I
sent you an example where I could make the system hang simply by doing a mkfs
on a 90 GB file system. This happens when the low 1 GB of memory is used up
(but I still have the high 1 GB available). I think David probably ran into
the same problem I did.

I traced the problem down a bit. It seems that the system goes into a loop
consisting of the following call sequence:
 alloc_pages --> try_to_free_pages --> do_try_to_free_pages (but it
doesn't seem to go through page_launder in do_try_to_free_pages) -->
kmem_cache_reap (so it skipped the refill_inactive() in the if statement) -->
back to try_again in alloc_pages.
If I kill mkfs, the system goes on nicely for some small applications,
though it is still a bit slow.
If I do a "make bzImage", the system hangs again. Looking at the SysRq
output, I had more than 2000 pages available on the inactive_clean list. I
wonder why alloc_pages doesn't take them.

I had no problems of this kind with test6 or test7.

Ying

Rik van Riel <[EMAIL PROTECTED]>@vger.kernel.org on 10/04/2000 09:31:21
AM

Sent by:  [EMAIL PROTECTED]


To:   David Weinehall <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED], Linus Torvalds <[EMAIL PROTECTED]>
Subject:  Re: VM in v2.4.0test9



On Wed, 4 Oct 2000, David Weinehall wrote:

> Running the included program on a clean v2.4.0test9 kernel I can
> hang the computer practically in no time.

> What seems most strange is that the doesn't even get depleated.
> The machine still answers to SysRq and ping, but nothing else.

Looking again at this report in more detail, something
very strange is going on ...

> This is what I got from SysRq+M (manual copy):
>
> Free pages: 500 kB (0 highmem)
> Active: 8 inactive dirty: 1009, inactive clean:0
> free: 125 (31 62 93)

First, you have MORE free memory than freepages.high. In this
case I really don't see why __alloc_pages() wouldn't give the
memory to your processes 

> Free swap: 64772

And there is tons of swap free...

Are you absolutely sure this is VM related?  This almost looks
like the system puts in a read request but the request queue
doesn't get unplugged, or something strange like that ...

There is more than enough memory to satisfy all VM requests and
the loop in __alloc_pages() is straightforward enough to give
your processes their memory without strange bugs ...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/  http://www.surriel.com/
