raw device and linux scheduling performance weirdness
Hi,

I ran some fairly trivial raw disk performance tests on 2.4.0 using its raw device support, and I am getting some strange results. My program opens a raw device, then issues a sequence of sequential/random reads/writes on it using pread/pwrite, with timing around both the whole sequence and the individual requests. In some runs, the elapsed time for the whole sequence of I/O requests is significantly longer than the sum of the individual request response times (roughly 100 times longer), yet the program does nothing between requests except a gettimeofday call to record each request's start time. Nothing else was running on the system during the tests, so the process should not be contending with anything. It looks to me as though the raw-device I/O process is either getting stuck or the Linux scheduler is skewing things somewhere. Renicing the process to a higher priority did not help. Any ideas?

Thanks,
Ying

_ Get your FREE download of MSN Explorer at http://explorer.msn.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
pthreads related issues
Hi,

I think I forgot to include a subject on the email I sent last time, so I'm not sure how many people saw it. I'm sending this message again...

I have two questions on Linux pthread-related issues. Would anyone be able to help?

1. Does anyone have suggestions (pointers) on good kernel-level Linux thread libraries?

2. We ran a multi-threaded application using the Linux pthread library on 2-way SMP and UP Intel platforms (with both 2.2 and 2.4 kernels). We see a significant increase in context switching when moving from UP to SMP, and high CPU usage with no performance gain in terms of actual work done, despite the fact that the benchmark we are running is CPU-bound. The kernel profiler indicates that a lot of kernel CPU ticks went to scheduling and signaling overheads. Has anyone seen something like this before with pthread applications running on SMP platforms? Any suggestions or pointers on this subject?

Ying
No Subject
Hi,

I have two questions on Linux pthread-related issues. Would anyone be able to help?

1. Does anyone have suggestions (pointers) on good kernel-level Linux thread libraries?

2. We ran a multi-threaded application using the Linux pthread library on 2-way SMP and UP Intel platforms (with both 2.2 and 2.4 kernels). We see a significant increase in context switching when moving from UP to SMP, and high CPU usage with no performance gain in terms of actual work done, despite the fact that the benchmark we are running is CPU-bound. The kernel profiler indicates that a lot of kernel CPU ticks went to scheduling and signaling overheads. Has anyone seen something like this before with pthread applications running on SMP platforms? Any suggestions or pointers on this subject?

Thanks a lot!
Ying
Re: test11-pre6
Linus,

You forgot about the wakeup_bdflush(1) stuff. Here is the patch again (against test10).

===

There are several places where schedule() is called after wakeup_bdflush(1). This is completely unnecessary: wakeup_bdflush(1) has already given up control, and when control returns to the thread that called it, that thread should simply go on. Calling schedule() after wakeup_bdflush(1) makes the calling thread give up control a second time. This is a problem for latency-sensitive benchmarks (like SPEC SFS) and applications.

diff -ruN mm.orig/highmem.c mm.opt/highmem.c
--- mm.orig/highmem.c	Wed Oct 18 14:25:46 2000
+++ mm.opt/highmem.c	Fri Nov 10 17:51:39 2000
@@ -310,8 +310,6 @@
 	bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER);
 	if (!bh) {
 		wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
-		current->policy |= SCHED_YIELD;
-		schedule();
 		goto repeat_bh;
 	}
 	/*
@@ -324,8 +322,6 @@
 	page = alloc_page(GFP_BUFFER);
 	if (!page) {
 		wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
-		current->policy |= SCHED_YIELD;
-		schedule();
 		goto repeat_page;
 	}
 	set_bh_page(bh, page, 0);
diff -ruN fs.orig/buffer.c fs.opt/buffer.c
--- fs.orig/buffer.c	Thu Oct 12 14:19:32 2000
+++ fs.opt/buffer.c	Fri Nov 10 20:05:44 2000
@@ -707,11 +707,8 @@
  */
 static void refill_freelist(int size)
 {
-	if (!grow_buffers(size)) {
+	if (!grow_buffers(size))
 		wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
-		current->policy |= SCHED_YIELD;
-		schedule();
-	}
 }

 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)

======

Ying Chen
[patch] nfsd optimizations for test10 (yet another try)
Neil,

Here is a set of fixes and answers to your questions/points. The new patch was tested in my own environment again and worked fine.

1/ Why did you change nfsd_busy into an atomic_t? It is only ever used or updated inside the Big-Kernel-Lock, so it doesn't need to be atomic.

I think I described why this was there in the previous email.

2/ Your new nfsd_racache_init always allocates a new cache, whereas the current one checks first to see if it has already been allocated. This is important because it is quite legal to run "rpc.nfsd" multiple times. Subsequent invocations serve to change the number of nfsd threads running.

Fixed.

3/ You currently allocate a single slab of memory for all of the "struct raparms". Admittedly this is what the old code did, but I don't think that it is really necessary, and calling kmalloc multiple times would work just as well and would (arguably) be clearer.

Changed to use kmalloc.

4/ Small point: you added a hash table as the comment suggests might be needed, but you didn't change the comment accordingly :-)

Fixed (but I didn't add much comment since it seems so straightforward).

5/ The calls to spin_lock/spin_unlock in nfsd_racache_init seem pointless. At this point, nothing else could possibly be accessing the racache, and if it was you would have even bigger problems. Ditto for nfsd_racache_shutdown.

Fixed.

6/ The lru list is now a list.h list, but the hash lists aren't. Why is that?

Fixed.

7/ The old code kept a 'use' count for each cache entry to make sure that an entry was not reused while it was in use. You have dropped this. Now, because of the lru ordering, and because each thread can use at most one entry, you won't have a problem if there are more cache entries than threads, and you currently have 2048 entries configured, which is greater than NFSD_MAXSERVS. However, I think it would be best if this dependency were made explicit. Maybe the call to nfsd_racache_init should tell the racache how many threads are being started, and nfsd_racache_init should record how many cache entries have been allocated, and it could allocate more if needed.

I'd disagree on creating a dependency between the number of NFSD threads and the number of cache entries. The number of cache entries is more a function of open/read files than anything else. Of course, you can argue that more NFSD threads could mean a larger number of files, but a sensible number (like 2048) would suffice for a huge number of NFSD threads. In practice, more than several hundred NFSD threads will probably never happen, even on large SMPs. Also, since 2048 entries really do not take much memory (a couple of hundred KB), it seems fine to simply go with that.

8/ I would like the stats collected to tell me a bit more about what was going on. I find simple hit/miss numbers nearly useless, as you expect many lookups to be misses anyway (the first time a file is accessed) but you don't know what percentage. As a first approximation, I would like to only count a miss if the seek address was > 0. What would be really nice would be stats on how long entries stayed in the cache between last use and re-use. If we stored a 'last-use' time in each entry, we could, on reuse, keep count of which range the age was in: 0-62 msec, 63-125 msec, 125-250 msec, 250-500 msec, 500-1000 msec, 1-2 sec, 2-4 sec, 4-8 sec, 8-16 sec, 16-32 sec. This obviously isn't critical, but it would be nice to be able to see how the cache was working.

Sure. I haven't put such things in this patch, but I'd be happy to roll them in later on, since they're non-critical at the moment.

9/ Actually, you don't need the spinlock at all, as nfsd currently runs entirely under the BigKernelLock, but it doesn't hurt to have it around the nfsd_get_raparms function, because we will hopefully get rid of the BKL one day.

Again, I explained why I had it in the previous email.

Regards,
Ying

Here is the patch.
The only files changed since the last patch were racache.h and nfsracache.c.

diff -ruN nfsd.orig/nfsd.h nfsd.opt/nfsd.h
--- nfsd.orig/nfsd.h	Fri Nov 10 15:27:37 2000
+++ nfsd.opt/nfsd.h	Fri Nov 10 16:03:43 2000
@@ -76,7 +76,7 @@
 /* nfsd/vfs.c */
 int	fh_lock_parent(struct svc_fh *, struct dentry *);
-int	nfsd_racache_init(int);
+int	nfsd_racache_init(void);
 void	nfsd_racache_shutdown(void);
 int	nfsd_lookup(struct svc_rqst *, struct svc_fh *, const char *,
 			int, struct svc_fh *);
diff -ruN nfsd.orig/racache.h nfsd.opt/racache.h
--- nfsd.orig/racache.h	Fri Nov 10 16:10:23 2000
+++ nfsd.opt/racache.h	Fri Nov 10 15:50:49 2000
@@ -0,0 +1,41 @@
+/*
+ * include/linux/nfsd/racache.h
+ *
+ * Read
[patch] nfsd optimizations for test10 (recoded to use list_head)
static struct raparms *	raparm_cache = NULL;
+static struct list_head *	hash_list;
+static struct list_head	lru_head;
+static struct list_head	free_head;
+
+int
+nfsd_racache_init(void)
+{
+	struct raparms		*rp;
+	struct list_head	*rahead;
+	size_t			i;
+	unsigned long		order;
+
+	i = CACHESIZE * sizeof (struct raparms);
+	for (order = 0; (PAGE_SIZE << order) < i; order++)
+		;
+	raparm_cache = (struct raparms *)
+		__get_free_pages(GFP_KERNEL, order);
+	if (!raparm_cache) {
+		printk(KERN_ERR "nfsd: cannot allocate %Zd bytes for racache\n", i);
+		return -1;
+	}
+	memset(raparm_cache, 0, i);
+
+	i = HASHSIZE * sizeof (struct list_head);
+	hash_list = kmalloc(i, GFP_KERNEL);
+	if (!hash_list) {
+		free_pages((unsigned long)raparm_cache, order);
+		raparm_cache = NULL;
+		printk(KERN_ERR "nfsd: cannot allocate %Zd bytes for hash list in racache\n", i);
+		return -1;
+	}
+
+	spin_lock(&racache_lock);
+	for (i = 0, rahead = hash_list; i < HASHSIZE; i++, rahead++)
+		INIT_LIST_HEAD(rahead);
+
+	INIT_LIST_HEAD(&free_head);
+	for (i = 0, rp = raparm_cache; i < CACHESIZE; i++, rp++) {
+		rp->p_hash_next = rp->p_hash_prev = rp;
+		list_add(&rp->p_lru, &free_head);
+	}
+	INIT_LIST_HEAD(&lru_head);
+	spin_unlock(&racache_lock);
+
+	nfsdstats.ra_size = CACHESIZE;
+	return 0;
+}
+
+void
+nfsd_racache_shutdown(void)
+{
+	size_t		i;
+	unsigned long	order;
+
+	i = CACHESIZE * sizeof (struct raparms);
+	for (order = 0; (PAGE_SIZE << order) < i; order++)
+		;
+	spin_lock(&racache_lock);
+	free_pages((unsigned long)raparm_cache, order);
+	raparm_cache = NULL;
+	kfree(hash_list);
+	hash_list = NULL;
+	spin_unlock(&racache_lock);
+}
+
+/* Insert a new entry into the hash table. */
+static inline struct raparms *
+nfsd_racache_insert(ino_t ino, dev_t dev)
+{
+	struct raparms *ra = NULL;
+	struct list_head *rap;
+
+	if (list_empty(&free_head)) {
+		/* Replace with LRU. */
+		struct raparms *prev, *next;
+		ra = list_entry(lru_head.prev, struct raparms, p_lru);
+		prev = ra->p_hash_prev,
+		next = ra->p_hash_next;
+		prev->p_hash_next = next;
+		next->p_hash_prev = prev;
+		ra->p_hash_next = NULL;
+		ra->p_hash_prev = NULL;
+		list_del(lru_head.prev);
+	} else {
+		ra = list_entry(free_head.next, struct raparms, p_lru);
+		list_del(free_head.next);
+	}
+
+	memset(ra, 0, sizeof(*ra));
+	ra->p_dev = dev;
+	ra->p_ino = ino;
+	rap = (struct list_head *) &hash_list[REQHASH(ino, dev)];
+	ra->p_hash_next = (struct raparms *)(rap->next);
+	ra->p_hash_prev = (struct raparms *) rap;
+	((struct raparms *)(rap->next))->p_hash_prev = ra;
+	rap->next = (struct list_head *)ra;
+
+	list_add(&ra->p_lru, &lru_head);
+	return ra;
+}
+
+/*
+ * Try to find an entry matching the current call in the cache. When none
+ * is found, we grab the oldest unlocked entry off the LRU list.
+ * Note that no operation within the loop may sleep.
+ */
+struct raparms *
+nfsd_get_raparms(dev_t dev, ino_t ino)
+{
+	struct raparms *rahead;
+	struct raparms *ra = NULL;
+
+	spin_lock(&racache_lock);
+
+	ra = rahead = (struct raparms *) &hash_list[REQHASH(ino, dev)];
+	while ((ra = ra->p_hash_next) != rahead) {
+		if ((ra->p_ino == ino) && (ra->p_dev == dev)) {
+			/* Do LRU reordering */
+			list_del(&ra->p_lru);
+			list_add(&ra->p_lru, &lru_head);
+			nfsdstats.ra_hits++;
+			goto found;
+		}
+	}
+
+	/* Did not find one. Get a new item and insert it into the hash table. */
+	ra = nfsd_racache_insert(ino, dev);
+	nfsdstats.ra_misses++;
+found:
+	spin_unlock(&racache_lock);
+	return ra;
+}

Ying Chen
problems with sync_all_inodes() in prune_icache() and kupdate()
Hi,

I'm wondering if someone can tell me why sync_all_inodes() is called in prune_icache(). sync_all_inodes() can cause problems in some situations when memory is short and shrink_icache_memory() is called. For instance, when the system is really short of memory, do_try_to_free_pages() is invoked (either by an application or by kswapd), and shrink_icache_memory() is invoked too; but when prune_icache() is called, the first thing it does is sync_all_inodes(). If an inode block is not in memory, it may have to bread() the block in, so kswapd can block until the inode block is brought into memory. Not only that: since the system is short of memory, there may not even be memory available for the inode block. And even if there is, given that a single kswapd thread is doing sync_all_inodes(), if the dirty inode list is relatively long (tens of thousands of entries, as in something like SPEC SFS), it takes practically forever for sync_all_inodes() to finish. To the user, the system looks hung (although it isn't really); it's just taking a very long time to do shrink_icache_memory!

One solution is not to call sync_all_inodes() at all in prune_icache(), since other parts of the kernel, like kupdate, will try to sync inodes periodically anyway. I don't know if this has other implications, but I don't see a problem with it myself. In fact, I have been using this fix in my own test9 kernel, and I get much smoother kernel behavior when running high-load SPEC SFS than with the default prune_icache(). When sync_all_inodes() is called, SPEC SFS sometimes simply fails due to the long response times on I/O requests.

A similar argument applies to the kupdate daemon. Since a single thread does both the inode and buffer flushing, under high load kupdate does not get a chance to call flush_dirty_buffers() until sync_inodes() has completed. But sync_inodes() can take forever, since inodes are flushed serially to disk. Imagine how long it might take if each inode flush causes one read from disk! In my experience with SPEC SFS, if kupdate is invoked during the run, it sometimes cannot finish sync_inodes() until the entire benchmark run is over. So all the dirty buffers that flush_dirty_buffers(1) is supposed to flush never get flushed during the benchmark run, and the system constantly runs in bdflush mode, which is really only supposed to happen in a panic situation! Again, the solution can be simple: create multiple dirty-buffer-flushing daemon threads that call flush_dirty_buffers() without the sync_super or sync_inodes stuff. I have done so in my own test9 kernel, and the results with SPEC SFS are much more pleasant.

Ying
[patch] wakeup_bdflush related fixes and nfsd optimizations for test10
Hi,

This patch includes two sets of changes against test10.

First, there are several places where schedule() is called after wakeup_bdflush(1). This is completely unnecessary: wakeup_bdflush(1) has already given up control, and when control returns to the thread that called it, that thread should simply go on. Calling schedule() after wakeup_bdflush(1) makes the calling thread give up control a second time. This is a problem for latency-sensitive benchmarks (like SPEC SFS) and applications.

Second (I posted this to the kernel mailing list, but forgot to cc Linus), I made some optimizations to the racache in nfsd in test10. The idea is to replace the existing fixed-length table for the readahead cache in NFSD with a hash table. The old racache is essentially ineffective at dealing with a large number of files, yet eats CPU cycles scanning the table (even though the table is small); the hash-table-based cache is much more effective and faster. I have generated the patch for test10 and tested it. (See attached file: a)

Ying Chen
[EMAIL PROTECTED]
IBM Almaden Research Center
[patch] nfsd optimizations for test10
Hi,

I made some optimizations to the racache in nfsd in test10. The idea is to replace the existing fixed-length table for the readahead cache in NFSD with a hash table. The old racache is essentially ineffective when dealing with a large number of files, yet it eats CPU cycles scanning the table (even though the table is small); the hash-table-based version is much more effective and fast. I have generated the patch for test10 and tested it. (See attached file: nfshdiff) (See attached file: nfsdiff)

Ying
Re: VM in v2.4.0test9
I'd second that this is most likely a VM-related problem. A few days ago I sent you an example where I could make the system hang simply by doing a mkfs on a 90 GB file system. This happens when the low 1 GB of memory is used up (but I still have the high 1 GB available). I think David probably ran into the same problem as I did.

I traced the problem down a bit. It seems that the system goes into a loop that consists of the following call sequence: alloc_pages --> try_to_free_pages --> do_try_to_free_pages (but it doesn't seem to go through page_launder in do_try_to_free_pages) --> kmem_cache_reap (so it skipped the refill_inactive() in the if statement) --> back to try_again in alloc_pages. If I kill mkfs, the system goes on nicely for some small applications, though it is still a bit slow. If I do a "make bzImage", the system hangs again. Looking at the SysRq output, I had more than 2000 pages available in the inactive_clean list; I wonder why alloc_pages doesn't take them. I had no problems with these kinds of things with test6 or test7.

Ying

Rik van Riel <[EMAIL PROTECTED]>@vger.kernel.org on 10/04/2000 09:31:21 AM
Sent by: [EMAIL PROTECTED]
To: David Weinehall <[EMAIL PROTECTED]>
cc: [EMAIL PROTECTED], Linus Torvalds <[EMAIL PROTECTED]>
Subject: Re: VM in v2.4.0test9

On Wed, 4 Oct 2000, David Weinehall wrote:

> Running the included program on a clean v2.4.0test9 kernel I can
> hang the computer practically in no time.
> What seems most strange is that the doesn't even get depleted.
> The machine still answers to SysRq and ping, but nothing else.

Looking again at this report in more detail, something very strange is going on ...

> This is what I got from SysRq+M (manual copy):
>
> Free pages: 500 kB (0 highmem)
> Active: 8, inactive dirty: 1009, inactive clean: 0
> free: 125 (31 62 93)

First, you have MORE free memory than freepages.high. In this case I really don't see why __alloc_pages() wouldn't give the memory to your processes.

> Free swap: 64772

And there is tons of swap free ...
Are you absolutely sure this is VM-related? This almost looks like the system puts in a read request but the request queue doesn't get unplugged, or something strange like that ... There is more than enough memory to satisfy all VM requests, and the loop in __alloc_pages() is straightforward enough to give your processes their memory without strange bugs ...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
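The spin Ying describes can be modeled in a few lines of userspace C. This is not the actual 2.4.0test9 allocator, just a hedged illustration of the reported symptom: the retry loop keeps spinning as long as the slab-reap step claims progress, while the 2000+ pages on the inactive_clean list are never consulted and so cannot satisfy the request. All names and numbers below are made up.

```c
#include <assert.h>

/* Illustrative model of the reported loop:
 *   alloc_pages --> try_to_free_pages --> kmem_cache_reap --> try_again
 * The free list is empty, the reap step "succeeds" without freeing
 * anything usable, and the inactive_clean pages are invisible to the
 * allocator, so the loop makes no forward progress. */

static int free_pages     = 0;     /* nothing on the free list        */
static int inactive_clean = 2000;  /* pages the loop never looks at   */

/* Stand-in for kmem_cache_reap(): reports progress without actually
 * putting any page on the free list. */
static int reap_slab_caches(void)
{
    return 1;  /* "made progress", so the caller retries forever */
}

/* Model of the allocation loop, with an iteration cap so it
 * terminates in this demo.  Returns 1 on success, 0 on failure,
 * and records how many passes were made. */
static int alloc_page_model(int max_tries, int *tries_out)
{
    int tries = 0;

    while (tries < max_tries) {
        tries++;
        if (free_pages > 0) {          /* fast path: take a free page */
            free_pages--;
            *tries_out = tries;
            return 1;
        }
        /* Symptom being illustrated: inactive_clean is never
         * consulted, so its 2000 pages cannot satisfy the request. */
        if (!reap_slab_caches())
            break;                     /* would give up or sleep */
        /* otherwise: goto try_again */
    }
    *tries_out = tries;
    return 0;
}
```

Under these assumptions the model spins for the full iteration budget and fails while inactive_clean still holds 2000 pages, matching the SysRq observation that plenty of clean pages were available yet allocation made no progress.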