Re: raid5 high cpu usage during reads - oprofile results
On Sat, 1 Apr 2006, Alex Izvorski wrote:

> On Sat, 2006-04-01 at 14:28 -0800, dean gaudet wrote:
> > i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
> > otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
> > chunksize... which would decrease the number of entries -- much more
> > desirable than increasing the number of buckets.
>
> Dean - that is an interesting thought. I can't think of a reason why
> not, except that it is the same as the page size? But offhand I don't
> see any reason why that is a particularly good choice either. Would the
> code work with other sizes? What about a variable (per array) size?
> How would that interact with small reads?

i don't understand the code well enough...

> Do you happen to know how many find_stripe calls there are for each
> read? I rather suspect it is several (many) times per sector, since it
> uses up something on the order of several thousand clock cycles per
> *sector* (reading 400k sectors per second produces 80% load on 2x
> 2.4GHz cpus, of which get_active_stripe accounts for ~30% - that's 2.8k
> clock cycles per sector just in that one function). I really don't see
> any way a single hash lookup, even in a table with ~30 entries per
> bucket, could do anything close to that.

well the lists are all struct stripe_heads... which on i386 seem to be
0x30 + 0x6c*(devs - 1) bytes each. that's pretty big.

they're allocated in a slab, so they're relatively well packed into
pages... but still, unless i've messed up somewhere that's 480 bytes for
a 5 disk raid5. so that's only 8 per page... so a chain of length 30
touches at least 4 pages.

if you're hitting all 512 buckets, chains of length 30, then you're
looking at somewhere on the order of 2048 pages... that causes a lot of
thrashing in the TLBs... and isn't so great on the cache either.

it's even worse on x86_64 ... it looks like 0xf8 + 0xb0*(devs - 1) bytes
per stripe_head ...
(i'm pulling these numbers from the call setup for kmem_cache_create in
the disassembly of raid5.ko from kernels on my boxes.)

oh btw you might get a small improvement by moving the "sector" field of
struct stripe_head close to the hash field... right now the sector field
is at 0x28 (x86_64) and so it's probably on a different cache line from
the "hash" field at offset 0 about half the time (64 byte cache line).
if you move sector to right after the "hash" field it'll more likely be
on the same cache line... but still, i think the tlb is the problem.

oh you can probably ask oprofile to tell you if you're seeing cache miss
or tlb miss stalls there (not sure on the syntax).

> Short of changing STRIPE_SIZE, it should be enough to make sure the
> average bucket occupancy is considerably less than one - as long as the
> occupancy is kept low, the speed of access is independent of the
> number of entries. 256 stripe cache entries and 512 hash buckets work
> well with a 0.5 max occupancy; we should ideally have at least 32k
> buckets (or 64 pages) for 16k entries. Yeah, ok, it's quite a bit more
> memory than is used now, but considering that the box I'm running this
> on has 4GB, it's not that much ;)

i still don't understand all the code well enough... but if i assume
there's a good reason for STRIPE_SIZE == PAGE_SIZE then it seems like
you need to improve the cache locality of the hash chaining... a linked
list of struct stripe_heads doesn't have very good locality because
they're such large structures.

one possibility is a linked list of:

struct stripe_hash_entry {
	struct hlist_node hash;
	sector_t sector;
	struct stripe_head *sh;
};

but that's still 32 bytes on x86_64 ... you can get it down to 16 bytes
by getting rid of chaining and using open addressing... eh ... this
still isn't that hot... really there's too much pressure because there's
a hash table entry per 4KiB of disk i/o...
anyhow i'm only eyeballing code here, i could easily have missed some
critical details.

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5 high cpu usage during reads - oprofile results
On Sat, 2006-04-01 at 14:28 -0800, dean gaudet wrote:
> i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
> otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
> chunksize... which would decrease the number of entries -- much more
> desirable than increasing the number of buckets.

Dean - that is an interesting thought. I can't think of a reason why
not, except that it is the same as the page size? But offhand I don't
see any reason why that is a particularly good choice either. Would the
code work with other sizes? What about a variable (per array) size? How
would that interact with small reads?

Do you happen to know how many find_stripe calls there are for each
read? I rather suspect it is several (many) times per sector, since it
uses up something on the order of several thousand clock cycles per
*sector* (reading 400k sectors per second produces 80% load on 2x 2.4GHz
cpus, of which get_active_stripe accounts for ~30% - that's 2.8k clock
cycles per sector just in that one function). I really don't see any way
a single hash lookup, even in a table with ~30 entries per bucket, could
do anything close to that.

Short of changing STRIPE_SIZE, it should be enough to make sure the
average bucket occupancy is considerably less than one - as long as the
occupancy is kept low, the speed of access is independent of the number
of entries. 256 stripe cache entries and 512 hash buckets work well with
a 0.5 max occupancy; we should ideally have at least 32k buckets (or 64
pages) for 16k entries. Yeah, ok, it's quite a bit more memory than is
used now, but considering that the box I'm running this on has 4GB, it's
not that much ;)

--Alex
Re: raid5 high cpu usage during reads - oprofile results
On Sat, 1 Apr 2006, Alex Izvorski wrote:
> Dean - I think I see what you mean, you're looking at this line in the
> assembly?
>
> 65830 16.8830 :     c1f: cmp %rcx,0x28(%rax)

yup that's the one... that's probably a fair number of cache (or tlb)
misses going on right there.

> I looked at the hash stuff, I think the problem is not that the hash
> function is poor, but rather that the number of entries in all buckets
> gets to be pretty high.

yeah... your analysis seems more likely. i suppose increasing the number
of buckets is the only option. it looks to me like you'd just need to
change NR_HASH and the kzalloc in run() in order to increase the number
of buckets.

i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause
otherwise it'd be cool to run with STRIPE_SIZE the same as your raid
chunksize... which would decrease the number of entries -- much more
desirable than increasing the number of buckets.

-dean
Re: raid5 high cpu usage during reads - oprofile results
On Sat, 2006-03-25 at 21:38 -0800, dean gaudet wrote:
> On Sat, 25 Mar 2006, Alex Izvorski wrote:
> > http://linuxraid.pastebin.com/621363 - oprofile annotated assembly
>
> it looks to me like a lot of time is spent in __find_stripe() ... i
> wonder if the hash is working properly.
>
> in raid5.c you could try changing
>
> #define stripe_hash(conf, sect) \
>	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
>
> to
>
> #define stripe_hash(conf, sect) \
>	(&((conf)->stripe_hashtbl[(((sect) >> STRIPE_SHIFT) ^ \
>	((sect) >> (2*STRIPE_SHIFT))) & HASH_MASK]))
>
> or maybe try using jhash_1word((sect) >> STRIPE_SHIFT, 0) ...
>
> -dean

Dean - I think I see what you mean, you're looking at this line in the
assembly?

65830 16.8830 :     c1f: cmp %rcx,0x28(%rax)

I looked at the hash stuff, I think the problem is not that the hash
function is poor, but rather that the number of entries in all buckets
gets to be pretty high. I haven't actually run this with extra debugging
statements yet, but as far as I can tell from reading the code, the
hash-related defines will get the following values:

NR_HASH = 512
HASH_MASK = 0x1ff
STRIPE_SHIFT = 3

With this, every successive stripe will go to a different bucket, so for
a sequential access this is a good hash function (actually, to break it
you'd need to read 4k every 2M, which is not a common access pattern).
The problem is that there are too few buckets: with 16k entries in the
stripe cache and 512 buckets we'd have 32 elements per bucket. Would
redefining HASH_PAGES to 16 or so help? For that matter, is the code
even expected to work with HASH_PAGES > 1? Perhaps someone who is more
familiar with the code can comment on this.

Regards,
--Alex
Re: raid5 high cpu usage during reads - oprofile results
On Sat, 25 Mar 2006, Alex Izvorski wrote:
> http://linuxraid.pastebin.com/621363 - oprofile annotated assembly

it looks to me like a lot of time is spent in __find_stripe() ... i
wonder if the hash is working properly.

in raid5.c you could try changing

#define stripe_hash(conf, sect) \
	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))

to

#define stripe_hash(conf, sect) \
	(&((conf)->stripe_hashtbl[(((sect) >> STRIPE_SHIFT) ^ \
	((sect) >> (2*STRIPE_SHIFT))) & HASH_MASK]))

or maybe try using jhash_1word((sect) >> STRIPE_SHIFT, 0) ...

-dean
raid5 high cpu usage during reads - oprofile results
Hello,

I profiled some raid5 reads using oprofile to try to track down the
suspiciously high cpu load I see. This uses the same 8-disk SATA setup
as I had described earlier. One of the runs is on a 1MB chunk raid5, the
other on a 32MB chunk raid5.

As Neil suggested, memcpy is a big part of the cpu load. The rest of it
appears to be in handle_stripe and get_active_stripe - these three
account for most of the load, with the remainder fairly evenly
distributed among a dozen other routines. If using a large chunk size,
handle_stripe and get_active_stripe will predominate (and result in some
truly abnormal cpu loads).

I am attaching annotated (from opannotate) source and assembly for
raid5.c from the second (32MB chunk) run. The annotated results do not
make much sense to me, but I suspect that the exact line numbers etc.
may be shifted slightly, as usually happens with optimized builds. I
hope this will be useful.

Regards,
--Alex

P.S. I had originally mailed the oprofile results as attachments to the
list, but I think they didn't go through. I put them at:

http://linuxraid.pastebin.com/621363 - oprofile annotated assembly
http://linuxraid.pastebin.com/621364 - oprofile annotated source

Sorry if you get this email twice.
---
mdadm --create /dev/md0 --level=raid5 --chunk=1024 --raid-devices=8 \
    --size=10485760 /dev/sd[abcdefgh]1
echo "8192" > /sys/block/md0/md/stripe_cache_size
./test_aio -f /dev/md0 -T 10 -s 60G -r 8M -n 14

throughput 205MB/s, cpu load 40%

opreport --symbols --image-path=/lib/modules/2.6.15-gentoo-r7/kernel/
samples  %        image name  app name  symbol name
91839    42.9501  vmlinux     vmlinux   memcpy
25946    12.1341  raid5.ko    raid5     handle_stripe
17732     8.2927  raid5.ko    raid5     get_active_stripe
5454      2.5507  vmlinux     vmlinux   blk_rq_map_sg
4850      2.2682  vmlinux     vmlinux   __delay
4726      2.2102  raid5.ko    raid5     raid5_compute_sector
4588      2.1457  raid5.ko    raid5     copy_data
4389      2.0526  raid5.ko    raid5     .text
4362      2.0400  raid5.ko    raid5     make_request
3688      1.7248  vmlinux     vmlinux   clear_page
3548      1.6593  raid5.ko    raid5     raid5_end_read_request
2594      1.2131  vmlinux     vmlinux   blk_recount_segments
1944      0.9091  vmlinux     vmlinux   generic_make_request
1627      0.7609  vmlinux     vmlinux   dma_map_sg
1555      0.7272  vmlinux     vmlinux   get_user_pages
1540      0.7202  libata.ko   libata    ata_bmdma_status
1464      0.6847  vmlinux     vmlinux   __make_request
1350      0.6314  vmlinux     vmlinux   follow_page
1098      0.5135  vmlinux     vmlinux   finish_wait
...(others under 0.5%)

---
mdadm --create /dev/md0 --level=raid5 --chunk=32678 --raid-devices=8 \
    --size=10485760 /dev/sd[abcdefgh]1
echo "16384" > /sys/block/md0/md/stripe_cache_size
./test_aio -f /dev/md0 -T 10 -s 60G -r 8M -n 14

throughput 207MB/s, cpu load 80%

opreport --symbols --image-path=/lib/modules/2.6.15-gentoo-r7/kernel/
samples  %        image name  app name  symbol name
112441   28.2297  raid5.ko    raid5     get_active_stripe
86826    21.7988  vmlinux     vmlinux   memcpy
78688    19.7556  raid5.ko    raid5     handle_stripe
18796     4.7190  raid5.ko    raid5     .text
13301     3.3394  raid5.ko    raid5     raid5_compute_sector
8339      2.0936  raid5.ko    raid5     make_request
5881      1.4765  vmlinux     vmlinux   blk_rq_map_sg
5463      1.3716  raid5.ko    raid5     raid5_end_read_request
4269      1.0718  vmlinux     vmlinux   __delay
4131      1.0371  raid5.ko    raid5     copy_data
3531      0.8865  vmlinux     vmlinux   clear_page
2964      0.7441  vmlinux     vmlinux   blk_recount_segments
2617      0.6570  vmlinux     vmlinux   get_user_pages
2025      0.5084  vmlinux     vmlinux   dma_map_sg
...(others under 0.5%)
---