[PATCH][WIP] dio rewrite
Last posting: http://marc.info/?l=linux-fsdevel&m=136063048002755&w=2

Got it working and ran some benchmarks. On a high end SSD, doing 4k random reads with fio I got around a 30% increase in throughput. (That was without the aio changes I recently did. With those, throughput was approximately doubled.)

The decrease in compiled binary size is even more dramatic than the reduction in LOC:

3.8-rc7:
   text	   data	    bss	    dec	    hex	filename
  11609	     16	      8	  11633	   2d71	fs/direct-io.o

My version:
   text	   data	    bss	    dec	    hex	filename
   3545	     16	      0	   3561	    de9	fs/direct-io.o

It's only been lightly tested - I haven't run xfstests yet - but there shouldn't be anything broken excluding btrfs.

There's a few more performance optimizations I may do, but aside from the btrfs issues I think it's essentially done. Due to the sheer number of hairy corner cases in the dio code, I'd really like to get as much review as possible. The new code should be vastly easier to review and understand, I think.

Git repo: http://evilpiepirate.org/git/linux-bcache.git block_stuff

---
 fs/direct-io.c | 1323 +++-
 1 file changed, 357 insertions(+), 966 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8e838b1..1fb9fb4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -8,7 +8,7 @@
  * 04Jul2002	Andrew Morton
  *		Initial version
  * 11Sep2002	janet...@us.ibm.com
- *		added readv/writev support.
+ *		added readv/writev support.
  * 29Oct2002	Andrew Morton
  *		rewrote bio_add_page() support.
  * 30Oct2002	pbad...@us.ibm.com
@@ -38,183 +38,37 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
 
-/*
- * How many user pages to map in one call to get_user_pages().  This determines
- * the size of a structure in the slab cache
- */
-#define DIO_PAGES	64
-
-/*
- * This code generally works in units of "dio_blocks".  A dio_block is
- * somewhere between the hard sector size and the filesystem block size.  it
- * is determined on a per-invocation basis.  When talking to the filesystem
- * we need to convert dio_blocks to fs_blocks by scaling the dio_block quantity
- * down by dio->blkfactor.  Similarly, fs-blocksize quantities are converted
- * to bio_block quantities by shifting left by blkfactor.
- *
- * If blkfactor is zero then the user's request was aligned to the filesystem's
- * blocksize.
- */
-
-/* dio_state only used in the submission path */
-
-struct dio_submit {
-	struct bio *bio;		/* bio under assembly */
-	unsigned blkbits;		/* doesn't change */
-	unsigned blkfactor;		/* When we're using an alignment which
-					   is finer than the filesystem's soft
-					   blocksize, this specifies how much
-					   finer.  blkfactor=2 means 1/4-block
-					   alignment.  Does not change */
-	unsigned start_zero_done;	/* flag: sub-blocksize zeroing has
-					   been performed at the start of a
-					   write */
-	int pages_in_io;		/* approximate total IO pages */
-	size_t	size;			/* total request size (doesn't change)*/
-	sector_t block_in_file;		/* Current offset into the underlying
-					   file in dio_block units. */
-	unsigned blocks_available;	/* At block_in_file.  changes */
-	int reap_counter;		/* rate limit reaping */
-	sector_t final_block_in_request;/* doesn't change */
-	unsigned first_block_in_page;	/* doesn't change, Used only once */
-	int boundary;			/* prev block is at a boundary */
-	get_block_t *get_block;		/* block mapping function */
-	dio_submit_t *submit_io;	/* IO submition function */
-
-	loff_t logical_offset_in_bio;	/* current first logical block in bio */
-	sector_t final_block_in_bio;	/* current final block in bio + 1 */
-	sector_t next_block_for_io;	/* next block to be put under IO,
-					   in dio_blocks units */
-
-	/*
-	 * Deferred addition of a page to the dio.  These variables are
-	 * private to dio_send_cur_page(), submit_page_section() and
-	 * dio_bio_add_page().
-	 */
-	struct page *cur_page;		/* The page */
-	unsigned cur_page_offset;	/* Offset into it, in bytes */
-	unsigned cur_page_len;		/* Nr of bytes at cur_page_offset */
-	sector_t cur_page_block;	/* Where it starts */
-	loff_t cur_page_fs_offset;	/* Offset in file */
-
-	/*
-	 * Page fetching state. These variables belong to dio_refill_pages().
-	 */
-	int curr_page;			/*
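For context, the 4k random read numbers above came from fio exercising the O_DIRECT path. The posting doesn't include the job file, so the following is only a plausible sketch of such a run; the device path, queue depth, and runtime are guesses, not the author's actual configuration:

```ini
; hypothetical fio job -- not from the original posting
[global]
direct=1          ; O_DIRECT, so reads go through fs/direct-io.c
ioengine=libaio
rw=randread
bs=4k
iodepth=32
runtime=60
time_based

[dio-randread]
filename=/dev/sdb ; substitute the SSD under test
```

Run with `fio jobfile.fio`; comparing the read IOPS/bandwidth line before and after the patch is the sort of measurement the 30% figure refers to.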
Re: [PATCH][WIP] dio rewrite
On Mon, Feb 11, 2013 at 05:24:13PM -0800, Andi Kleen wrote:
> On Mon, Feb 11, 2013 at 04:53:26PM -0800, Kent Overstreet wrote:
> > I finally started hacking on the dio code, and it's far from done but
> > it's turning out better than I expected so I thought I'd show off what
> > I've got so far.
>
> The critical metric for some of the highend workloads is the number of
> cache lines touched in the submission path (especially potentially
> modified ones). How does your new code fare on that?

I haven't started looking at performance yet, except to sanity check it - but it should be substantially better, mostly due to getting rid of dio->pages. As far as touching shared cachelines goes, I don't think it's any different from the old code, though.

There's also icache, which should be hugely better. But the code's changed so much that I'll have to spend some time optimizing it before I can make a reasonable comparison, after I've finished with all the correctness issues.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH][WIP] dio rewrite
On Mon, Feb 11, 2013 at 04:53:26PM -0800, Kent Overstreet wrote:
> I finally started hacking on the dio code, and it's far from done but
> it's turning out better than I expected so I thought I'd show off what
> I've got so far.

The critical metric for some of the highend workloads is the number of cache lines touched in the submission path (especially potentially modified ones). How does your new code fare on that?

-Andi
[PATCH][WIP] dio rewrite
I finally started hacking on the dio code, and it's far from done but it's turning out better than I expected, so I thought I'd show off what I've got so far.

The end result is _vastly_ simpler - direct-io.c is now ~700 lines, vs. ~1300 previously. dio_submit is almost gone; I'm down to 4 things left in it.

It relies heavily on my block layer patches for efficient bio splitting, and for making generic_make_request() take arbitrary size bios. I just rebased those on 3.8-rc the other day; I need to start sending them to Jens again.

The approach it takes sort of inverts the way the code was structured before. It starts out by allocating a bio, then it calls get_user_pages() to fill up the bvec. Then it figures out what to do with the bio by calling getblocks() repeatedly - where to send it if it's mapped - splitting the bio as needed. This gets rid of a _ton_ of state in struct dio_submit.

It also gets rid of the various differences between async and sync requests - previously, for async reads it marked pages dirty before submitting the io (in process context), then on completion punted to a workqueue to redirty the pages if any needed to be. This now happens for sync reads, too.

There was also some level of throttling for the number of pages pinned - but this only worked for sync requests. I tossed it out - if we do need that throttling (which I'm skeptical of, considering it didn't work for aio), we'll want to come up with something saner.

There's still various broken stuff - dio_zero_block() needs to be reimplemented, and more critically dio_end_io() (and maybe the filesystem dio completions) do stuff with the biovec - the bio splitting breaks this, because the bvec is shared between bios.
---
 fs/direct-io.c | 1217 +++-
 1 file changed, 312 insertions(+), 905 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8e838b1..eb4479f 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -8,7 +8,7 @@
  * 04Jul2002	Andrew Morton
  *		Initial version
  * 11Sep2002	janet...@us.ibm.com
- *		added readv/writev support.
+ *		added readv/writev support.
  * 29Oct2002	Andrew Morton
  *		rewrote bio_add_page() support.
  * 30Oct2002	pbad...@us.ibm.com
@@ -38,184 +38,38 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
 
-/*
- * How many user pages to map in one call to get_user_pages().  This determines
- * the size of a structure in the slab cache
- */
-#define DIO_PAGES	64
-
-/*
- * This code generally works in units of "dio_blocks".  A dio_block is
- * somewhere between the hard sector size and the filesystem block size.  it
- * is determined on a per-invocation basis.  When talking to the filesystem
- * we need to convert dio_blocks to fs_blocks by scaling the dio_block quantity
- * down by dio->blkfactor.  Similarly, fs-blocksize quantities are converted
- * to bio_block quantities by shifting left by blkfactor.
- *
- * If blkfactor is zero then the user's request was aligned to the filesystem's
- * blocksize.
- */
-
-/* dio_state only used in the submission path */
-
 struct dio_submit {
-	struct bio *bio;		/* bio under assembly */
-	unsigned blkbits;		/* doesn't change */
-	unsigned blkfactor;		/* When we're using an alignment which
-					   is finer than the filesystem's soft
-					   blocksize, this specifies how much
-					   finer.  blkfactor=2 means 1/4-block
-					   alignment.  Does not change */
-	unsigned start_zero_done;	/* flag: sub-blocksize zeroing has
-					   been performed at the start of a
-					   write */
-	int pages_in_io;		/* approximate total IO pages */
-	size_t	size;			/* total request size (doesn't change)*/
-	sector_t block_in_file;		/* Current offset into the underlying
-					   file in dio_block units. */
-	unsigned blocks_available;	/* At block_in_file.  changes */
-	int reap_counter;		/* rate limit reaping */
-	sector_t final_block_in_request;/* doesn't change */
-	unsigned first_block_in_page;	/* doesn't change, Used only once */
-	int boundary;			/* prev block is at a boundary */
-	get_block_t *get_block;		/* block mapping function */
-	dio_submit_t *submit_io;	/* IO submition function */
-
-	loff_t logical_offset_in_bio;	/* current first logical block in bio */
-	sector_t final_block_in_bio;	/* current final block in bio + 1 */
-	sector_t next_block_for_io;	/* next block to be put under IO,
-					   in dio_blocks units */
-
-	/*
-	 * Deferred addition of a page to the dio.