[PATCH][WIP] dio rewrite

2013-02-13 Thread Kent Overstreet
Last posting: http://marc.info/?l=linux-fsdevel&m=136063048002755&w=2

Got it working and ran some benchmarks. On a high-end SSD, doing 4k
random reads with fio I got around a 30% increase in throughput.

(That was without the aio changes I recently did. With those, throughput
was approximately doubled.)

The decrease in compiled binary size is even more dramatic than the
reduction in LOC:

3.8-rc7:
   text    data     bss     dec     hex filename
  11609      16       8   11633    2d71 fs/direct-io.o

My version:
   text    data     bss     dec     hex filename
   3545      16       0    3561     de9 fs/direct-io.o

It's only been lightly tested - I haven't run xfstests yet - but aside
from btrfs nothing should be broken.

There are a few more performance optimizations I may do, but aside from
the btrfs issues I think it's essentially done.

Due to the sheer number of hairy corner cases in the dio code, I'd
really like to get as much review as possible. The new code should be
vastly easier to review and understand, I think.

Git repo:
http://evilpiepirate.org/git/linux-bcache.git block_stuff

---
 fs/direct-io.c | 1323 +++-
 1 file changed, 357 insertions(+), 966 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8e838b1..1fb9fb4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -8,7 +8,7 @@
  * 04Jul2002   Andrew Morton
  * Initial version
  * 11Sep2002   janet...@us.ibm.com
- * added readv/writev support.
+ * added readv/writev support.
  * 29Oct2002   Andrew Morton
  * rewrote bio_add_page() support.
  * 30Oct2002   pbad...@us.ibm.com
@@ -38,183 +38,37 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
 
-/*
- * How many user pages to map in one call to get_user_pages().  This determines
- * the size of a structure in the slab cache
- */
-#define DIO_PAGES  64
-
-/*
- * This code generally works in units of "dio_blocks".  A dio_block is
- * somewhere between the hard sector size and the filesystem block size.  it
- * is determined on a per-invocation basis.   When talking to the filesystem
- * we need to convert dio_blocks to fs_blocks by scaling the dio_block quantity
- * down by dio->blkfactor.  Similarly, fs-blocksize quantities are converted
- * to bio_block quantities by shifting left by blkfactor.
- *
- * If blkfactor is zero then the user's request was aligned to the filesystem's
- * blocksize.
- */
-
 /* dio_state only used in the submission path */
-
 struct dio_submit {
-   struct bio *bio;/* bio under assembly */
-   unsigned blkbits;   /* doesn't change */
-   unsigned blkfactor; /* When we're using an alignment which
-  is finer than the filesystem's soft
-  blocksize, this specifies how much
-  finer.  blkfactor=2 means 1/4-block
-  alignment.  Does not change */
-   unsigned start_zero_done;   /* flag: sub-blocksize zeroing has
-  been performed at the start of a
-  write */
-   int pages_in_io;/* approximate total IO pages */
-   size_t  size;   /* total request size (doesn't change)*/
-   sector_t block_in_file; /* Current offset into the underlying
-  file in dio_block units. */
-   unsigned blocks_available;  /* At block_in_file.  changes */
-   int reap_counter;   /* rate limit reaping */
-   sector_t final_block_in_request;/* doesn't change */
-   unsigned first_block_in_page;   /* doesn't change, Used only once */
-   int boundary;   /* prev block is at a boundary */
-   get_block_t *get_block; /* block mapping function */
-   dio_submit_t *submit_io;/* IO submition function */
-
-   loff_t logical_offset_in_bio;   /* current first logical block in bio */
-   sector_t final_block_in_bio;/* current final block in bio + 1 */
-   sector_t next_block_for_io; /* next block to be put under IO,
-  in dio_blocks units */
-
-   /*
-* Deferred addition of a page to the dio.  These variables are
-* private to dio_send_cur_page(), submit_page_section() and
-* dio_bio_add_page().
-*/
-   struct page *cur_page;  /* The page */
-   unsigned cur_page_offset;   /* Offset into it, in bytes */
-   unsigned cur_page_len;  /* Nr of bytes at cur_page_offset */
-   sector_t cur_page_block;/* Where it starts */
-   loff_t cur_page_fs_offset;  /* Offset in file */
-
-   /*
-* Page fetching state. These variables belong to dio_refill_pages().
-*/
-   int curr_page;  /* 


Re: [PATCH][WIP] dio rewrite

2013-02-11 Thread Kent Overstreet
On Mon, Feb 11, 2013 at 05:24:13PM -0800, Andi Kleen wrote:
> On Mon, Feb 11, 2013 at 04:53:26PM -0800, Kent Overstreet wrote:
> > I finally started hacking on the dio code, and it's far from done but
> > it's turning out better than I expected so I thought I'd show off what
> > I've got so far.
> 
> The critical metric for some of the highend workloads 
> is the number of cache lines touched in the submission path
> (especially potentially modified ones). How does your new code fare on that?

I haven't started looking at performance yet, except to sanity check it
- but it should be substantially better, mostly due to getting rid of
dio->pages.

As far as touching shared cachelines I don't think it's any different
than the old code, though.

There's also icache, which should be hugely better.

But the code's changed so much I'll have to spend some time optimizing
it before I can make a reasonable comparison, after I've finished with
all the correctness issues.


Re: [PATCH][WIP] dio rewrite

2013-02-11 Thread Andi Kleen
On Mon, Feb 11, 2013 at 04:53:26PM -0800, Kent Overstreet wrote:
> I finally started hacking on the dio code, and it's far from done but
> it's turning out better than I expected so I thought I'd show off what
> I've got so far.

The critical metric for some of the highend workloads 
is the number of cache lines touched in the submission path
(especially potentially modified ones). How does your new code fare on that?

-Andi


[PATCH][WIP] dio rewrite

2013-02-11 Thread Kent Overstreet
I finally started hacking on the dio code, and it's far from done but
it's turning out better than I expected so I thought I'd show off what
I've got so far.

The end result is _vastly_ simpler - direct-io.c is now ~700 lines, vs.
~1300 previously. dio_submit is almost gone - I'm down to 4 things left
in it.

It relies heavily on my block layer patches for efficient bio splitting
and for making generic_make_request() accept arbitrary size bios. I just
rebased those on 3.8-rc the other day; I need to start sending them to
Jens again.

The approach it takes sort of inverts the way the code was structured
before. It starts out by allocating a bio, then it calls
get_user_pages() to fill up the bvec.
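
As a rough sketch of that first step (not code from the patch - the
function name, the temporary page array and the error handling are all
made up for illustration, and it leans on the block layer changes that
let a single bio be arbitrarily large):

static struct bio *dio_bio_from_user(unsigned long uaddr, size_t len, int rw)
{
	/*
	 * Hypothetical sketch: allocate one bio big enough for the whole
	 * user buffer, then pin the user pages and drop them straight
	 * into its bvec.  The real code avoids the temporary page array;
	 * it's only here to keep the sketch simple.
	 */
	unsigned nr_pages = DIV_ROUND_UP(offset_in_page(uaddr) + len, PAGE_SIZE);
	struct page *pages[32];
	struct bio *bio;
	int i, ret;

	/* Relies on generic_make_request() accepting arbitrary size bios */
	bio = bio_alloc(GFP_KERNEL, nr_pages);
	if (!bio)
		return ERR_PTR(-ENOMEM);

	while (len) {
		unsigned n = min_t(unsigned, nr_pages, ARRAY_SIZE(pages));

		/* For a read, the device writes to user memory: write = 1 */
		ret = get_user_pages_fast(uaddr, n, rw == READ, pages);
		if (ret <= 0)
			goto err;

		for (i = 0; i < ret; i++) {
			unsigned off   = offset_in_page(uaddr);
			unsigned bytes = min_t(size_t, len, PAGE_SIZE - off);
			struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt++];

			bv->bv_page   = pages[i];
			bv->bv_offset = off;
			bv->bv_len    = bytes;
			bio->bi_size += bytes;

			uaddr += bytes;
			len   -= bytes;
		}
		nr_pages -= ret;
	}

	return bio;
err:
	for (i = 0; i < bio->bi_vcnt; i++)
		page_cache_release(bio->bi_io_vec[i].bv_page);
	bio_put(bio);
	return ERR_PTR(ret ? ret : -EFAULT);
}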

Then it figures out what to do with the bio by calling getblocks()
repeatedly - working out where to send it if it's mapped - and splits
the bio as needed.
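
Roughly - again as a sketch rather than the patch itself, assuming the
bio_split() from the block layer series (split the first N sectors off
into a new bio, leaving the remainder in the original) and assuming the
request is fs-block aligned (the old blkfactor = 0 case):

static int dio_map_and_submit(struct inode *inode, struct bio *bio,
			      get_block_t *get_block, int rw, loff_t offset)
{
	unsigned blkbits = inode->i_blkbits;

	while (bio->bi_size) {
		/* Ask the filesystem how much is contiguous at this offset */
		struct buffer_head map_bh = { .b_state = 0, .b_size = bio->bi_size };
		struct bio *split;
		int ret;

		ret = get_block(inode, offset >> blkbits, &map_bh, rw == WRITE);
		if (ret)
			return ret;

		if (!buffer_mapped(&map_bh)) {
			/* Read from a hole: the real code zeroes the user
			 * pages here instead of erroring out. */
			return -EIO;
		}

		if (map_bh.b_size < bio->bi_size) {
			/* Assumed bio_split(): the mapped prefix becomes its
			 * own bio, the remainder stays in 'bio'.  Real code
			 * would use its own bio_set, not fs_bio_set. */
			split = bio_split(bio, map_bh.b_size >> 9,
					  GFP_KERNEL, fs_bio_set);
		} else {
			split = bio;
		}

		split->bi_sector = map_bh.b_blocknr << (blkbits - 9);
		split->bi_bdev   = map_bh.b_bdev;
		submit_bio(rw, split);

		offset += map_bh.b_size;
		if (split == bio)
			break;
	}

	return 0;
}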

This gets rid of a _ton_ of state in struct dio_submit.

It also gets rid of the various differences between async and sync
requests - previously, for async reads it marked the pages dirty before
submitting the io (in process context), then on completion punted to a
workqueue to redirty any pages that needed it. This now happens for
sync reads, too.
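
For the completion side, a sketch of what that punt looks like - the
endio function and the dio fields here are illustrative, but
bio_set_pages_dirty() and bio_check_pages_dirty() are the existing
helpers the aio path already used:

/*
 * Before submission (in process context) the pages get marked dirty
 * with bio_set_pages_dirty().  On completion, bio_check_pages_dirty()
 * releases pages that are still dirty and punts the rest to a
 * workqueue to be redirtied, since set_page_dirty_lock() can sleep
 * and this may run in interrupt context.  With the rewrite this path
 * is shared by sync and async reads alike.
 */
static void dio_read_endio(struct bio *bio, int error)
{
	struct dio *dio = bio->bi_private;

	if (error)
		dio->io_error = error;

	/* ... account completed bytes, complete the dio when the last
	 * bio finishes (elided) ... */

	/* bio_check_pages_dirty() owns the bio (and its pages) from here */
	bio_check_pages_dirty(bio);
}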

There was also some level of throttling for the number of pages pinned -
but this only worked for sync requests. I tossed it out - if we do need
that throttling (which I'm skeptical of, considering it didn't work for
aio), we'll want to come up with something saner.

There's still a fair amount of broken stuff - dio_zero_block() needs to
be reimplemented, and more critically dio_end_io() (and maybe the
filesystem dio completions) do stuff with the biovec - the bio splitting
breaks this, because the bvec is shared between bios.
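
To illustrate what breaks (purely an illustration, not code from the
patch): the existing completion paths effectively do things like the
below, walking the bio's whole bvec array -

/*
 * After a clone-style split both bios point at the same bvec array,
 * so running a loop like this once per split bio touches pages that
 * belong to the other half - double releases, double dirtying, etc.
 * The completion path (and dio_end_io() users) need to be taught
 * about splits, or the bvec can't be shared.
 */
static void dio_release_bio_pages(struct bio *bio)
{
	int i;

	for (i = 0; i < bio->bi_vcnt; i++)
		page_cache_release(bio->bi_io_vec[i].bv_page);
}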
---
 fs/direct-io.c | 1217 +++-
 1 file changed, 312 insertions(+), 905 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8e838b1..eb4479f 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -8,7 +8,7 @@
  * 04Jul2002   Andrew Morton
  * Initial version
  * 11Sep2002   janet...@us.ibm.com
- * added readv/writev support.
+ * added readv/writev support.
  * 29Oct2002   Andrew Morton
  * rewrote bio_add_page() support.
  * 30Oct2002   pbad...@us.ibm.com
@@ -38,184 +38,38 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
 
-/*
- * How many user pages to map in one call to get_user_pages().  This determines
- * the size of a structure in the slab cache
- */
-#define DIO_PAGES  64
-
-/*
- * This code generally works in units of "dio_blocks".  A dio_block is
- * somewhere between the hard sector size and the filesystem block size.  it
- * is determined on a per-invocation basis.   When talking to the filesystem
- * we need to convert dio_blocks to fs_blocks by scaling the dio_block quantity
- * down by dio->blkfactor.  Similarly, fs-blocksize quantities are converted
- * to bio_block quantities by shifting left by blkfactor.
- *
- * If blkfactor is zero then the user's request was aligned to the filesystem's
- * blocksize.
- */
-
 /* dio_state only used in the submission path */
 
 struct dio_submit {
-   struct bio *bio;/* bio under assembly */
-   unsigned blkbits;   /* doesn't change */
-   unsigned blkfactor; /* When we're using an alignment which
-  is finer than the filesystem's soft
-  blocksize, this specifies how much
-  finer.  blkfactor=2 means 1/4-block
-  alignment.  Does not change */
-   unsigned start_zero_done;   /* flag: sub-blocksize zeroing has
-  been performed at the start of a
-  write */
-   int pages_in_io;/* approximate total IO pages */
-   size_t  size;   /* total request size (doesn't change)*/
-   sector_t block_in_file; /* Current offset into the underlying
-  file in dio_block units. */
-   unsigned blocks_available;  /* At block_in_file.  changes */
-   int reap_counter;   /* rate limit reaping */
-   sector_t final_block_in_request;/* doesn't change */
-   unsigned first_block_in_page;   /* doesn't change, Used only once */
-   int boundary;   /* prev block is at a boundary */
-   get_block_t *get_block; /* block mapping function */
-   dio_submit_t *submit_io;/* IO submition function */
-
-   loff_t logical_offset_in_bio;   /* current first logical block in bio */
-   sector_t final_block_in_bio;/* current final block in bio + 1 */
-   sector_t next_block_for_io; /* next block to be put under IO,
-  in dio_blocks units */
-
-   /*
-* Deferred addition of a page to the dio.  
