Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-08 Thread Akira Hayakawa
Mike,

I am happy to see that
people from the filesystem layer down to the block subsystem
have been discussing how each layer handles barriers,
almost independently.

>> Merging the barriers and replacing them with a single FLUSH,
>> while accepting a lot of writes in the meantime,
>> is the reason for deferring barriers in writeboost.
>> If you want to know more, I recommend
>> looking at the source code to see
>> how queue_barrier_io() is used and
>> how the barriers are "kidnapped" in queue_flushing().
> 
> AFAICT, this is an unfortunate hack resulting from dm-writeboost being a
> bio-based DM target.  The block layer already has support for FLUSH
> merging, see commit ae1b1539622fb4 ("block: reimplement FLUSH/FUA to
> support merge")

I have read the comments on this patch.
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae

My understanding is that
REQ_FUA and REQ_FLUSH are decomposed into more primitive flags
according to the properties of the device.
{PRE|POST}FLUSH requests are queued on one of the two flush_queue[] lists
(often called the "pending" queue), and
blk_kick_flush() defers the flushing; later,
once a few conditions are satisfied, it inserts "a single" flush request
no matter how many flush requests sit in the pending queue
(judged simply by !list_empty(pending)).
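
To illustrate my understanding, a toy model (my own simplification,
not the actual blk-flush.c code) of how a single hardware FLUSH can
satisfy any number of pending flush requests:

#include <stdbool.h>
#include <stdio.h>

struct flush_queue {
	int pending;           /* parked flush requests */
	bool flush_in_flight;
};

static void kick_flush(struct flush_queue *fq)
{
	if (fq->flush_in_flight || fq->pending == 0)
		return;
	fq->flush_in_flight = true;
	printf("issue ONE hardware FLUSH for %d pending requests\n",
	       fq->pending);
}

static void flush_done(struct flush_queue *fq)
{
	printf("complete all %d parked requests\n", fq->pending);
	fq->pending = 0;
	fq->flush_in_flight = false;
	kick_flush(fq);        /* pick up anything parked meanwhile */
}

int main(void)
{
	struct flush_queue fq = { 0, false };
	fq.pending = 3;        /* three REQ_FLUSH requests arrive */
	kick_flush(&fq);       /* -> one FLUSH covers all three   */
	flush_done(&fq);
	return 0;
}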

If my understanding is correct,
we are deferring flushes across three layers.

Let me summarize.
- In the filesystem layer, Dave said that metadata journaling defers
  barriers.
- In device-mapper, writeboost, dm-cache, and dm-thin defer
  barriers.
- In the block layer, barriers are deferred and, in the end,
  several requests are merged into one.

I think writeboost cannot drop this deferring hack, because
deferring the barriers is usually very effective:
it makes it more likely that the RAM buffer fills up, which
raises the write throughput and decreases CPU usage.
However, in particular cases such as the one Dave pointed out,
this hack is just a disturbance.
Even for writeboost, the hack in that patch
is unfortunately just a disturbance too.
That an upper layer dislikes a lower layer's hidden optimization
is simply a limitation of the layered architecture of the Linux kernel.

I think these three layers are all thinking almost the same thing:
these hacks are all beneficial, and each layer
providing a switch to turn its optimization on/off
is what we have to do as a compromise.

All these problems originate from the fact that
we have a volatile cache; persistent memory can
take them away.

With persistent memory provided,
writeboost can switch off the deferring of barriers.
However,
a world where all servers are equipped with
persistent memory is still a tale of the future.
So my idea is to maintain both modes
for the RAM buffer type (volatile, non-volatile),
and for the volatile type
the deferring hack is a good compromise.

Akira


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-08 Thread Akira Hayakawa
Dave,

> i.e. there's no point justifying a behaviour with "we could do this
> in future so lets ignore the impact on current users"...
Sure, I am happy if we can find a solution that
is good for both of us, or in other words for both the filesystem and block layers.

> e.g. what happens if a user has a mixed workload - one where
> performance benefits are only seen by delaying FUA, and another that
> is seriously slowed down by delaying FUA requests?  This is where
> knobs are problematic
You are right.
But there is no perfect solution that satisfies everyone,
and dealing with every requirement will only complicate the code.
Stepping away from the user and
focusing on the filesystem-block boundary,
>> Maybe writeboost should disable deferring barriers
>> if the barrier_deadline_ms parameter is set to 0.
adding a switch that lets the mounted filesystem decide on/off
is, I believe, a simple but effective solution.

Deciding on a per-bio basis instead of per-device could be another solution.
I would be happy if I could check whether a bio "may or may not defer the barrier".

Akira


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-08 Thread Akira Hayakawa
Christoph,

> You can detect O_DIRECT writes by second-guessing a special combination
> of REQ_ flags only used there, as cfq tries to treat it specially:
> 
> #define WRITE_SYNC  (WRITE | REQ_SYNC | REQ_NOIDLE)
> #define WRITE_ODIRECT   (WRITE | REQ_SYNC)
> 
> the lack of REQ_NOIDLE when REQ_SYNC is set gives it away.  Not related
> to the FLUSH or FUA flags in any way, though.
Thanks.
But our problem is to detect whether a bio may or may not be deferred.
Is REQ_NOIDLE the flag to check for that?

> Akira, can you explain the workloads where your delay of FLUSH or FUA
> requests helps you in any way?  I very much agree with Dave's reasoning,
> but if you found workloads where your hack helps we should make sure we
> fix them at the place where they are issued.
One example is a file server accessed by multiple users.
A barrier is submitted when a user closes a file, for example.

As I said in my previous post
(https://lkml.org/lkml/2013/10/4/186),
writeboost has a RAM buffer, and we want it to be
filled with writes and then flushed to the cache device,
which takes all the barriers away upon completion.
In that case we pay the minimum penalty for the barriers.
Interestingly, writeboost is happy with a lot of writes.

By deferring these barriers (FLUSH and FUA),
multiple barriers are likely to be merged on a RAM buffer
and then processed by replacing them with only one FLUSH.

Merging the barriers and replacing them with a single FLUSH,
while accepting a lot of writes in the meantime,
is the reason for deferring barriers in writeboost.
If you want to know more, I recommend
looking at the source code to see
how queue_barrier_io() is used and
how the barriers are "kidnapped" in queue_flushing().
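
For illustration, a toy model of this behaviour (the *_model names are
mine; the real logic lives in queue_barrier_io() and queue_flushing()):

#include <stdio.h>

struct wb_model {
	int buffered_writes;   /* writes accumulated in the RAM buffer */
	int parked_barriers;   /* deferred REQ_FLUSH/REQ_FUA bios      */
};

/* defer a barrier bio: park it instead of flushing right away */
static void queue_barrier_io_model(struct wb_model *wb)
{
	wb->parked_barriers++;
}

/* flush the RAM buffer: one FLUSH to the cache device acks them all */
static void queue_flushing_model(struct wb_model *wb)
{
	printf("write %d buffered writes + 1 FLUSH -> ack %d barriers\n",
	       wb->buffered_writes, wb->parked_barriers);
	wb->buffered_writes = 0;
	wb->parked_barriers = 0;
}

int main(void)
{
	struct wb_model wb = { 100, 0 };       /* many users keep writing */
	for (int i = 0; i < 4; i++)
		queue_barrier_io_model(&wb);   /* e.g. four file closes   */
	queue_flushing_model(&wb);             /* 4 barriers, 1 FLUSH     */
	return 0;
}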

Akira


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-08 Thread Christoph Hellwig
On Tue, Oct 08, 2013 at 10:43:07AM +1100, Dave Chinner wrote:
> > Maybe writeboost should disable deferring barriers
> > if the barrier_deadline_ms parameter is set to 0.
> > The Linux kernel's layered architecture is obviously not always perfect,
> > so there are similar cases at other boundaries,
> > such as O_DIRECT bypassing the page cache.
> 
> Right - but you can't detect O_DIRECT at the dm layer. IOWs, you're
> relying on the user tweaking the correct knobs for their workload.

You can detect O_DIRECT writes by second-guessing a special combination
of REQ_ flags only used there, as cfq tries to treat it specially:

#define WRITE_SYNC  (WRITE | REQ_SYNC | REQ_NOIDLE)
#define WRITE_ODIRECT   (WRITE | REQ_SYNC)

the lack of REQ_NOIDLE when REQ_SYNC is set gives it away.  Not related
to the FLUSH or FUA flags in any way, though.
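
For illustration, a minimal userspace sketch of that check (the flag
values below are stand-ins; the real definitions live in the kernel
headers):

#include <stdbool.h>
#include <stdio.h>

#define WRITE       (1u << 0)
#define REQ_SYNC    (1u << 1)
#define REQ_NOIDLE  (1u << 2)

#define WRITE_SYNC     (WRITE | REQ_SYNC | REQ_NOIDLE)
#define WRITE_ODIRECT  (WRITE | REQ_SYNC)

/* REQ_SYNC set but REQ_NOIDLE absent is the O_DIRECT tell */
static bool looks_like_odirect(unsigned int rw)
{
	return (rw & WRITE) && (rw & REQ_SYNC) && !(rw & REQ_NOIDLE);
}

int main(void)
{
	printf("WRITE_ODIRECT -> %d\n", looks_like_odirect(WRITE_ODIRECT)); /* 1 */
	printf("WRITE_SYNC    -> %d\n", looks_like_odirect(WRITE_SYNC));    /* 0 */
	return 0;
}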

Akira, can you explain the workloads where your delay of FLUSH or FUA
requests helps you in any way?  I very much agree with Dave's reasoning,
but if you found workloads where your hack helps we should make sure we
fix them at the place where they are issued.


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-07 Thread Dave Chinner
On Sat, Oct 05, 2013 at 04:51:16PM +0900, Akira Hayakawa wrote:
> Dave,
> 
> > That's where arbitrary delays in the storage stack below XFS cause
> > problems - if the first FUA log write is delayed, the next log
> > buffer will get filled, issued and delayed, and when we run out of
> > log buffers (there are 8 maximum) the entire log subsystem will
> > stall, stopping *all* log commit operations until log buffer
> > IOs complete and become free again. i.e. it can stall modifications
> > across the entire filesystem while we wait for batch timeouts to
> > expire and issue and complete FUA requests.
> To me, this sounds like a design failure in the XFS log subsystem.

If you say so. As it is, XFS is the best of all the linux
filesystems when it comes to performance under a heavy fsync
workload, so if you consider it broken by design then you've got a
horror show waiting for you on any other filesystem...

> Or just a limitation of metadata journaling.

It's a recovery limitation - the more uncompleted log buffers we
have outstanding, the more space in the log will be considered
unrecoverable during a crash...

> > IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
> > point where they are issued - any attempt to further optimise them
> > by adding delays down in the stack to aggregate FUA operations will
> > only increase latency of the operations that the issuer want to have
> > complete as fast as possible
> That a lower layer of the stack attempts to optimize further
> can benefit any filesystem.
> So your opinion is not always correct, although
> it would always be correct in error handling or memory management.
> 
> I have proposed a future plan of using persistent memory.
> I believe that with this leap forward
> filesystems will be free from doing such optimizations
> related to write barriers. For more detail, please see my post:
> https://lkml.org/lkml/2013/10/4/186

Sure, we already do that in the storage stack to minimise the impact
of FUA operations - it's called a non-volatile write cache, and most
RAID controllers have them. They rely on immediate dispatch of FUA
operations to get them into the write cache as quickly as possible
(i.e. what filesystems do right now), and that is something your
proposed behaviour will prevent.

i.e. there's no point justifying a behaviour with "we could do this
in future so lets ignore the impact on current users"...

> However,
> I think I should leave an option to disable the optimization
> in case the upper layer doesn't like it.
> Maybe writeboost should disable deferring barriers
> if the barrier_deadline_ms parameter is set to 0.
> The Linux kernel's layered architecture is obviously not always perfect,
> so there are similar cases at other boundaries,
> such as O_DIRECT bypassing the page cache.

Right - but you can't detect O_DIRECT at the dm layer. IOWs, you're
relying on the user tweaking the correct knobs for their workload.

e.g. what happens if a user has a mixed workload - one where
performance benefits are only seen by delaying FUA, and another that
is seriously slowed down by delaying FUA requests?  This is where
knobs are problematic

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-05 Thread Akira Hayakawa
Dave,

> That's where arbitrary delays in the storage stack below XFS cause
> problems - if the first FUA log write is delayed, the next log
> buffer will get filled, issued and delayed, and when we run out of
> log buffers (there are 8 maximum) the entire log subsystem will
> stall, stopping *all* log commit operations until log buffer
> IOs complete and become free again. i.e. it can stall modifications
> across the entire filesystem while we wait for batch timeouts to
> expire and issue and complete FUA requests.
To me, this sounds like a design failure in the XFS log subsystem.
Or just a limitation of metadata journaling.

> IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
> point where they are issued - any attempt to further optimise them
> by adding delays down in the stack to aggregate FUA operations will
> only increase latency of the operations that the issuer want to have
> complete as fast as possible
That a lower layer of the stack attempts to optimize further
can benefit any filesystem.
So your opinion is not always correct, although
it would always be correct in error handling or memory management.

I have proposed a future plan of using persistent memory.
I believe that with this leap forward
filesystems will be free from doing such optimizations
related to write barriers. For more detail, please see my post:
https://lkml.org/lkml/2013/10/4/186

However,
I think I should leave an option to disable the optimization
in case the upper layer doesn't like it.
Maybe writeboost should disable deferring barriers
if the barrier_deadline_ms parameter is set to 0.
The Linux kernel's layered architecture is obviously not always perfect,
so there are similar cases at other boundaries,
such as O_DIRECT bypassing the page cache.
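
For illustration, a toy sketch of the proposed switch (the names here
are mine, illustrative only):

#include <stdio.h>

struct wb_model {
	unsigned barrier_deadline_ms;   /* 0 = never defer */
};

static void process_barrier(struct wb_model *wb)
{
	if (!wb->barrier_deadline_ms) {
		printf("deadline 0: issue the barrier immediately\n");
		return;
	}
	printf("defer; force a flush within %u ms at the worst\n",
	       wb->barrier_deadline_ms);
}

int main(void)
{
	struct wb_model off = { 0 }, on = { 3 };
	process_barrier(&off);
	process_barrier(&on);
	return 0;
}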

Maybe dm-thin and dm-cache should add such a switch too.

Akira


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-03 Thread Dave Chinner
On Wed, Oct 02, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 1 Oct 2013, Joe Thornber wrote:
> 
> > > Alternatively, delaying them will stall the filesystem because it's
> > > waiting for said REQ_FUA IO to complete. For example, journal writes
> > > in XFS are extremely IO latency sensitive in workloads that have a
> > > significant number of ordering constraints (e.g. O_SYNC writes,
> > > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> > > filesystem for the majority of that barrier_deadline_ms.
> > 
> > Yes, this is a valid concern, but I assume Akira has benchmarked.
> > With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
> > see if there are any other FUA requests on my queue that can be
> > aggregated into a single flush.  I agree with you that the target
> > should never delay waiting for new io; that's asking for trouble.
> > 
> > - Joe
> 
> You could send the first REQ_FUA/REQ_FLUSH request directly to the disk 
> and aggregate all the requests that were received while you processed the 
> initial request. This way, you can do request batching without introducing 
> artificial delays.

Yes, that's what XFS does with its log when lots of fsync requests
come in. i.e. the first is dispatched immediately, and the others
are gathered into the next log buffer until it is either full or the
original REQ_FUA log write completes.

That's where arbitrary delays in the storage stack below XFS cause
problems - if the first FUA log write is delayed, the next log
buffer will get filled, issued and delayed, and when we run out of
log buffers (there are 8 maximum) the entire log subsystem will
stall, stopping *all* log commit operations until log buffer
IOs complete and become free again. i.e. it can stall modifications
across the entire filesystem while we wait for batch timeouts to
expire and issue and complete FUA requests.

IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
point where they are issued - any attempt to further optimise them
by adding delays down in the stack to aggregate FUA operations will
only increase latency of the operations that the issuer want to have
complete as fast as possible

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-02 Thread Mikulas Patocka


On Tue, 1 Oct 2013, Joe Thornber wrote:

> > Alternatively, delaying them will stall the filesystem because it's
> > waiting for said REQ_FUA IO to complete. For example, journal writes
> > in XFS are extremely IO latency sensitive in workloads that have a
> > significant number of ordering constraints (e.g. O_SYNC writes,
> > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> > filesystem for the majority of that barrier_deadline_ms.
> 
> Yes, this is a valid concern, but I assume Akira has benchmarked.
> With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
> see if there are any other FUA requests on my queue that can be
> aggregated into a single flush.  I agree with you that the target
> should never delay waiting for new io; that's asking for trouble.
> 
> - Joe

You could send the first REQ_FUA/REQ_FLUSH request directly to the disk 
and aggregate all the requests that were received while you processed the 
initial request. This way, you can do request batching without introducing 
artificial delays.
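
For illustration, a toy model of this dispatch-first batching (the
names are mine):

#include <stdbool.h>
#include <stdio.h>

struct batcher {
	bool in_flight;   /* a flush is currently on its way to disk */
	int batched;      /* flushes that arrived while it was in flight */
};

static void flush_request(struct batcher *b)
{
	if (!b->in_flight) {
		b->in_flight = true;
		printf("dispatch flush immediately\n");
	} else {
		b->batched++;        /* park behind the in-flight flush */
	}
}

static void flush_complete(struct batcher *b)
{
	b->in_flight = false;
	if (b->batched) {
		printf("one follow-up flush covers %d batched requests\n",
		       b->batched);
		b->batched = 0;
		b->in_flight = true; /* the follow-up flush is now in flight */
	}
}

int main(void)
{
	struct batcher b = { false, 0 };
	flush_request(&b);   /* goes straight to disk                   */
	flush_request(&b);   /* arrives while first is in flight: batched */
	flush_request(&b);
	flush_complete(&b);  /* first done -> one flush for the two batched */
	return 0;
}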

Mikulas


Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-10-01 Thread Joe Thornber
On Thu, Sep 26, 2013 at 01:43:25PM +1000, Dave Chinner wrote:
> On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> > * Deferring ACK for barrier writes
> > Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> > Immediately handling these bios badly slows down writeboost.
> > It surveils the bios with these flags and forcefully flushes them
> > in the worst case within the `barrier_deadline_ms` period.
> 
> That rings alarm bells.
> 
> If the filesystem is using REQ_FUA/REQ_FLUSH for ordering reasons,
> delaying them to allow other IOs to be submitted and dispatched may
> very well violate the IO ordering constraints the filesystem is
> trying to achieve.

If the fs is using REQ_FUA for ordering, it needs to wait for
completion of that bio before issuing any subsequent bio that needs to
be strictly ordered.  So I don't think there is any issue here.

> Alternatively, delaying them will stall the filesystem because it's
> waiting for said REQ_FUA IO to complete. For example, journal writes
> in XFS are extremely IO latency sensitive in workloads that have a
> significant number of ordering constraints (e.g. O_SYNC writes,
> fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> filesystem for the majority of that barrier_deadline_ms.

Yes, this is a valid concern, but I assume Akira has benchmarked this.
With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
see if there are any other FUA requests on my queue that can be
aggregated into a single flush.  I agree with you that the target
should never delay waiting for new io; that's asking for trouble.

- Joe


Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-28 Thread Akira Hayakawa
Hi,

Two major items of progress:
1) .ctr accepts the segment size, so .ctr now accepts 3 arguments:
   <backing dev> <cache dev> <segment size order>.
2) Folded the small split files, as suggested in the previous progress
report.

For 1)
I use a zero-length array to accept the segment size dynamically.
Previously, writeboost had the parameter embedded, and one had to
re-compile the code to change it, which badly hurt usability.
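
For illustration, a minimal userspace sketch of the technique (the
*_model names are mine, not the actual dm-writeboost layout):

#include <stdio.h>
#include <stdlib.h>

struct metablock { unsigned idx; };

struct segment_header_model {
	unsigned nr_caches_inseg;       /* from the segment size argument */
	struct metablock mb_array[];    /* zero-length / flexible array   */
};

static struct segment_header_model *alloc_seg(unsigned nr)
{
	struct segment_header_model *seg =
		malloc(sizeof(*seg) + nr * sizeof(struct metablock));
	if (seg)
		seg->nr_caches_inseg = nr;
	return seg;
}

int main(void)
{
	struct segment_header_model *seg = alloc_seg(127);
	if (!seg)
		return 1;
	printf("allocated a segment with %u metablocks\n",
	       seg->nr_caches_inseg);
	free(seg);
	return 0;
}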

For 2)
> Unfortunately I think you went too far with all these different small
> files, I was hoping to see 2 or 3 .c files and a couple .h files.
> 
> Maybe fold all the daemon code into a 1 .c and 1 .h ?
> 
> The core of the writeboost target in dm-writeboost-target.c ?
> 
> And fold all the other data structures into a 1 .c and 1 .h ?
> 
> When folding these files together feel free to use dividers in the code
> like dm-thin.c and dm-cache-target.c do, e.g.:
> 
> /*----------------------------------------------------------------*/
As Mike pointed out, splitting into almost 20 files went too far.
I aggregated these files into 3 .c files and 3 .h files in total, which
are shown below.

-- Summary --
39 dm-writeboost-daemon.h
46 dm-writeboost-metadata.h
413 dm-writeboost.h
577 dm-writeboost-daemon.c
1129 dm-writeboost-metadata.c
1212 dm-writeboost-target.c
81 dm-writeboost.mod.c

The responsibility of each .c file
is the policy behind this split.

a) dm-writeboost-metadata.c
This file knows how the metadata are laid out on the cache device.
It can audit/format the cache device metadata
and resume/free the in-core metadata from what is on the cache device.
It also provides accessors to the resumed in-core metadata.

b) dm-writeboost-target.c
This file contains all the methods that define the target type.
In terms of I/O processing, this file only covers the span
from when a bio is accepted to when a flush job is queued,
which is described as "foreground processing" in the document.
What happens after the job is queued is defined in the -daemon.c file.

c) dm-writeboost-daemon.c
This file contains all the daemons, as Mike suggested.
Maybe superblock_recorder should be in the -metadata.c file,
but I chose to put it in this file for unity.

Thanks,
Akira


The current .h files follow.

-- dm-writeboost-daemon.h --
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.w...@gmail.com>
 *
 * This file is released under the GPL.
 */

#ifndef DM_WRITEBOOST_DAEMON_H
#define DM_WRITEBOOST_DAEMON_H

/**/

void flush_proc(struct work_struct *);

/**/

void queue_barrier_io(struct wb_cache *, struct bio *);
void barrier_deadline_proc(unsigned long data);
void flush_barrier_ios(struct work_struct *);

/**/

void migrate_proc(struct work_struct *);
void wait_for_migration(struct wb_cache *, u64 id);

/**/

void modulator_proc(struct work_struct *);

/**/

void sync_proc(struct work_struct *);

/**/

void recorder_proc(struct work_struct *);

/**/

#endif

-- dm-writeboost-metadata.h --
/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.w...@gmail.com>
 *
 * This file is released under the GPL.
 */

#ifndef DM_WRITEBOOST_METADATA_H
#define DM_WRITEBOOST_METADATA_H

/**/

struct segment_header *get_segment_header_by_id(struct wb_cache *,
						u64 segment_id);
sector_t calc_mb_start_sector(struct wb_cache *, struct segment_header *,
			      cache_nr mb_idx);
bool is_on_buffer(struct wb_cache *, cache_nr mb_idx);

/**/

struct ht_head *ht_get_head(struct wb_cache *, struct lookup_key *);
struct metablock *ht_lookup(struct wb_cache *,
struct ht_head *, struct lookup_key *);
void ht_register(struct wb_cache *, struct ht_head *,
 struct lookup_key *, struct metablock *);
void ht_del(struct wb_cache *, struct metablock *);
void discard_caches_inseg(struct wb_cache *,
  struct segment_header *);

/**/

int __must_check audit_cache_device(struct dm_dev *, struct wb_cache *,
bool *need_format, bool *allow_format);
int __must_check format_cache_device(struct dm_dev *, struct wb_cache *);

/**/

void prepare_segment_header_device(struct segment_header_device *dest,
   struct wb_cache *,
   struct segment_header *src);

/**/

int __must_check 


Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-27 Thread Mike Snitzer
On Wed, Sep 25 2013 at  9:47pm -0400,
Akira Hayakawa <ruby.w...@gmail.com> wrote:

> Hi, Mike
> 
> The monolithic source code (3.2k lines)
> is nicely split into almost 20 *.c files
> according to the functionality and
> data structures, in OOP style.
> 
> The aim of this posting
> is to share what the splitting looks like.
> 
> I believe that
> at least reading the *.h files
> can convince you the splitting is clear.
> 
> The code is now tainted with
> almost 20 version switch macros
> and WB* debug macros
> but I will clean them up
> before sending the patch.
> 
> Again,
> the latest code can be cloned by
> git clone https://github.com/akiradeveloper/dm-writeboost.git
> 
> I will make a few updates to the source code this weekend,
> so please track the repository to follow the latest version.
> Below is only the snapshot.
> 
> Akira
> 
> -- Summary --
> 33 Makefile
> 10 bigarray.h
> 19 cache-alloc.h
> 10 defer-barrier.h
> 8 dirty-sync.h
> 8 flush-daemon.h
> 10 format-cache.h
> 24 handle-io.h
> 16 hashtable.h
> 18 migrate-daemon.h
> 7 migrate-modulator.h
> 12 queue-flush-job.h
> 8 rambuf.h
> 13 recover.h
> 18 segment.h
> 8 superblock-recorder.h
> 9 target.h
> 30 util.h
> 384 writeboost.h
> 99 bigarray.c
> 192 cache-alloc.c
> 36 defer-barrier.c
> 33 dirty-sync.c
> 85 flush-daemon.c
> 234 format-cache.c
> 553 handle-io.c
> 109 hashtable.c
> 345 migrate-daemon.c
> 41 migrate-modulator.c
> 169 queue-flush-job.c
> 52 rambuf.c
> 308 recover.c
> 118 segment.c
> 61 superblock-recorder.c
> 376 target.c
> 126 util.c

Unfortunately I think you went too far with all these different small
files, I was hoping to see 2 or 3 .c files and a couple .h files.

Maybe fold all the daemon code into a 1 .c and 1 .h ?

The core of the writeboost target in dm-writeboost-target.c ?

And fold all the other data structures into a 1 .c and 1 .h ?

When folding these files together feel free to use dividers in the code
like dm-thin.c and dm-cache-target.c do, e.g.:

/*----------------------------------------------------------------*/

Mike


Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-25 Thread Dave Chinner
On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> * Deferring ACK for barrier writes
> Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> Immediately handling these bios badly slows down writeboost.
> It surveils the bios with these flags and forcefully flushes them
> in the worst case within the `barrier_deadline_ms` period.

That rings alarm bells.

If the filesystem is using REQ_FUA/REQ_FLUSH for ordering reasons,
delaying them to allow other IOs to be submitted and dispatched may
very well violate the IO ordering constraints the filesystem is
trying to achieve.

Alternatively, delaying them will stall the filesystem because it's
waiting for said REQ_FUA IO to complete. For example, journal writes
in XFS are extremely IO latency sensitive in workloads that have a
significant number of ordering constraints (e.g. O_SYNC writes,
fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
filesystem for the majority of that barrier_deadline_ms.

i.e. this says to me that the best performance you can get from such
workloads is one synchronous operation per process per
barrier_deadline_ms, even when the storage and filesystem might be
capable of executing hundreds of synchronous operations per
barrier_deadline_ms...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-25 Thread Akira Hayakawa
Hi, Mike

I made more progress yesterday:
splitting the monolithic source code into
meaningful pieces is done.
The result will follow in the next mail.

> Yes, please share your plan.  Anything that can simplify the code layout
> is best done earlier to simplfy code review.
Sorry, this should have been done at an earlier stage.

First, I reply to each of your comments.

> OK, but the thing is upper level consumers in the IO stack, like ext4,
> expect that when the REQ_FLUSH completes that the device has in fact
> flushed any transient state in memory.  So I'm not seeing how handling
> these lazily is an option.  Though I do appreciate that dm-cache (and
> dm-thin) do take similar approaches.  Would like to get Joe Thornber's
> insight here.
When the upper-level consumers receive
the completion of a bio sent with REQ_FLUSH,
all the transient state is persistent.
writeboost does four steps to accomplish this:
1. Queue the flush job with the current transient state (RAM buffer).
2. Wait for the completion of the flush job being written to the cache device.
3. blkdev_issue_flush() to the cache device to make all the writes persistent.
4. bio_endio() the flagged bios in question.

If the implementation isn't wrong,
I believe it works as the consumers expect.
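
For illustration, a toy model of the four steps (field and function
names are mine; the real code uses a flush job queue,
blkdev_issue_flush(), and bio_endio()):

#include <stdio.h>

struct model {
	int in_rambuf;      /* transient state in the RAM buffer            */
	int on_device;      /* written to the cache device (maybe volatile) */
	int persistent;     /* guaranteed durable                           */
};

static void handle_flush_bio(struct model *m)
{
	int job = m->in_rambuf;          /* 1. queue the flush job         */
	m->in_rambuf = 0;
	m->on_device += job;             /* 2. wait: job hits cache device */
	m->persistent += m->on_device;   /* 3. device flush -> persistent  */
	m->on_device = 0;
	printf("4. ack bio: %d writes now persistent\n", m->persistent);
}

int main(void)
{
	struct model m = { 5, 0, 0 };    /* five writes sit in the buffer */
	handle_flush_bio(&m);            /* a REQ_FLUSH bio arrives       */
	return 0;
}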


> These seem reasonable to me.  Will need to have a look at thread naming
> to make sure the names reflect they are part of a dm-writeboost service.
I changed the former "Cache Synchronizer" to "Dirty Synchronizer",
but it still sounds a little odd.
Naming is truly difficult.


> You don't allow user to specify the "segment size"?  I'd expect tuning
> that could be important based on the underlying storage capabilities
> (e.g. having the segment size match that of the SSD's erase block or
> matching the backing device's full stripe width?).  So something similar
> to what we have in dm-cache's blocksize.
For the current implementation, no.
The segment size is hard-coded in the source code, and
one has to re-compile the module to change it.

But hard-coding the size has a reasonable rationale,
for both performance and simplification.

Please look at the code fragment from the .map method, which does:
(1) writeboost first checks hit/miss and gets the metablock (mb).
(2) It then has to get the segment_header that "logically" contains the
metablock.

mb = ht_lookup(cache, head, &key); // (1)
if (mb) {
	seg = ((void *) mb) - (mb->idx % NR_CACHES_INSEG) * // (2)
	      sizeof(struct metablock);
	atomic_inc(&seg->nr_inflight_ios);
}

#define NR_CACHES_INSEG ((1 << (WB_SEGMENTSIZE_ORDER - 3)) - 1)

(3)
struct segment_header {
struct metablock mb_array[NR_CACHES_INSEG];

In the current implementation
I place metablocks especially "physically" in the segment header (3)
so calculation of the segment header containing the metablock
is a just a simple address calculation which performs good.
Since writeboost focuses on the peak write performance
the light-weighted lookup is the lifeline.

If I re-design writeboost to accept segment size in .ctr
this technique will be impossible
since knowing NR_CACHES_INSEG before accepting it is impossible.

It is just a matter of tradeoff.
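
To make the numbers concrete, a worked example (the value 11 for
WB_SEGMENTSIZE_ORDER is only an assumption for illustration):

/*
 * With WB_SEGMENTSIZE_ORDER = 11 and 512-byte sectors, a segment is
 * 2^11 sectors = 1MB, i.e. 256 blocks of 4KB (2^3 sectors) each:
 *
 *	NR_CACHES_INSEG = (1 << (11 - 3)) - 1 = 255
 *
 * The "- 1" leaves one 4KB block of the segment for its metadata,
 * hence mb_array[NR_CACHES_INSEG] in struct segment_header above.
 */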

But having purged cache-sharing probably gives me
a chance at another technique that does the same thing
with reasonable overhead and code complexity.
I will try to think of it.
I know that forcing ordinary users
to re-compile the kernel sounds harsh.


> I'll look at the code but it strikes me as odd that the first sector of
> the cache device is checked yet the last sector of the first MB of the
> cache is where the superblock resides.  I'd think you'd want to have the
> check on whether to format or not to be the same location as the
> superblock?
The first sector of the first 1MB is called the Superblock Header and
the last sector of the first 1MB is called the Superblock Record.
The former contains information fixed at initialization and
the latter contains information updated at runtime
by the Superblock Recorder daemon.

The latter is also checked in the initialization step.
The logic is in recover_cache().
If it contains an up-to-date `last_migrated_segment_id`,
recover_cache() takes less time.
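
To sketch that first 1MB (assuming 512-byte sectors; what, if anything,
lives between the two superblock sectors is not described here):

sector 0       : Superblock Header - fixed at initialization
sectors 1-2046 : (not covered in this mail)
sector 2047    : Superblock Record - e.g. `last_migrated_segment_id`,
                 updated at runtime by the Superblock Recorder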


> So this "<16 stat info (r/w)", is that like /proc/diskstats ?  Are you
> aware that dm-stats exists now and can be used instead of needing to
> tracking these stats in dm-writeboost?
Sort of.
But the difference is that
this information is about
how a bio went through the paths inside writeboost.
They are like "read hits", "read misses", ... in the dm-cache status.
So I don't think I need to discard it.

I read through the statistics document
https://lwn.net/Articles/566273/
and I understand that dm-stats only surveils the
external I/O statistics,
not the internal conditional branches in detail.


> Whatever name you come up with, please add a "dm_" prefix.
Add the dm_ prefix only to structs, or
to all filenames and function names as well?
If so, it needs a really big fix.

Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-25 Thread Greg KH
On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> Hi, Mike
> 
> I am now working on the redesign and implementation
> of dm-writeboost.

Ok, I'm dropping your original patch, please resend when you have
something you want merged into drivers/staging/

thanks,

greg k-h


Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-25 Thread Mike Snitzer
On Tue, Sep 24 2013 at  8:20am -0400,
Akira Hayakawa  wrote:

> Hi, Mike
> 
> I am now working on the redesign and implementation
> of dm-writeboost.
> 
> This is a progress report. 
> 
> Please run
> git clone https://github.com/akiradeveloper/dm-writeboost.git 
> to see the full set of the code.

I likely won't be able to look closely at the code until Monday (9/30);
I have some higher priority reviews and issues to take care of this
week.

But I'm very encouraged by what you've shared below; looks like things
are moving in the right direction.  Great job.

> * 1. Current Status
> writeboost in the new design passed my test.
> Documentation is ongoing.
> 
> * 2. Big Changes 
> - Cache-sharing purged.
> - All sysfs purged.
> - All userland tools in Python purged.
> -- dmsetup is the only user interface now.
> - The userland daemon is ported to the kernel.
> - On-disk metadata is in little endian.
> - 300 lines of code shed in the kernel.
> -- The Python scripts were 500 LOC, so 800 LOC shed in total.
> -- It is now about 3.2k LOC, all in the kernel.
> - Comments are added neatly.
> - Code reordered so that it reads better.
> 
> * 3. Documentation in Draft
> This is the current draft that will go under Documentation/device-mapper
> 
> dm-writeboost
> =============
> The writeboost target provides log-structured caching.
> It batches random writes into a big sequential write to a cache device.
> 
> It is like dm-cache, but the difference is
> that writeboost focuses on handling bursty writes and the lifetime of the
> SSD cache device.
> 
> Auxiliary PDF documents and Quick-start scripts are available in
> https://github.com/akiradeveloper/dm-writeboost
> 
> Design
> ======
> There is a foreground path and there are 6 background daemons.
> 
> Foreground
> ----------
> It accepts bios and puts writes into the RAM buffer.
> When the buffer is full, it creates a "flush job" and queues it.
> 
> Background
> ----------
> * Flush Daemon
> Pops a flush job from the queue and executes it.
> 
> * Deferring ACK for barrier writes
> Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> Immediately handling these bios badly slows down writeboost.
> It surveils the bios with these flags and forcefully flushes them
> at worst within the `barrier_deadline_ms` period.

OK, but the thing is upper level consumers in the IO stack, like ext4,
> expect that when the REQ_FLUSH completes, the device has in fact
flushed any transient state in memory.  So I'm not seeing how handling
these lazily is an option.  Though I do appreciate that dm-cache (and
dm-thin) do take similar approaches.  Would like to get Joe Thornber's
insight here.

> * Migration Daemon
> It migrates, i.e. writes back to the backing store,
> the data on the cache device at segment granularity.
> 
> If `allow_migrate` is true, it migrates even without an impending situation.
> An impending situation is when there is no room in the cache device
> for writing further flush jobs.
> 
> Migration is done in batches of at most `nr_max_batched_migration`
> segments.
> Therefore, unlike an existing I/O scheduler,
> two dirty writes distant in time can be merged.
> 
> * Migration Modulator
> Migration while the backing store is heavily loaded
> grows the device queue and thus makes the situation ever worse.
> This daemon modulates the migration by switching `allow_migrate`.
> 
> * Superblock Recorder
> The superblock record is the last sector of the first 1MB region of the
> cache device. It contains the id of the segment lastly migrated.
> This daemon periodically updates the region every `update_record_interval` 
> seconds.
> 
> * Cache Synchronizer
> This daemon forcefully makes all the dirty writes persistent
> every `sync_interval` seconds.
> Since writeboost correctly implements the bio semantics,
> forcefully writing out the dirties outside the main path is needless.
> However, some users want to be on the safe side by enabling this.

These seem reasonable to me.  Will need to have a look at thread naming
to make sure the names reflect they are part of a dm-writeboost service.

> Target Interface
> ================
> All the operations are via the dmsetup command.
> 
> Constructor
> -----------
> writeboost <backing dev> <cache dev>
> 
> backing dev : slow device holding original data blocks.
> cache dev   : fast device holding cached data and its metadata.

You don't allow user to specify the "segment size"?  I'd expect tuning
that could be important based on the underlying storage capabilities
(e.g. having the segment size match that of the SSD's erase block or
matching the backing device's full stripe width?).  So something similar
to what we have in dm-cache's blocksize.

> Note that the cache device is re-formatted
> if the first sector of the cache device is zeroed out.

I'll look at the code but it strikes me as odd that the first sector of
the cache device is checked yet the last sector of the first MB of the
cache is where the superblock resides.  I'd think you'd want to have the
check on whether to format or not to be the same location as the
superblock?

Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-24 Thread Akira Hayakawa
Hi, Mike

I am now working on the redesign and implementation
of dm-writeboost.

This is a progress report. 

Please run
git clone https://github.com/akiradeveloper/dm-writeboost.git 
to see the full set of the code.

* 1. Current Status
writeboost in the new design passed my test.
Documentation is ongoing.

* 2. Big Changes 
- Cache-sharing purged.
- All sysfs purged.
- All userland tools in Python purged.
-- dmsetup is the only user interface now.
- The userland daemon is ported to the kernel.
- On-disk metadata is in little endian.
- 300 lines of code shed in the kernel.
-- The Python scripts were 500 LOC, so 800 LOC shed in total.
-- It is now about 3.2k LOC, all in the kernel.
- Comments are added neatly.
- Code reordered so that it reads better.

* 3. Documentation in Draft
This is the current draft that will go under Documentation/device-mapper

dm-writeboost
=============
The writeboost target provides log-structured caching.
It batches random writes into a big sequential write to a cache device.

It is like dm-cache, but the difference is
that writeboost focuses on handling bursty writes and the lifetime of the
SSD cache device.

Auxiliary PDF documents and Quick-start scripts are available in
https://github.com/akiradeveloper/dm-writeboost

Design
======
There is a foreground path and there are 6 background daemons.

Foreground
----------
It accepts bios and puts writes into the RAM buffer.
When the buffer is full, it creates a "flush job" and queues it.

Background
----------
* Flush Daemon
Pops a flush job from the queue and executes it.

* Deferring ACK for barrier writes
Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
Immediately handling these bios badly slows down writeboost.
It surveils the bios with these flags and forcefully flushes them
at worst within the `barrier_deadline_ms` period.

* Migration Daemon
It migrates, i.e. writes back to the backing store,
the data on the cache device at segment granularity.

If `allow_migrate` is true, it migrates even without an impending situation.
An impending situation is when there is no room in the cache device
for writing further flush jobs.

Migration is done in batches of at most `nr_max_batched_migration`
segments.
Therefore, unlike an existing I/O scheduler,
two dirty writes distant in time can be merged.

* Migration Modulator
Migration while the backing store is heavily loaded
grows the device queue and thus makes the situation ever worse.
This daemon modulates the migration by switching `allow_migrate`.

* Superblock Recorder
The superblock record is the last sector of the first 1MB region of the
cache device. It contains the id of the segment lastly migrated.
This daemon periodically updates the region every `update_record_interval` 
seconds.

* Cache Synchronizer
This daemon forcefully makes all the dirty writes persistent
every `sync_interval` seconds.
Since writeboost correctly implements the bio semantics,
forcefully writing out the dirties outside the main path is needless.
However, some users want to be on the safe side by enabling this.

Target Interface
================
All the operations are via the dmsetup command.

Constructor
-----------
writeboost <backing dev> <cache dev>

backing dev : slow device holding original data blocks.
cache dev   : fast device holding cached data and its metadata.

Note that the cache device is re-formatted
if the first sector of the cache device is zeroed out.

Status
------
<#dirty caches> <#segments>
<id of the segment lastly migrated>
<id of the segment lastly flushed>
<id of the current segment>
<the position of the cursor>
<16 stat info (r/w) x (hit/miss) x (on buffer/not) x (fullsize/not)>
<# of kv pairs>
<kv pairs>

Messages
========
You can tune up writeboost via the message interface.

* barrier_deadline_ms (ms)
Default: 3
All the bios with barrier flags like REQ_FUA or REQ_FLUSH
are guaranteed to be acked within this deadline.

* allow_migrate (bool)
Default: 1
Set to 1 to start migration.

* enable_migration_modulator (bool) and
  migrate_threshold (%)
Default: 1
Set to 1 to run migration modulator.
Migration modulator surveils the load of the backing store
and enables migration when the load is
lower than migrate_threshold.

* nr_max_batched_migration (int)
Default: 1
Number of segments to migrate simultaneously and atomically.
Set a higher value to fully exploit the capacity of the backing store.

* sync_interval (sec)
Default: 60
All the dirty writes are guaranteed to be persistent by this interval.

* update_record_interval (sec)
Default: 60
The superblock record is updated every update_record_interval seconds.
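
For example, assuming the device was created as writeboost-vol as in
the Example below, tuning looks like:

dmsetup message writeboost-vol 0 barrier_deadline_ms 10
dmsetup message writeboost-vol 0 allow_migrate 0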

Example
=======
dd if=/dev/zero of=${CACHE} bs=512 count=1 oflag=direct
sz=`blockdev --getsize ${BACKING}`
dmsetup create writeboost-vol --table "0 ${sz} writeboost ${BACKING} ${CACHE}"

* 4. TODO
- rename struct arr
-- It is like flex_array but lighter, eliminating the resizability.
   Maybe bigarray is the next candidate but I haven't decided on this.
   I want to reach an agreement on this renaming issue before doing it.
- resume, preresume and postsuspend possibly have to be implemented.
-- But I have no idea at all.
-- Maybe, I should do some research on other 


Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]

2013-09-21 Thread Akira Hayakawa
Mike,

> We don't need to go through staging.  If the dm-writeboost target is
> designed well and provides a tangible benefit it doesn't need
> wide-spread users as justification for going in.  The users will come if
> it is implemented well.
OK.
The benefit of introducing writeboost will be documented.
The points will be:
1. READs often hit in the page cache.
   That's what the page cache is all about.
   The READ cache only caches the rest that the page cache couldn't.
2. A backing store in RAID mode is crazily slow at WRITEs,
   especially if it is RAID-5.
There is no silver bullet in caching software,
but I believe writeboost can fit in many situations.


> Have you looked at how both dm-cache and dm-thinp handle this?
> Userspace takes care to write all zeroes to the start of the metadata
> device before the first use in the kernel.
Zeroing the first sector as the sign that formatting is needed
sounds nice for writeboost too.
It's simple and I like it.
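
Operationally, the flow would look like this (a sketch; wbdev and the
shell variables are placeholders):

# Zeroing the first sector marks the cache device as needing a format;
# the target .ctr then formats it when the device is created.
dd if=/dev/zero of=${CACHE} bs=512 count=1 oflag=direct
sz=`blockdev --getsize ${BACKING}`
dmsetup create wbdev --table "0 ${sz} writeboost ${BACKING} ${CACHE}"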


> Could be the log structured nature of writeboost is very different.
> I'll review this closer tomorrow.
I should mention the big design difference
between writeboost and dm-cache
to help you understand the nature of writeboost.

Writeboost doesn't have a segregated metadata device like dm-cache does.
Data and metadata coexist on the same cache device.
That is what log-structured means.
Data and its relevant metadata are packed into a log segment
and written to the cache device atomically,
which makes writeboost reliable and fast.
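
As a rough sketch of one log segment (my illustration, not the exact
on-disk layout):

+-------------------+--------+--------+--   --+--------+
| segment header    | data 0 | data 1 |  ...  | data N |
| (metablocks etc.) |        |        |       |        |
+-------------------+--------+--------+--   --+--------+
 one segment, written to the cache device atomically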
So, 
> could be factored out.  I haven't yet looked close enough at that aspect
> of writeboost code to know if it could benefit from the existing
> bio-prison code or persistent-data library at all.  writeboost would
> obviously need a new space map type, etc.
what makes sense to dm-cache might not make sense to writeboost.
At a glance, they don't fit the design of writeboost.
But I will investigate these functionalities further at a later time.


> sounds like a step in the right direction.  Plus you can share the cache
> by layering multiple linear devices on top of the dm-writeboost device.
They are theoretically different, but it is actually a trade-off,
and not a big problem compared to fitting into device-mapper.


> Also managing dm-writeboost devices with lvm2 is a priority, so any
> interface similarities dm-writeboost has with dm-cache will be
> beneficial.
It sounds really good to me.
Huge benefit.


Akira

On 9/18/13 5:59 AM, Mike Snitzer wrote:
> On Tue, Sep 17 2013 at  8:43am -0400,
> Akira Hayakawa  wrote:
> 
>> Hi, Mike
>>
>> There are two designs in my mind
>> regarding formatting the cache.
>>
>> You said
>>>   administer the writeboost devices.  There is no need for this.  Just
>>>   have a normal DM target whose .ctr takes care of validation and
>>>   determines whether a device needs formatting, etc.  
>> makes me wonder how I format the cache device.
>>
>>
>> There are two choices for formatting the cache and creating a writeboost
>> device, given the removal of the writeboost-mgr that exists in the current
>> design.
>> I will explain them starting from how the interface will look.
>>
>> (1) dmsetup create myDevice ... "... $backing_path $cache_path"
>> which returns an error if the superblock of the given cache device
>> is invalid and needs formatting.
>> The user then formats the cache device with some userland tool.
>>
>> (2) dmsetup create myDevice ... "... $backing_path $cache_path $do_format"
>> which also returns an error if the superblock of the given cache device
>> is invalid and needs formatting when $do_format is 0.
>> The user then formats the cache device by setting $do_format to 1 and
>> trying again.
>>
>> There are pros and cons to these design tradeoffs:
>> - (i)  (1) is simpler. The do_format parameter in (2) doesn't seem sane.
>>        (1) is like the interface of a filesystem, where dmsetup create is
>>        like mounting a filesystem.
>> - (ii) (2) can implement everything in the kernel. It can gather all the
>>        information about the superblock in one place, the kernel code.
>>
>> Excuse for the current design:
>> - The reason I designed writeboost-mgr is almost entirely (ii) above.
>>   writeboost-mgr has a message "format_cache_device" and
>>   the writeboost-format-cache userland command kicks the message to format
>>   the cache.
>>
>> - writeboost-mgr also has a message "resume_cache"
>>   that validates and builds an in-memory structure according to the cache
>>   bound to the given $cache_id,
>>   and the user later runs dmsetup create for the writeboost device with
>>   that $cache_id.
>>   However, resuming the cache metadata should be done under .ctr like
>>   dm-cache does,
>>   and should not relate the LV being created to a cache via an external
>>   cache_id; this is what I realized by looking at the code of dm-cache,
>>   which calls the dm_cache_metadata_open() routines under .ctr.
> 
> Right, any in-core structures should be allocated in .ctr()
> 
>> writeboost-mgr is something like a smell of over-engineering but
>> is useful for simplifying the design for the above reasons.
>>
>> Which do you think better?
> 
> Have you looked at how both dm-cache and dm-thinp handle
