Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Mike,

I am happy to see that people from the filesystem layer down to the block
subsystem have been discussing how to handle barriers, each layer almost
independently.

>> Merging the barriers and replacing them with a single FLUSH
>> while accepting a lot of writes
>> is the reason for deferring barriers in writeboost.
>> If you want to know more, I recommend you
>> look at the source code to see
>> how queue_barrier_io() is used and
>> how the barriers are kidnapped in queue_flushing().
>
> AFAICT, this is an unfortunate hack resulting from dm-writeboost being a
> bio-based DM target. The block layer already has support for FLUSH
> merging, see commit ae1b1539622fb4 ("block: reimplement FLUSH/FUA to
> support merge")

I have read the comments on this patch:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae

My understanding is that REQ_FUA and REQ_FLUSH are decomposed into more
primitive flags according to the properties of the device. {PRE|POST}FLUSH
requests are queued on one of the two flush_queue[] lists (often called the
"pending" queue), and blk_kick_flush() defers flushing; later, once a few
conditions are satisfied, it inserts a single flush request no matter how
many flush requests sit in the pending queue (it only checks
!list_empty(pending)).

If my understanding is correct, we are deferring flushes across three
layers. Let me summarize:
- Filesystem: Dave said that metadata journaling defers barriers.
- Device-mapper: writeboost, dm-cache, and dm-thin defer barriers.
- Block: it defers barriers, which ends up merging several requests into one.

I think writeboost cannot drop this deferring hack, because deferring the
barriers is usually very effective at filling the RAM buffer, which raises
the write throughput and lowers CPU usage. However, for particular cases
such as the one Dave pointed out, this hack is just a disturbance.
Even for writeboost, the hack in that patch is unfortunately just a
disturbance too. An upper layer disliking a lower layer's hidden
optimization is simply a limitation of the layered architecture of the
Linux kernel. I think all three layers are thinking almost the same thing:
these hacks are all good, and having each layer provide a switch to turn
the optimization on and off is the compromise we have to make.

All these problems originate from the fact that we have a volatile cache;
persistent memory would take them away. With persistent memory available,
writeboost could switch off the deferring of barriers. However, every
server being equipped with persistent memory is still a future tale. So my
idea is to support both RAM buffer types (volatile and non-volatile), and
for the volatile type the deferring hack is a good compromise.

Akira
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Dave,

> i.e. there's no point justifying a behaviour with "we could do this
> in future so lets ignore the impact on current users"...

Sure. I would be happy if we can find a solution that is good for both of
us, or, in other words, for both the filesystem and block layers.

> e.g. what happens if a user has a mixed workload - one where
> performance benefits are only seen by delaying FUA, and another that
> is seriously slowed down by delaying FUA requests? This is where
> knobs are problematic

You are right, but there is no perfect solution that satisfies everyone,
and handling each requirement separately would only complicate the code.

Stepping back from the user and focusing on the filesystem-block boundary:

>> Maybe, writeboost should disable deferring barriers
>> if the barrier_deadline_ms parameter is 0.

I believe adding a switch so that the mounted filesystem decides on/off is
a simple but effective solution. Deciding per bio instead of per device
could be another solution: I would be happy if I could check whether a
given bio "may or may not defer the barrier".

Akira
Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Christoph,

> You can detect O_DIRECT writes by second-guessing a special combination
> of REQ_ flags only used there, as cfq tries to treat it specially:
>
> #define WRITE_SYNC (WRITE | REQ_SYNC | REQ_NOIDLE)
> #define WRITE_ODIRECT (WRITE | REQ_SYNC)
>
> the lack of REQ_NOIDLE when REQ_SYNC is set gives it away. Not related
> to the FLUSH or FUA flags in any way, though.

Thanks. But our problem is to detect whether a bio may or may not be
deferred. Is REQ_NOIDLE the flag for that?

> Akira, can you explain the workloads where your delay of FLUSH or FUA
> requests helps you in any way? I very much agree with Dave's reasoning,
> but if you found workloads where your hack helps we should make sure we
> fix them at the place where they are issued.

One example is a file server accessed by multiple users; a barrier is
submitted when a user closes a file, for instance. As I said in my
previous post,
https://lkml.org/lkml/2013/10/4/186
writeboost has a RAM buffer, and we want it to be filled with writes and
then flushed to the cache device, which takes all the barriers away upon
completion. In that case we pay the minimum penalty for the barriers.
Interestingly, writeboost is happy with a lot of writes: by deferring
these barriers (FLUSH and FUA), multiple barriers are likely to be merged
in a RAM buffer and then processed with only a single FLUSH.

Merging the barriers and replacing them with a single FLUSH while
accepting a lot of writes is the reason for deferring barriers in
writeboost. If you want to know more, I recommend you look at the source
code to see how queue_barrier_io() is used and how the barriers are
kidnapped in queue_flushing().

Akira
Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, Oct 08, 2013 at 10:43:07AM +1100, Dave Chinner wrote:
> > Maybe, writeboost should disable deferring barriers
> > if the barrier_deadline_ms parameter is 0.
> > The Linux kernel's layered architecture is obviously not always
> > perfect, so there are similar cases at other boundaries,
> > such as O_DIRECT bypassing the page cache.
>
> Right - but you can't detect O_DIRECT at the dm layer. IOWs, you're
> relying on the user tweaking the correct knobs for their workload.

You can detect O_DIRECT writes by second-guessing a special combination
of REQ_ flags only used there, as cfq tries to treat it specially:

#define WRITE_SYNC (WRITE | REQ_SYNC | REQ_NOIDLE)
#define WRITE_ODIRECT (WRITE | REQ_SYNC)

the lack of REQ_NOIDLE when REQ_SYNC is set gives it away. Not related
to the FLUSH or FUA flags in any way, though.

Akira, can you explain the workloads where your delay of FLUSH or FUA
requests helps you in any way? I very much agree with Dave's reasoning,
but if you found workloads where your hack helps we should make sure we
fix them at the place where they are issued.
Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Sat, Oct 05, 2013 at 04:51:16PM +0900, Akira Hayakawa wrote:
> Dave,
>
> > That's where arbitrary delays in the storage stack below XFS cause
> > problems - if the first FUA log write is delayed, the next log
> > buffer will get filled, issued and delayed, and when we run out of
> > log buffers (there are 8 maximum) the entire log subsystem will
> > stall, stopping *all* log commit operations until log buffer
> > IOs complete and become free again. i.e. it can stall modifications
> > across the entire filesystem while we wait for batch timeouts to
> > expire and issue and complete FUA requests.
> To me, this sounds like a design failure in the XFS log subsystem.

If you say so. As it is, XFS is the best of all the Linux filesystems when
it comes to performance under a heavy fsync workload, so if you consider
it broken by design then you've got a horror show waiting for you on any
other filesystem...

> Or just a limitation of metadata journaling.

It's a recovery limitation - the more uncompleted log buffers we have
outstanding, the more space in the log will be considered unrecoverable
during a crash...

> > IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
> > point where they are issued - any attempt to further optimise them
> > by adding delays down in the stack to aggregate FUA operations will
> > only increase latency of the operations that the issuer wants to have
> > complete as fast as possible
> That a lower layer of the stack attempts to optimize further
> can benefit any filesystem.
> So, your opinion is not always correct, although
> it is always correct for error handling or memory management.
>
> I have proposed a future plan of using persistent memory.
> I believe that with this leap forward
> filesystems are freed from doing such optimizations
> relevant to write barriers. For more detail, please see my post:
> https://lkml.org/lkml/2013/10/4/186

Sure, we already do that in the storage stack to minimise the impact of
FUA operations - it's called a non-volatile write cache, and most RAID
controllers have them. They rely on immediate dispatch of FUA operations
to get them into the write cache as quickly as possible (i.e. what
filesystems do right now), and that is something your proposed behaviour
will prevent.

i.e. there's no point justifying a behaviour with "we could do this
in future so lets ignore the impact on current users"...

> However,
> I think I should leave an option to disable the optimization
> in case the upper layer doesn't like it.
> Maybe, writeboost should disable deferring barriers
> if the barrier_deadline_ms parameter is 0.
> The Linux kernel's layered architecture is obviously not always perfect,
> so there are similar cases at other boundaries,
> such as O_DIRECT bypassing the page cache.

Right - but you can't detect O_DIRECT at the dm layer. IOWs, you're
relying on the user tweaking the correct knobs for their workload.

e.g. what happens if a user has a mixed workload - one where
performance benefits are only seen by delaying FUA, and another that
is seriously slowed down by delaying FUA requests? This is where
knobs are problematic

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Dave,

> That's where arbitrary delays in the storage stack below XFS cause
> problems - if the first FUA log write is delayed, the next log
> buffer will get filled, issued and delayed, and when we run out of
> log buffers (there are 8 maximum) the entire log subsystem will
> stall, stopping *all* log commit operations until log buffer
> IOs complete and become free again. i.e. it can stall modifications
> across the entire filesystem while we wait for batch timeouts to
> expire and issue and complete FUA requests.

To me, this sounds like a design failure in the XFS log subsystem. Or just
a limitation of metadata journaling.

> IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the
> point where they are issued - any attempt to further optimise them
> by adding delays down in the stack to aggregate FUA operations will
> only increase latency of the operations that the issuer wants to have
> complete as fast as possible

That a lower layer of the stack attempts to optimize further can benefit
any filesystem. So your opinion is not always correct, although it is
always correct for error handling or memory management.

I have proposed a future plan of using persistent memory. I believe that
with this leap forward, filesystems are freed from doing such
optimizations relevant to write barriers. For more detail, please see my
post:
https://lkml.org/lkml/2013/10/4/186

However, I think I should leave an option to disable the optimization in
case the upper layer doesn't like it. Maybe writeboost should disable
deferring barriers if the barrier_deadline_ms parameter is 0. The Linux
kernel's layered architecture is obviously not always perfect, so there
are similar cases at other boundaries, such as O_DIRECT bypassing the
page cache. Maybe dm-thin and dm-cache should add such a switch too.

Akira
Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Wed, Oct 02, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
>
> On Tue, 1 Oct 2013, Joe Thornber wrote:
> > > Alternatively, delaying them will stall the filesystem because it's
> > > waiting for said REQ_FUA IO to complete. For example, journal writes
> > > in XFS are extremely IO latency sensitive in workloads that have a
> > > significant number of ordering constraints (e.g. O_SYNC writes,
> > > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> > > filesystem for the majority of that barrier_deadline_ms.
> >
> > Yes, this is a valid concern, but I assume Akira has benchmarked.
> > With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
> > see if there are any other FUA requests on my queue that can be
> > aggregated into a single flush. I agree with you that the target
> > should never delay waiting for new io; that's asking for trouble.
> >
> > - Joe
>
> You could send the first REQ_FUA/REQ_FLUSH request directly to the disk
> and aggregate all the requests that were received while you processed the
> initial request. This way, you can do request batching without introducing
> artificial delays.

Yes, that's what XFS does with its log when lots of fsync requests come
in. i.e. the first is dispatched immediately, and the others are gathered
into the next log buffer until it is either full or the original REQ_FUA
log write completes.

That's where arbitrary delays in the storage stack below XFS cause
problems - if the first FUA log write is delayed, the next log buffer will
get filled, issued and delayed, and when we run out of log buffers (there
are 8 maximum) the entire log subsystem will stall, stopping *all* log
commit operations until log buffer IOs complete and become free again.
i.e. it can stall modifications across the entire filesystem while we wait
for batch timeouts to expire and issue and complete FUA requests.

IMNSHO, REQ_FUA/REQ_FLUSH optimisations should be done at the point where
they are issued - any attempt to further optimise them by adding delays
down in the stack to aggregate FUA operations will only increase latency
of the operations that the issuer wants to have complete as fast as
possible.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [dm-devel] Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, 1 Oct 2013, Joe Thornber wrote:

> > Alternatively, delaying them will stall the filesystem because it's
> > waiting for said REQ_FUA IO to complete. For example, journal writes
> > in XFS are extremely IO latency sensitive in workloads that have a
> > significant number of ordering constraints (e.g. O_SYNC writes,
> > fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> > filesystem for the majority of that barrier_deadline_ms.
>
> Yes, this is a valid concern, but I assume Akira has benchmarked.
> With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to
> see if there are any other FUA requests on my queue that can be
> aggregated into a single flush. I agree with you that the target
> should never delay waiting for new io; that's asking for trouble.
>
> - Joe

You could send the first REQ_FUA/REQ_FLUSH request directly to the disk
and aggregate all the requests that were received while you processed the
initial request. This way, you can do request batching without introducing
artificial delays.

Mikulas
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Thu, Sep 26, 2013 at 01:43:25PM +1000, Dave Chinner wrote:
> On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> > * Deferring ACK for barrier writes
> > Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> > Immediately handling these bios badly slows down writeboost.
> > It surveils the bios with these flags and forcefully flushes them
> > at worst case within `barrier_deadline_ms` period.
>
> That rings alarm bells.
>
> If the filesystem is using REQ_FUA/REQ_FLUSH for ordering reasons,
> delaying them to allow other IOs to be submitted and dispatched may
> very well violate the IO ordering constraints the filesystem is
> trying to achieve.

If the fs is using REQ_FUA for ordering they need to wait for completion of that bio before issuing any subsequent bio that needs to be strictly ordered. So I don't think there is any issue here.

> Alternatively, delaying them will stall the filesystem because it's
> waiting for said REQ_FUA IO to complete. For example, journal writes
> in XFS are extremely IO latency sensitive in workloads that have a
> significant number of ordering constraints (e.g. O_SYNC writes,
> fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the
> filesystem for the majority of that barrier_deadline_ms.

Yes, this is a valid concern, but I assume Akira has benchmarked. With dm-thin, I delay the REQ_FUA/REQ_FLUSH for a tiny amount, just to see if there are any other FUA requests on my queue that can be aggregated into a single flush. I agree with you that the target should never delay waiting for new io; that's asking for trouble.

- Joe
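The deferral scheme being debated (hold barrier bios back and release them all with one flush, at worst after `barrier_deadline_ms`) can be modeled with a few counters. This is a sketch of the bookkeeping only; the real driver's `queue_barrier_io()` and `flush_barrier_ios()` operate on bios, workqueues and timers, and the field names below are invented.

```c
#include <assert.h>
#include <stdbool.h>

struct barrier_queue {
	int nr_barriers;        /* REQ_FUA/REQ_FLUSH bios being held back */
	long deadline_ms;       /* the barrier_deadline_ms tunable */
	long oldest_arrival_ms; /* arrival time of the first held bio */
	int flushes_issued;     /* device flushes actually performed */
};

/* Hold a barrier bio instead of acking it immediately. */
static void queue_barrier_io(struct barrier_queue *q, long now_ms)
{
	if (q->nr_barriers == 0)
		q->oldest_arrival_ms = now_ms;
	q->nr_barriers++;
}

/* Fired when the RAM buffer fills or the deadline timer expires:
 * a single device flush releases every queued barrier at once. */
static int flush_barriers(struct barrier_queue *q)
{
	int released = q->nr_barriers;
	if (released)
		q->flushes_issued++;
	q->nr_barriers = 0;
	return released;
}

/* Dave's worst case: a lone barrier can wait the full deadline here. */
static bool deadline_expired(const struct barrier_queue *q, long now_ms)
{
	return q->nr_barriers > 0 &&
	       now_ms - q->oldest_arrival_ms >= q->deadline_ms;
}
```

The win is that n queued barriers cost one flush; the cost is exactly the added latency on the first barrier that Dave objects to.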
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Hi,

Two major pieces of progress:
1) .ctr accepts the segment size, so .ctr now accepts 3 arguments:
   <backing dev> <cache dev> <segment size order>.
2) Folded the small split files, as suggested in the previous progress report.

For 1)
I use a zero-length array to dynamically accept the segment size. Previously writeboost had the parameter embedded, and having to re-compile the code to change it badly hurt usability.

For 2)
> Unfortunately I think you went too far with all these different small
> files, I was hoping to see 2 or 3 .c files and a couple .h files.
>
> Maybe fold all the daemon code into a 1 .c and 1 .h ?
>
> The core of the writeboost target in dm-writeboost-target.c ?
>
> And fold all the other data structures into a 1 .c and 1 .h ?
>
> When folding these files together feel free to use dividers in the code
> like dm-thin.c and dm-cache-target.c do, e.g.:
>
> /*----------------*/

As Mike pointed out, splitting into almost 20 files went too far. I aggregated these files into 3 .c files and 3 .h files in total, which are shown below.

-- Summary --
  39 dm-writeboost-daemon.h
  46 dm-writeboost-metadata.h
 413 dm-writeboost.h
 577 dm-writeboost-daemon.c
1129 dm-writeboost-metadata.c
1212 dm-writeboost-target.c
  81 dm-writeboost.mod.c

The responsibility of each .c file is the policy behind this split.

a) dm-writeboost-metadata.c
This file knows how the metadata is laid out on the cache device. It can audit/format the cache device metadata and resume/free the in-core cache metadata from that on the cache device. It also provides accessors to the resumed in-core metadata.

b) dm-writeboost-target.c
This file contains all the methods defining the target type. In terms of I/O processing, this file only defines the path from when a bio is accepted to when a flush job is queued, which is described as "foreground processing" in the document. What happens after the job is queued is defined in the -daemon.c file.

c) dm-writeboost-daemon.c
This file contains all the daemons, as Mike suggested.
Maybe superblock_recorder should be in the -metadata.c file, but I chose to put it in this file for unity.

Thanks,
Akira

followed by the current .h files.

-- dm-writeboost-daemon.h --

/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.w...@gmail.com>
 *
 * This file is released under the GPL.
 */

#ifndef DM_WRITEBOOST_DAEMON_H
#define DM_WRITEBOOST_DAEMON_H

/*----------------*/

void flush_proc(struct work_struct *);

/*----------------*/

void queue_barrier_io(struct wb_cache *, struct bio *);
void barrier_deadline_proc(unsigned long data);
void flush_barrier_ios(struct work_struct *);

/*----------------*/

void migrate_proc(struct work_struct *);
void wait_for_migration(struct wb_cache *, u64 id);

/*----------------*/

void modulator_proc(struct work_struct *);

/*----------------*/

void sync_proc(struct work_struct *);

/*----------------*/

void recorder_proc(struct work_struct *);

/*----------------*/

#endif

-- dm-writeboost-metadata.h --

/*
 * Copyright (C) 2012-2013 Akira Hayakawa <ruby.w...@gmail.com>
 *
 * This file is released under the GPL.
 */

#ifndef DM_WRITEBOOST_METADATA_H
#define DM_WRITEBOOST_METADATA_H

/*----------------*/

struct segment_header *get_segment_header_by_id(struct wb_cache *,
						u64 segment_id);
sector_t calc_mb_start_sector(struct wb_cache *, struct segment_header *,
			      cache_nr mb_idx);
bool is_on_buffer(struct wb_cache *, cache_nr mb_idx);

/*----------------*/

struct ht_head *ht_get_head(struct wb_cache *, struct lookup_key *);
struct metablock *ht_lookup(struct wb_cache *,
			    struct ht_head *, struct lookup_key *);
void ht_register(struct wb_cache *, struct ht_head *,
		 struct lookup_key *, struct metablock *);
void ht_del(struct wb_cache *, struct metablock *);
void discard_caches_inseg(struct wb_cache *, struct segment_header *);

/*----------------*/

int __must_check audit_cache_device(struct dm_dev *, struct wb_cache *,
				    bool *need_format, bool *allow_format);
int __must_check format_cache_device(struct dm_dev *, struct wb_cache *);

/*----------------*/

void prepare_segment_header_device(struct segment_header_device *dest,
				   struct wb_cache *,
				   struct segment_header *src);

/*----------------*/

int __must_check
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Wed, Sep 25 2013 at 9:47pm -0400, Akira Hayakawa wrote:

> Hi, Mike
>
> The monolithic source code (3.2k) is nicely split into almost 20 *.c
> files according to the functionality and data structures, in OOP style.
>
> The aim of this posting is to share how the splitting looks.
>
> I believe that at least reading the *.h files can convince you the
> splitting is clear.
>
> The code is now tainted with almost 20 version switch macros and WB*
> debug macros, but I will clean them up before sending the patch.
>
> Again, the latest code can be cloned by
> git clone https://github.com/akiradeveloper/dm-writeboost.git
>
> I will make a few updates to the source code this weekend, so please
> track it to follow the latest version. Below is only a snapshot.
>
> Akira
>
> -- Summary --
>  33 Makefile
>  10 bigarray.h
>  19 cache-alloc.h
>  10 defer-barrier.h
>   8 dirty-sync.h
>   8 flush-daemon.h
>  10 format-cache.h
>  24 handle-io.h
>  16 hashtable.h
>  18 migrate-daemon.h
>   7 migrate-modulator.h
>  12 queue-flush-job.h
>   8 rambuf.h
>  13 recover.h
>  18 segment.h
>   8 superblock-recorder.h
>   9 target.h
>  30 util.h
> 384 writeboost.h
>  99 bigarray.c
> 192 cache-alloc.c
>  36 defer-barrier.c
>  33 dirty-sync.c
>  85 flush-daemon.c
> 234 format-cache.c
> 553 handle-io.c
> 109 hashtable.c
> 345 migrate-daemon.c
>  41 migrate-modulator.c
> 169 queue-flush-job.c
>  52 rambuf.c
> 308 recover.c
> 118 segment.c
>  61 superblock-recorder.c
> 376 target.c
> 126 util.c

Unfortunately I think you went too far with all these different small files; I was hoping to see 2 or 3 .c files and a couple of .h files.

Maybe fold all the daemon code into 1 .c and 1 .h?

The core of the writeboost target in dm-writeboost-target.c?

And fold all the other data structures into 1 .c and 1 .h?
When folding these files together feel free to use dividers in the code like dm-thin.c and dm-cache-target.c do, e.g.:

/*----------------*/

Mike
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:
> * Deferring ACK for barrier writes
> Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> Immediately handling these bios badly slows down writeboost.
> It surveils the bios with these flags and forcefully flushes them
> at worst case within `barrier_deadline_ms` period.

That rings alarm bells.

If the filesystem is using REQ_FUA/REQ_FLUSH for ordering reasons, delaying them to allow other IOs to be submitted and dispatched may very well violate the IO ordering constraints the filesystem is trying to achieve.

Alternatively, delaying them will stall the filesystem because it's waiting for said REQ_FUA IO to complete. For example, journal writes in XFS are extremely IO latency sensitive in workloads that have a significant number of ordering constraints (e.g. O_SYNC writes, fsync, etc) and delaying even one REQ_FUA/REQ_FLUSH can stall the filesystem for the majority of that barrier_deadline_ms.

i.e. this says to me that the best performance you can get from such workloads is one synchronous operation per process per barrier_deadline_ms, even when the storage and filesystem might be capable of executing hundreds of synchronous operations per barrier_deadline_ms.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Hi, Mike

I have made more progress yesterday: splitting the monolithic source code into meaningful pieces is done. It will follow in the next mail.

> Yes, please share your plan. Anything that can simplify the code layout
> is best done earlier to simplify code review.

Sorry, it should have been done at an earlier stage.

First, I reply to each of your comments.

> OK, but the thing is upper level consumers in the IO stack, like ext4,
> expect that when the REQ_FLUSH completes that the device has in fact
> flushed any transient state in memory. So I'm not seeing how handling
> these lazily is an option. Though I do appreciate that dm-cache (and
> dm-thin) do take similar approaches. Would like to get Joe Thornber's
> insight here.

When the upper level consumer receives the completion of a bio with REQ_FLUSH, all the transient state has been made persistent. writeboost does four steps to accomplish this:
1. Queue the flush job with the current transient state (RAM buffer).
2. Wait for the completion of the flush job being written to the cache device.
3. blkdev_issue_flush() to the cache device to make all the writes persistent.
4. bio_endio() to the said flagged bios.
Unless the implementation is wrong, I believe this works as the consumers expect.

> These seem reasonable to me. Will need to have a look at thread naming
> to make sure the names reflect they are part of a dm-writeboost service.

I changed the former "Cache Synchronizer" to "Dirty Synchronizer", but it still sounds a little odd. Naming is truly difficult.

> You don't allow user to specify the "segment size"? I'd expect tuning
> that could be important based on the underlying storage capabilities
> (e.g. having the segment size match that of the SSD's erase block or
> matching the backing device's full stripe width?). So something similar
> to what we have in dm-cache's blocksize.

For the current implementation, no.
The segment size is hard-coded in the source code, and one has to re-compile the module to change it. But hard-coding the size has a reasonable background in performance and simplification.

Please look at this code fragment from the .map method, which does:
(1) writeboost first sees hit/miss and gets the metablock (mb).
(2) It then has to get the segment_header "logically" containing the metablock.

mb = ht_lookup(cache, head, &key); /* (1) */
if (mb) {
	seg = ((void *) mb) - (mb->idx % NR_CACHES_INSEG) * /* (2) */
			      sizeof(struct metablock);
	atomic_inc(&seg->nr_inflight_ios);
}

#define NR_CACHES_INSEG ((1 << (WB_SEGMENTSIZE_ORDER - 3)) - 1)

/* (3) */
struct segment_header {
	struct metablock mb_array[NR_CACHES_INSEG];
	...

In the current implementation I deliberately place the metablocks "physically" inside the segment header (3), so calculating the segment header containing a metablock is just a simple address calculation, which performs well. Since writeboost focuses on peak write performance, this light-weight lookup is its lifeline.

If I re-design writeboost to accept the segment size in .ctr this technique will be impossible, since NR_CACHES_INSEG cannot be known before accepting it. It is just a matter of tradeoff. But having purged cache-sharing probably gives me another chance at a fancy technique that does the same thing with reasonable overhead and code complexity. I will try to think of it. I know that forcing ordinary users to re-compile the kernel sounds harsh.

> I'll look at the code but it strikes me as odd that the first sector of
> the cache device is checked yet the last sector of the first MB of the
> cache is where the superblock resides. I'd think you'd want to have the
> check on whether to format or not to be the same location as the
> superblock?

The first sector of the first 1MB is called the Superblock Header, and the last sector of the first 1MB is called the Superblock Record.
The former contains information fixed at initialization, and the latter contains information updated at runtime by the Superblock Recorder daemon. The latter is also checked in the initialization step; the logic is in recover_cache(). If it contains an updated `last_migrated_segment_id`, the time for recover_cache() becomes short.

> So this "<16 stat info (r/w)", is that like /proc/diskstats ? Are you
> aware that dm-stats exists now and can be used instead of needing to
> track these stats in dm-writeboost?

Sort of. But the difference is that this information is about how a bio went through the path inside writeboost. They are like "read hits", "read misses" ... in dm-cache status, so I don't think I need to discard them. I read through the statistics document https://lwn.net/Articles/566273/ and I understand that dm-stats only surveils the external I/O statistics, not the internal conditional branches in detail.

> Whatever name you come up with, please add a "dm_" prefix.

Add the dm_ prefix only to structs, or to all filenames and function
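The address trick from the .map fragment earlier in this mail can be reproduced in isolation. The sketch below keeps only the fields the trick needs; WB_SEGMENTSIZE_ORDER is given an assumed value, and the struct bodies are stripped down, so this is an illustration rather than the driver's actual definitions.

```c
#include <assert.h>

#define WB_SEGMENTSIZE_ORDER 7   /* assumed value for this sketch */
#define NR_CACHES_INSEG ((1 << (WB_SEGMENTSIZE_ORDER - 3)) - 1)

struct metablock {
	unsigned idx;   /* global metablock index, as in the mail */
};

struct segment_header {
	/* Must stay the first member for the pointer trick to work. */
	struct metablock mb_array[NR_CACHES_INSEG];
	unsigned long nr_inflight_ios;
};

/* Step (2) of the fragment: step back over the metablocks preceding
 * mb in its segment, landing on mb_array[0], whose address is also
 * the segment_header's own address. */
static struct segment_header *mb_to_seg(struct metablock *mb)
{
	return (struct segment_header *)
		((char *)mb -
		 (mb->idx % NR_CACHES_INSEG) * sizeof(struct metablock));
}
```

Because `mb->idx % NR_CACHES_INSEG` is the metablock's position within its segment, the subtraction is pure arithmetic with no table lookup, which is why hard-coding the segment size keeps the hot path so cheap.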
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote: > Hi, Mike > > I am now working on redesigning and implementation > of dm-writeboost. Ok, I'm dropping your original patch, please resend when you have something you want merged into drivers/staging/ thanks, greg k-h
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, Sep 24 2013 at 8:20am -0400, Akira Hayakawa wrote:

> Hi, Mike
>
> I am now working on redesigning and implementation of dm-writeboost.
>
> This is a progress report.
>
> Please run
> git clone https://github.com/akiradeveloper/dm-writeboost.git
> to see the full set of the code.

I likely won't be able to look closely at the code until Monday (9/30); I have some higher priority reviews and issues to take care of this week. But I'm very encouraged by what you've shared below; looks like things are moving in the right direction. Great job.

> * 1. Current Status
> writeboost in the new design passed my test.
> Documentation is ongoing.
>
> * 2. Big Changes
> - Cache-sharing purged
> - All sysfs purged
> - All userland tools in Python purged
> -- dmsetup is the only user interface now
> - The daemon in userland is ported to the kernel
> - On-disk metadata are in little endian
> - 300 lines of code shed in kernel
> -- The Python scripts were 500 LOC, so 800 LOC in total
> -- It is now about 3.2k LOC, all in kernel
> - Comments are added neatly
> - Code reordered so that it is more readable
>
> * 3. Documentation in Draft
> This is the current document that will be under Documentation/device-mapper
>
> dm-writeboost
> =============
> The writeboost target provides log-structured caching.
> It batches random writes into a big sequential write to a cache device.
>
> It is like dm-cache, but the difference is that writeboost focuses on
> handling bursty writes and the lifetime of the SSD cache device.
>
> Auxiliary PDF documents and quick-start scripts are available in
> https://github.com/akiradeveloper/dm-writeboost
>
> Design
> ======
> There is a foreground path and 6 background daemons.
>
> Foreground
> ----------
> It accepts bios and puts writes to the RAM buffer.
> When the buffer is full, it creates a "flush job" and queues it.
>
> Background
> ----------
> * Flush Daemon
> Pop a flush job from the queue and execute it.
> * Deferring ACK for barrier writes
> Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily.
> Immediately handling these bios badly slows down writeboost.
> It surveils the bios with these flags and forcefully flushes them
> at worst case within `barrier_deadline_ms` period.

OK, but the thing is upper level consumers in the IO stack, like ext4, expect that when the REQ_FLUSH completes the device has in fact flushed any transient state in memory. So I'm not seeing how handling these lazily is an option. Though I do appreciate that dm-cache (and dm-thin) do take similar approaches. Would like to get Joe Thornber's insight here.

> * Migration Daemon
> It migrates (writes back to the backing store) the data on the cache
> device in segment granularity.
>
> If `allow_migrate` is true, it migrates even without an impending
> situation. An impending situation is when there is no room in the
> cache device for writing further flush jobs.
>
> Migration is done batching `nr_max_batched_migration` segments at a
> time at maximum. Therefore, unlike an existing I/O scheduler,
> two dirty writes distant in time can be merged.
>
> * Migration Modulator
> Migration while the backing store is heavily loaded
> grows the device queue and thus makes the situation ever worse.
> This daemon modulates the migration by switching `allow_migrate`.
>
> * Superblock Recorder
> The superblock record is the last sector of the first 1MB region of
> the cache device. It contains the id of the segment lastly migrated.
> This daemon periodically updates the region every
> `update_record_interval` seconds.
>
> * Cache Synchronizer
> This daemon forcefully makes all the dirty writes persistent
> every `sync_interval` seconds.
> Since writeboost correctly implements the bio semantics,
> forcefully writing the dirties out off the main path is needless.
> However, some users want to be on the safe side by enabling this.

These seem reasonable to me.
Will need to have a look at thread naming to make sure the names reflect they are part of a dm-writeboost service.

> Target Interface
> ================
> All the operations are via the dmsetup command.
>
> Constructor
> -----------
> writeboost <backing dev> <cache dev>
>
> backing dev : slow device holding original data blocks.
> cache dev : fast device holding cached data and its metadata.

You don't allow the user to specify the "segment size"? I'd expect tuning that could be important based on the underlying storage capabilities (e.g. having the segment size match that of the SSD's erase block, or matching the backing device's full stripe width). So something similar to what we have in dm-cache's blocksize.

> Note that the cache device is re-formatted
> if the first sector of the cache device is zeroed out.

I'll look at the code but it strikes me as odd that the first sector of the cache device is checked yet the last sector of the first MB of the cache is where the superblock resides. I'd think you'd want to have the check on whether to format or not be the same location as the superblock?

> Status
> ------
> #dirty caches
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, Sep 24 2013 at 8:20am -0400, Akira Hayakawa ruby.w...@gmail.com wrote: Hi, Mike I am now working on redesigning and implementation of dm-writeboost. This is a progress report. Please run git clone https://github.com/akiradeveloper/dm-writeboost.git to see full set of the code. I likely won't be able to look closely at the code until Monday (9/30); I have some higher priority reviews and issues to take care of this week. But I'm very encouraged by what you've shared below; looks like things are moving in the right direction. Great job. * 1. Current Status writeboost in new design passed my test. Documentations are ongoing. * 2. Big Changes - Cache-sharing purged - All Sysfs purged. - All Userland tools in Python purged. -- dmsetup is the only user interface now. - The daemon in userland is ported to kernel. - On-disk metadata are in little endian. - 300 lines of codes shed in kernel -- Python scripts were 500 LOC so 800 LOC in total. -- It is now about 3.2k LOC all in kernel. - Comments are added neatly. - Reorder the codes so that it gets more readable. * 3. Documentation in Draft This is a current document that will be under Documentation/device-mapper dm-writeboost = writeboost target provides log-structured caching. It batches random writes into a big sequential write to a cache device. It is like dm-cache but the difference is that writeboost focuses on handling bursty writes and lifetime of SSD cache device. Auxiliary PDF documents and Quick-start scripts are available in https://github.com/akiradeveloper/dm-writeboost Design == There are foreground path and 6 background daemons. Foreground -- It accepts bios and put writes to RAM buffer. When the buffer is full, it creates a flush job and queues it. Background -- * Flush Daemon Pop a flush job from the queue and executes it. * Deferring ACK for barrier writes Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily. Immediately handling these bios badly slows down writeboost. 
It surveils the bios with these flags and forcefully flushes them at worst case within `barrier_deadline_ms` period. OK, but the thing is upper level consumers in the IO stack, like ext4, expect that when the REQ_FLUSH completes that the device has in fact flushed any transient state in memory. So I'm not seeing how handling these lazily is an option. Though I do appreciate that dm-cache (and dm-thin) do take similar approaches. Would like to get Joe Thornber's insight here. * Migration Daemon It migrates, writes back cache data to backing store, the data on the cache device in segment granurality. If `allow_migrate` is true, it migrates without impending situation. Being in impending situation is that there are no room in cache device for writing further flush jobs. Migration at a time is done batching `nr_max_batched_migration` segments at maximum. Therefore, unlike existing I/O scheduler, two dirty writes distant in time space can be merged. * Migration Modulator Migration while the backing store is heavily loaded grows the device queue and thus makes the situation ever worse. This daemon modulates the migration by switching `allow_migrate`. * Superblock Recorder Superblock record is a last sector of first 1MB region in cache device. It contains what id of the segment lastly migrated. This daemon periodically update the region every `update_record_interval` seconds. * Cache Synchronizer This daemon forcefully makes all the dirty writes persistent every `sync_interval` seconds. Since writeboost correctly implements the bio semantics writing the dirties out forcefully out of the main path is needless. However, some user want to be on the safe side by enabling this. These seem reasonable to me. Will need to have a look at thread naming to make sure the names reflect they are part of a dm-writeboost service. Target Interface All the operations are via dmsetup command. 
> Constructor
> -----------
>
> writeboost <backing dev> <cache dev>
>
> backing dev : slow device holding original data blocks.
> cache dev   : fast device holding cached data and its metadata.

You don't allow the user to specify the segment size? I'd expect tuning that could be important based on the underlying storage capabilities (e.g. having the segment size match that of the SSD's erase block, or matching the backing device's full stripe width). So something similar to what we have in dm-cache's blocksize.

> Note that the cache device is re-formatted if the first sector of the cache device is zeroed out.

I'll look at the code, but it strikes me as odd that the first sector of the cache device is checked, yet the last sector of the first MB of the cache is where the superblock resides. I'd think you'd want the check on whether to format or not to be at the same location as the superblock?

> Status
> ------
>
> <#dirty caches>
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:

> Hi, Mike
>
> I am now working on the redesign and implementation of dm-writeboost.

Ok, I'm dropping your original patch, please resend when you have something you want merged into drivers/staging/

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Hi, Mike

I made more progress yesterday: splitting the monolithic source code into meaningful pieces is done. It will follow in the next mail.

> Yes, please share your plan. Anything that can simplify the code layout is best done earlier to simplify code review.

Sorry, this should have been done at an earlier stage. First, I reply to each of your comments.

> OK, but the thing is upper level consumers in the IO stack, like ext4, expect that when the REQ_FLUSH completes the device has in fact flushed any transient state in memory. So I'm not seeing how handling these lazily is an option. Though I do appreciate that dm-cache (and dm-thin) do take similar approaches. Would like to get Joe Thornber's insight here.

When an upper level consumer receives the completion of a bio sent with REQ_FLUSH, all the transient state is persistent. writeboost does four steps to accomplish this:

1. Queue the flush job with the current transient state (RAM buffer).
2. Wait for the completion of the flush job being written to the cache device.
3. blkdev_issue_flush() to the cache device to make all the writes persistent.
4. bio_endio() to the said flagged bios.

If the implementation isn't wrong, it works as the consumers expect, I believe.

> These seem reasonable to me. Will need to have a look at thread naming to make sure the names reflect they are part of a dm-writeboost service.

I changed the former Cache Synchronizer to Dirty Synchronizer, but it still sounds a little bit odd. Naming is truly difficult.

> You don't allow user to specify the segment size? I'd expect tuning that could be important based on the underlying storage capabilities (e.g. having the segment size match that of the SSD's erase block or matching the backing device's full stripe width?). So something similar to what we have in dm-cache's blocksize.

For the current implementation, no. The segment size is hard-coded in the source code and one has to re-compile the module to change it.
But hard-coding the size has a reasonable background in performance and simplification. Please look at this code fragment from the .map method, which (1) first checks hit/miss and gets the metablock (mb), and (2) then has to get the segment_header logically containing the metablock:

	mb = ht_lookup(cache, head, key); /* (1) */
	if (mb) {
		seg = ((void *) mb) - (mb->idx % NR_CACHES_INSEG) * /* (2) */
		      sizeof(struct metablock);
		atomic_inc(&seg->nr_inflight_ios);
	}

	#define NR_CACHES_INSEG ((1 << (WB_SEGMENTSIZE_ORDER - 3)) - 1)

	struct segment_header {
		struct metablock mb_array[NR_CACHES_INSEG]; /* (3) */

In the current implementation I place the metablocks physically inside the segment header (3), so the calculation of the segment header containing a metablock is just a simple address calculation, which performs well. Since writeboost focuses on peak write performance, this lightweight lookup is the lifeline.

If I re-design writeboost to accept the segment size in .ctr, this technique becomes impossible, since NR_CACHES_INSEG cannot be known before accepting it. It is just a matter of tradeoff. But probably, having purged cache-sharing gives me another chance at a fancy technique that does the same thing with reasonable overhead and code complexity. I will try to think of it. I know forcing ordinary users to re-compile the module sounds harsh.

> I'll look at the code but it strikes me as odd that the first sector of the cache device is checked yet the last sector of the first MB of the cache is where the superblock resides. I'd think you'd want to have the check on whether to format or not to be the same location as the superblock?

The first sector of the first 1MB is called the Superblock Header and the last sector of the first 1MB is called the Superblock Record. The former contains information fixed at initialization and the latter contains information updated at runtime by the Superblock Recorder daemon. The latter is also checked in the initialization step. The logic is in recover_cache().
If it contains an updated `last_migrated_segment_id`, the time for recover_cache() becomes short.

> So this 16 stat info (r/w), is that like /proc/diskstats? Are you aware that dm-stats exists now and can be used instead of needing to track these stats in dm-writeboost?

Sort of. But the difference is that this information is about how a bio went through the path in writeboost. They are like the read hits, read misses ... in dm-cache's status. So I don't think I need to discard it. I read through the statistics document https://lwn.net/Articles/566273/ and I understand that dm-stats only surveils the external I/O statistics, not the internal conditional branches in detail.

> Whatever name you come up with, please add a dm_ prefix.

Should I add the dm_ prefix only to structs, or to all filenames and function names as well? If the latter, it needs really big fixing.
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
On Tue, Sep 24, 2013 at 09:20:50PM +0900, Akira Hayakawa wrote:

> * Deferring ACK for barrier writes
>
> Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily. Immediately handling these bios badly slows down writeboost. It surveils the bios with these flags and forcefully flushes them at worst case within `barrier_deadline_ms` period.

That rings alarm bells. If the filesystem is using REQ_FUA/REQ_FLUSH for ordering reasons, delaying them to allow other IOs to be submitted and dispatched may very well violate the IO ordering constraints the filesystem is trying to achieve. Alternatively, delaying them will stall the filesystem because it's waiting for said REQ_FUA IO to complete.

For example, journal writes in XFS are extremely IO latency sensitive in workloads that have a significant number of ordering constraints (e.g. O_SYNC writes, fsync, etc), and delaying even one REQ_FUA/REQ_FLUSH can stall the filesystem for the majority of that barrier_deadline_ms. i.e. this says to me that the best performance you can get from such workloads is one synchronous operation per process per barrier_deadline_ms, even when the storage and filesystem might be capable of executing hundreds of synchronous operations per barrier_deadline_ms.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Hi, Mike

I am now working on the redesign and implementation of dm-writeboost. This is a progress report. Please run

git clone https://github.com/akiradeveloper/dm-writeboost.git

to see the full set of the code.

* 1. Current Status

writeboost in the new design passed my test. Documentation is ongoing.

* 2. Big Changes

- Cache-sharing purged.
- All Sysfs purged.
- All userland tools in Python purged.
-- dmsetup is the only user interface now.
- The daemon in userland is ported to the kernel.
- On-disk metadata are in little endian.
- 300 lines of code shed in the kernel.
-- The Python scripts were 500 LOC, so 800 LOC in total.
-- It is now about 3.2k LOC, all in kernel.
- Comments are added neatly.
- Reordered the code so that it is more readable.

* 3. Documentation in Draft

This is the current document that will be under Documentation/device-mapper:

dm-writeboost
=============

The writeboost target provides log-structured caching. It batches random writes into a big sequential write to a cache device. It is like dm-cache, but the difference is that writeboost focuses on handling bursty writes and the lifetime of the SSD cache device.

Auxiliary PDF documents and quick-start scripts are available in
https://github.com/akiradeveloper/dm-writeboost

Design
======

There is a foreground path and 6 background daemons.

Foreground
----------

It accepts bios and puts writes to a RAM buffer. When the buffer is full, it creates a "flush job" and queues it.

Background
----------

* Flush Daemon

Pops a flush job from the queue and executes it.

* Deferring ACK for barrier writes

Barrier flags such as REQ_FUA and REQ_FLUSH are handled lazily. Immediately handling these bios badly slows down writeboost. It surveils the bios with these flags and forcefully flushes them, at worst, within the `barrier_deadline_ms` period.

* Migration Daemon

It migrates (writes back to the backing store) the data on the cache device at segment granularity. If `allow_migrate` is true, it migrates even without an impending situation.
Being in an impending situation means there is no room in the cache device for writing further flush jobs. A single migration batches up to `nr_max_batched_migration` segments. Therefore, unlike an existing I/O scheduler, two dirty writes distant in time can be merged.

* Migration Modulator

Migrating while the backing store is heavily loaded grows the device queue and thus makes the situation ever worse. This daemon modulates the migration by switching `allow_migrate`.

* Superblock Recorder

The superblock record is the last sector of the first 1MB region of the cache device. It contains the id of the segment that was last migrated. This daemon periodically updates the record every `update_record_interval` seconds.

* Cache Synchronizer

This daemon forcefully makes all the dirty writes persistent every `sync_interval` seconds. Since writeboost correctly implements the bio semantics, writing the dirties out forcefully outside the main path is needless. However, some users want to be on the safe side by enabling this.

Target Interface
================

All the operations are via the dmsetup command.

Constructor
-----------

writeboost <backing dev> <cache dev>

backing dev : slow device holding original data blocks.
cache dev   : fast device holding cached data and its metadata.

Note that the cache device is re-formatted if the first sector of the cache device is zeroed out.

Status
------

<#dirty caches> <#segments> <id of the segment lastly migrated> <id of the segment lastly flushed> <id of the current segment> <the position of the cursor> <16 stat info ((r/w) x (hit/miss) x (on buffer/not) x (fullsize/not))> <# of kv pairs> <kv pairs>

Messages
--------

You can tune up writeboost via the message interface.

* barrier_deadline_ms (ms)

Default: 3
All the bios with barrier flags like REQ_FUA or REQ_FLUSH are guaranteed to be acked within this deadline.

* allow_migrate (bool)

Default: 1
Set to 1 to start migration.

* enable_migration_modulator (bool) and migrate_threshold (%)

Default: 1
Set to 1 to run the migration modulator. The migration modulator surveils the load of the backing store and enables migration when the load is lower than migrate_threshold.
* nr_max_batched_migration (int)

Default: 1
Number of segments to migrate simultaneously and atomically. Set a higher value to fully exploit the capacity of the backing store.

* sync_interval (sec)

Default: 60
All the dirty writes are guaranteed to be persistent by this interval.

* update_record_interval (sec)

Default: 60
The superblock record is updated every update_record_interval seconds.

Example
=======

dd if=/dev/zero of=${CACHE} bs=512 count=1 oflag=direct
sz=`blockdev --getsize ${BACKING}`
dmsetup create writeboost-vol --table "0 ${sz} writeboost ${BACKING} ${CACHE}"

* 4. TODO

- Rename struct arr.
-- It is like flex_array, but lighter by eliminating the resizability. Maybe bigarray is the next candidate, but I can't judge this alone. I want to reach an agreement on this renaming issue before doing it.
- resume, preresume and postsuspend possibly have to be implemented.
-- But I have no idea at all.
-- Maybe, I should make a research on other
Reworking dm-writeboost [was: Re: staging: Add dm-writeboost]
Mike,

> We don't need to go through staging. If the dm-writeboost target is
> designed well and provides a tangible benefit it doesn't need
> wide-spread users as justification for going in. The users will come if
> it is implemented well.

OK. The benefit of introducing writeboost will be documented. The points will be:

1. READs often hit in the page cache. That's what the page cache is all about. A read cache only caches the rest that the page cache couldn't cache.
2. A backing store in RAID mode is crazily slow in WRITE, especially if it is RAID-5.

No cache software is a silver bullet, but writeboost can fit many situations, I believe.

> Have you looked at how both dm-cache and dm-thinp handle this?
> Userspace takes care to write all zeroes to the start of the metadata
> device before the first use in the kernel.

Treating a zeroed first sector as a sign that formatting is needed sounds nice for writeboost too. It's simple and I like it.

> Could be the log structured nature of writeboost is very different.
> I'll review this closer tomorrow.

I should mention the big design difference between writeboost and dm-cache, to help you understand the nature of writeboost. Writeboost doesn't have a segregated metadata device like dm-cache does. Data and metadata coexist on the same cache device. That is what log-structured means. Data and its relevant metadata are packed into a log segment and written to the cache device atomically, which makes writeboost reliable and fast.

So,

> could be factored out. I haven't yet looked close enough at that aspect
> of writeboost code to know if it could benefit from the existing
> bio-prison code or persistent-data library at all. writeboost would
> obviously need a new space map type, etc.

what makes sense for dm-cache may not make sense for writeboost. At first look, they don't fit the design of writeboost. But I will investigate this functionality further later.

> sounds like a step in the right direction.
> Plus you can share the cache
> by layering multiple linear devices ontop of the dm-writeboost device.

They are theoretically different, but it is actually a trade-off. And it is not a big problem compared to fitting into device-mapper.

> Also managing dm-writeboost devices with lvm2 is a priority, so any
> interface similarities dm-writeboost has with dm-cache will be
> beneficial.

It sounds really good to me. Huge benefit.

Akira

On 9/18/13 5:59 AM, Mike Snitzer wrote:
> On Tue, Sep 17 2013 at 8:43am -0400,
> Akira Hayakawa wrote:
>
>> Hi, Mike
>>
>> There are two designs in my mind
>> regarding formatting the cache.
>>
>> You said
>>> administer the writeboost devices. There is no need for this. Just
>>> have a normal DM target whose .ctr takes care of validation and
>>> determines whether a device needs formatting, etc.
>> which makes me wonder how I format the cache device.
>>
>> There are two choices for formatting the cache and creating a writeboost device,
>> standing on the point of removing the writeboost-mgr that exists in the current
>> design. I will explain them by how the interface will look:
>>
>> (1) dmsetup create myDevice ... "... $backing_path $cache_path"
>> which returns an error if the superblock of the given cache device
>> is invalid and needs formatting.
>> The user then formats the cache device with some userland tool.
>>
>> (2) dmsetup create myDevice ... "... $backing_path $cache_path $do_format"
>> which also returns an error if the superblock of the given cache device
>> is invalid and needs formatting when $do_format is 0.
>> The user then formats the cache device by setting $do_format to 1 and trying
>> again.
>>
>> There are pros and cons in the design tradeoffs:
>> - (i) (1) is simpler. The do_format parameter in (2) doesn't seem sane.
>>   (1) is like the interface of a filesystem, where dmsetup create is
>>   like mounting a filesystem.
>> - (ii) (2) can implement everything in the kernel.
>> It can gather all the information
>> about the superblock in one place: kernel code.
>>
>> Excuse for the current design:
>> - The reason I designed writeboost-mgr is mostly (ii) above.
>>   writeboost-mgr has a message "format_cache_device", and the
>>   writeboost-format-cache userland command kicks the message to format the cache.
>>
>> - writeboost-mgr also has a message "resume_cache"
>>   that validates and builds an in-memory structure according to the cache,
>>   binding it to the given $cache_id,
>>   and the user later runs dmsetup create for the writeboost device with that $cache_id.
>>   However, resuming the cache metadata should be done under .ctr like
>>   dm-cache does,
>>   and creating an LV should not be related to a cache by an external cache_id,
>>   is what I realized by looking at the code of dm-cache, which
>>   calls the dm_cache_metadata_open() routines under .ctr.
>
> Right, any in-core structures should be allocated in .ctr()

>> writeboost-mgr smells of over-engineering but
>> is useful for simplifying the design for the above reasons.
>> Which do you think better?