Re: Plan: journalling fixes for WAPBL
2017-01-02 18:31 GMT+01:00 David Holland: > Well, there's two things going on here. One is the parallelism limit, > which you can work around by having more drives, e.g. in an array. > The other is netbsd's 64k MAXPHYS issue, which is our own problem that > we could put time into. (And in fact, it would be nice to get > tls-maxphys into -8... anyone have time? I don't. sigh) It would be very nice to have this integrated, yes. It won't have a dramatic performance effect, but it's relatively low-hanging fruit, it's already almost done and we should just get rid of this arbitrary system limit. I'd like to look into this, but I won't manage sooner than autumn 2017 - I'd like to first work on FUA/DPO support, and then SATA NCQ. I think those could have a bigger performance impact. I hope I'll have some patches to make FUA available for I/O for SCSI drives by the second half of February, plus the changes for WAPBL to use it. It's slightly difficult to DTRT with FUA on nested drivers like raidframe/cgd/vnd, but maybe we can ignore those for the first iteration :) I'll send a proposal once I figure out the details. Jaromir
Re: Plan: journalling fixes for WAPBL
On Mon, Jan 02, 2017 at 06:08:04PM +, David Holland wrote: > On Mon, Jan 02, 2017 at 01:01:34PM -0500, Thor Lancelot Simon wrote: > > On Mon, Jan 02, 2017 at 05:31:23PM +, David Holland wrote: > > > (from a while back) > > > > > > However, I'm missing something. The I/O queue depths that you need to > > > get peak write performance from SSDs are larger than 31, and the test > > > labs appear to have been able to do this with SATA-attached SSDs... > > > what are/were they doing? > > > > Aggressive prefetching, extreme efforts to reduce command latency at > > the drive end of the SATA link (and higher link speeds), plus much > > larger request sizes than we can issue. > > Yes, but I mean testing with queue depths > 31, like ~100, which I'm > sure I remember seeing. But maybe I'm wrong... obviously I should go > rake up some links, maybe later. The tests could have been run with RAID controllers that present a SCSI interface to the host. These often support very deep queues both for the virtual targets and at the adapter (channel) itself, at which point it's all about minimizing latency again on the controller's side of the interaction, where it really _is_ SATA with a limited queue depth. If you want a large number of SATA targets in one box you are likely using a RAID controller even if you're just using it in JBOD mode. That makes every SATA target look like a SCSI target. Thor
Re: Plan: journalling fixes for WAPBL
On Mon, Jan 02, 2017 at 01:01:34PM -0500, Thor Lancelot Simon wrote: > On Mon, Jan 02, 2017 at 05:31:23PM +, David Holland wrote: > > (from a while back) > > > > However, I'm missing something. The I/O queue depths that you need to > > get peak write performance from SSDs are larger than 31, and the test > > labs appear to have been able to do this with SATA-attached SSDs... > > what are/were they doing? > > Aggressive prefetching, extreme efforts to reduce command latency at > the drive end of the SATA link (and higher link speeds), plus much > larger request sizes than we can issue. Yes, but I mean testing with queue depths > 31, like ~100, which I'm sure I remember seeing. But maybe I'm wrong... obviously I should go rake up some links, maybe later. -- David A. Holland dholl...@netbsd.org
Re: Plan: journalling fixes for WAPBL
On Mon, Jan 02, 2017 at 05:31:23PM +, David Holland wrote: > (from a while back) > > However, I'm missing something. The I/O queue depths that you need to > get peak write performance from SSDs are larger than 31, and the test > labs appear to have been able to do this with SATA-attached SSDs... > what are/were they doing? Aggressive prefetching, extreme efforts to reduce command latency at the drive end of the SATA link (and higher link speeds), plus much larger request sizes than we can issue. -- Thor Lancelot Simon t...@panix.com Ring the bells that still can ring.
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 08:51:30AM -0600, Warner Losh wrote: > [*] There is an NCQ version of TRIM, but it requires the AUX register > to be sent and very few sata hosts controllers support that (though > AHCI does, many of the LSI controllers don't in any performant way). I (somewhat idly) wonder if this is why we currently have TRIM that works on ahcisata but not on several other controllers... (PR 51756 for siisata, PR 47455 for piixide, maybe more) -- David A. Holland dholl...@netbsd.org
Re: Plan: journalling fixes for WAPBL
(from a while back) On Wed, Sep 28, 2016 at 02:27:39PM +, paul.kon...@dell.com wrote: > > On Sep 28, 2016, at 7:22 AM, Jaromír Doleček > > wrote: > > I think it's a fair assessment to say that on SATA with NCQ/31 tags (max > > is actually 31, not 32 tags), it's pretty much impossible to have > > acceptable write performance without using write cache. We could never > > saturate even a drive with 16MB cache with just 31 tags and 64k maxphys. > > So it's IMO not useful to design for a world without disk drive write > > cache. > > I think that depends on the software. In a SAN storage array I > work on, we used to use SATA drives, always with cache disabled to > avoid data loss due to power failure. We had them running just > fine with NCQ. (For that matter, even without NCQ, though that > takes major effort.) Well, there's two things going on here. One is the parallelism limit, which you can work around by having more drives, e.g. in an array. The other is netbsd's 64k MAXPHYS issue, which is our own problem that we could put time into. (And in fact, it would be nice to get tls-maxphys into -8... anyone have time? I don't. sigh) However, I'm missing something. The I/O queue depths that you need to get peak write performance from SSDs are larger than 31, and the test labs appear to have been able to do this with SATA-attached SSDs... what are/were they doing? -- David A. Holland dholl...@netbsd.org
Re: Plan: journalling fixes for WAPBL
> On Sep 28, 2016, at 7:22 AM, Jaromír Doleček > wrote: > > I think it's a fair assessment to say that on SATA with NCQ/31 tags (max > is actually 31, not 32 tags), it's pretty much impossible to have > acceptable write performance without using write cache. We could never > saturate even a drive with 16MB cache with just 31 tags and 64k maxphys. > So it's IMO not useful to design for a world without disk drive write > cache. I think that depends on the software. In a SAN storage array I work on, we used to use SATA drives, always with cache disabled to avoid data loss due to power failure. We had them running just fine with NCQ. (For that matter, even without NCQ, though that takes major effort.) So perhaps an optimization effort is called for, if people view this performance issue as worth the trouble. Or you might decide that for performance SAS is the answer, and SATA is only for non-critical applications. paul
Re: Plan: journalling fixes for WAPBL
I think it's a fair assessment to say that on SATA with NCQ/31 tags (max is actually 31, not 32 tags), it's pretty much impossible to have acceptable write performance without using write cache. We could never saturate even a drive with 16MB cache with just 31 tags and 64k maxphys. So it's IMO not useful to design for a world without disk drive write cache. Back to the discussion about B_ORDERED: As was said before, the SCSI ORDERED tag does precisely what we want for the journal commit record - it forces all previous commands sent to the controller to be finished before the one with the ORDERED tag is processed, and any commands sent after the ORDERED-tagged one are executed only after the previous ordered command is finished. No need for any bufq magic there, which is wonderful. Too bad that NCQ doesn't provide this. That said, we still need to be sure that all the previous commands were sent prior to pushing the ORDERED command to the SCSI controller. Are there any SCSI controllers with multiple submission queues (like NVMe), regardless of our scsipi layer MP limitations? FWIW AHCI is single-threaded by design, every command submission has to write to the same set of registers. Jaromir 2016-09-23 19:51 GMT+02:00 Manuel Bouyer: > On Fri, Sep 23, 2016 at 01:46:09PM -0400, Thor Lancelot Simon wrote: >> > > This seems like the key thing needed to avoid FUA: to implement fsync() >> > > you just wait for notifications of completion to be received, and once >> > > you have those for all requests pending when fsync was called, or >> > > started as part of the fsync, then you're done. >> > >> > *if you have the write cache disabled* >> >> *Running with the write cache enabled is a bad idea* > > On ATA devices, you can't permanently disable the write cache. You have > to do it on every power cycle. > > Well this really needs to be carefully evaluated. With only 32 tags I'm not > sure you can efficiently use recent devices with the write cache > disabled (most enterprise disks have a 64M cache these days) > > -- > Manuel Bouyer > NetBSD: 26 ans d'experience feront toujours la difference > --
Re: Plan: journalling fixes for WAPBL
On Sat, Sep 24, 2016 at 2:01 AM, David Holland wrote: > On Fri, Sep 23, 2016 at 07:51:32PM +0200, Manuel Bouyer wrote: > > > > *if you have the write cache disabled* > > > > > > *Running with the write cache enabled is a bad idea* > > > > On ATA devices, you can't permanently disable the write cache. You have > > to do it on every power cycle. > > There are also drives that ignore attempts to turn off write caching. These drives lie to the host and say that caching is off, when it really is still on, right? Warner
Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)
On Fri, Sep 23, 2016 at 01:02:26PM +, paul.kon...@dell.com wrote: > > > On Sep 23, 2016, at 5:49 AM, Edgar Fuß wrote: > > > >> The whole point of tagged queueing is to let you *not* set [the write > >> cache] bit in the mode pages and still get good performance. > > I don't get that. My understanding was that TCQ allowed the drive to > > re-order > > commands within the bounds described by the tags. With the write cache > > disabled, all write commands must hit stable storage before being reported > > completed. So what's the point of tagging with caching disabled? > > I'm not sure. But I have the impression that in the real world tagging is > rarely, if ever, used. I'm not sure what you mean. Do you mean that tagging is rarely, if ever, used _to establish write barriers_, or do you mean that tagging is rarely, if ever used, period? If the latter, you're way, way wrong. Thor
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 07:51:32PM +0200, Manuel Bouyer wrote: > > > *if you have the write cache disabled* > > > > *Running with the write cache enabled is a bad idea* > > On ATA devices, you can't permanently disable the write cache. You have > to do it on every power cycle. There are also drives that ignore attempts to turn off write caching. -- David A. Holland dholl...@netbsd.org
Re: Plan: journalling fixes for WAPBL
> On Sep 23, 2016, at 10:51 AM, Warner Losh wrote: > > On Fri, Sep 23, 2016 at 7:38 AM, Thor Lancelot Simon wrote: >> On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote: >>> On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote: > AFAIK ordered tags only guarantee that the write will happen in order, > but not that the writes are actually done to stable storage. The target's not allowed to report the command complete unless the data are on stable storage, except if you have write cache enable set in the relevant mode page. If you run SCSI drives like that, you're playing with fire. Expect to get burned. The whole point of tagged queueing is to let you *not* set that bit in the mode pages and still get good performance. >>> >>> Now I remember that I did indeed disable disk write cache when I had >>> scsi disks in production. It's been a while though. >>> >>> But anyway, from what I remember you still need the disk cache flush >>> operation for SATA, even with NCQ. It's not equivalent to the SCSI tags
Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)
> On Sep 23, 2016, at 5:49 AM, Edgar Fuß wrote: > >> The whole point of tagged queueing is to let you *not* set [the write >> cache] bit in the mode pages and still get good performance. > I don't get that. My understanding was that TCQ allowed the drive to re-order > commands within the bounds described by the tags. With the write cache > disabled, all write commands must hit stable storage before being reported > completed. So what's the point of tagging with caching disabled? I'm not sure. But I have the impression that in the real world tagging is rarely, if ever, used. paul
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 11:54 AM, Warner Losh wrote: > On Fri, Sep 23, 2016 at 11:20 AM, Thor Lancelot Simon wrote: >> On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote: >>> On September 23, 2016 10:51:30 AM EDT, Warner Losh wrote: >>> >All NCQ gives you is the ability to schedule multiple requests and >>> >to get notification of their completion (perhaps out of order). There's >>> >no coherency features at all in NCQ. >>> >>> This seems like the key thing needed to avoid FUA: to implement fsync() you >>> just wait for notifications of completion to be received, and once you have >>> those for all requests pending when fsync was called, or started as part of >>> the fsync, then you're done. >> >> The other key point is that -- unless SATA NCQ is radically different from >> SCSI tagged queuing in a particularly stupid way -- the rules require all >> "simple" tags to be completed before any "ordered" tag is completed. That >> is, >> ordered tags are barriers against all simple tags. > > SATA NCQ doesn't have ordered tags. There's just 32 slots to send > requests into. Don't allow the word 'tag' to confuse you into thinking > it is anything at all like SCSI tags. You get ordering by not > scheduling anything until after the queue has drained when you send > your "ordered" command. It is that stupid. And it can be even worse, since if the 'ordered' item must complete after all before it, you have to drain the queue before you can even send it to the drive. Depends on what the ordering guarantees you want are... Warner
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 11:20 AM, Thor Lancelot Simon wrote: > On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote: >> On September 23, 2016 10:51:30 AM EDT, Warner Losh wrote: >> >All NCQ gives you is the ability to schedule multiple requests and >> >to get notification of their completion (perhaps out of order). There's >> >no coherency features at all in NCQ. >> >> This seems like the key thing needed to avoid FUA: to implement fsync() you >> just wait for notifications of completion to be received, and once you have >> those for all requests pending when fsync was called, or started as part of >> the fsync, then you're done. > > The other key point is that -- unless SATA NCQ is radically different from > SCSI tagged queuing in a particularly stupid way -- the rules require all > "simple" tags to be completed before any "ordered" tag is completed. That is, > ordered tags are barriers against all simple tags. SATA NCQ doesn't have ordered tags. There's just 32 slots to send requests into. Don't allow the word 'tag' to confuse you into thinking it is anything at all like SCSI tags. You get ordering by not scheduling anything until after the queue has drained when you send your "ordered" command. It is that stupid. Warner
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 01:46:09PM -0400, Thor Lancelot Simon wrote: > > > This seems like the key thing needed to avoid FUA: to implement fsync() > > > you just wait for notifications of completion to be received, and once > > > you have those for all requests pending when fsync was called, or started > > > as part of the fsync, then you're done. > > > > *if you have the write cache disabled* > > *Running with the write cache enabled is a bad idea* On ATA devices, you can't permanently disable the write cache. You have to do it on every power cycle. Well this really needs to be carefully evaluated. With only 32 tags I'm not sure you can efficiently use recent devices with the write cache disabled (most enterprise disks have a 64M cache these days) -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 07:45:00PM +0200, Manuel Bouyer wrote: > On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote: > > On September 23, 2016 10:51:30 AM EDT, Warner Losh wrote: > > >All NCQ gives you is the ability to schedule multiple requests and > > >to get notification of their completion (perhaps out of order). There's > > >no coherency features at all in NCQ. > > > > This seems like the key thing needed to avoid FUA: to implement fsync() you > > just wait for notifications of completion to be received, and once you have > > those for all requests pending when fsync was called, or started as part of > > the fsync, then you're done. > > *if you have the write cache disabled* *Running with the write cache enabled is a bad idea*
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 01:20:09PM -0400, Thor Lancelot Simon wrote: > On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote: > > On September 23, 2016 10:51:30 AM EDT, Warner Losh wrote: > > >All NCQ gives you is the ability to schedule multiple requests and > > >to get notification of their completion (perhaps out of order). There's > > >no coherency features at all in NCQ. > > > > This seems like the key thing needed to avoid FUA: to implement fsync() you > > just wait for notifications of completion to be received, and once you have > > those for all requests pending when fsync was called, or started as part of > > the fsync, then you're done. > > The other key point is that -- unless SATA NCQ is radically different from > SCSI tagged queuing in a particularly stupid way -- the rules require all > "simple" tags to be completed before any "ordered" tag is completed. That is, > ordered tags are barriers against all simple tags. If I remember properly, there are only simple tags in ATA. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote: > On September 23, 2016 10:51:30 AM EDT, Warner Losh wrote: > >All NCQ gives you is the ability to schedule multiple requests and > >to get notification of their completion (perhaps out of order). There's > >no coherency features at all in NCQ. > > This seems like the key thing needed to avoid FUA: to implement fsync() you > just wait for notifications of completion to be received, and once you have > those for all requests pending when fsync was called, or started as part of > the fsync, then you're done. *if you have the write cache disabled* -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Plan: journalling fixes for WAPBL
On September 23, 2016 10:51:30 AM EDT, Warner Losh wrote: >All NCQ gives you is the ability to schedule multiple requests and >to get notification of their completion (perhaps out of order). There's >no coherency features at all in NCQ. This seems like the key thing needed to avoid FUA: to implement fsync() you just wait for notifications of completion to be received, and once you have those for all requests pending when fsync was called, or started as part of the fsync, then you're done. Eric
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 09:38:44AM -0400, Thor Lancelot Simon wrote: > > But anyway, from what I remember you still need the disk cache flush > > operation for SATA, even with NCQ. It's not equivalent to the SCSI tags. > > I think that's true only if you're running with write cache enabled; but > the difference is that most ATA disks ship with it turned on by default. all of them have it turned on by default, and you can't permanently disable it (you have to turn it off after each reset) > > With an aggressive implementation of tag management on the host side, > there should be no performance benefit from unconditionally enabling > the write cache -- all the available cache should be used to stage > writes for pending tags. Sometimes it works. With ATA you have only 32 tags ... -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 7:38 AM, Thor Lancelot Simon wrote: > On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote: >> On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote: >> > > AFAIK ordered tags only guarantee that the write will happen in order, >> > > but not that the writes are actually done to stable storage. >> > >> > The target's not allowed to report the command complete unless the data >> > are on stable storage, except if you have write cache enable set in the >> > relevant mode page. >> > >> > If you run SCSI drives like that, you're playing with fire. Expect to get >> > burned. The whole point of tagged queueing is to let you *not* set that >> > bit in the mode pages and still get good performance. >> >> Now I remember that I did indeed disable disk write cache when I had >> scsi disks in production. It's been a while though. >> >> But anyway, from what I remember you still need the disk cache flush >> operation for SATA, even with NCQ. It's not equivalent to the SCSI tags. All NCQ gives you is the ability to schedule multiple requests and to get notification of their completion (perhaps out of order). There are no coherency features at all in NCQ. > I think that's true only if you're running with write cache enabled; but > the difference is that most ATA disks ship with it turned on by default. > > With an aggressive implementation of tag management on the host side, > there should be no performance benefit from unconditionally enabling > the write cache -- all the available cache should be used to stage > writes for pending tags. Sometimes it works. You don't need to flush all the writes, but do need to take special care if you need more coherent semantics, which often is a small minority of the writes, so I would agree the effect can be mostly mitigated. Not completely, since any coherency point has to drain the queue completely. The cache drain ops are non-NCQ, and to send non-NCQ requests no NCQ requests can be pending. TRIM[*] commands are the same way. Warner [*] There is an NCQ version of TRIM, but it requires the AUX register to be sent and very few SATA host controllers support that (though AHCI does, many of the LSI controllers don't in any performant way).
Re: Plan: journalling fixes for WAPBL
On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote: > On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote: > > > AFAIK ordered tags only guarantee that the write will happen in order, > > > but not that the writes are actually done to stable storage. > > > > The target's not allowed to report the command complete unless the data > > are on stable storage, except if you have write cache enable set in the > > relevant mode page. > > > > If you run SCSI drives like that, you're playing with fire. Expect to get > > burned. The whole point of tagged queueing is to let you *not* set that > > bit in the mode pages and still get good performance. > > Now I remember that I did indeed disable disk write cache when I had > scsi disks in production. It's been a while though. > > But anyway, from what I remember you still need the disk cache flush > operation for SATA, even with NCQ. It's not equivalent to the SCSI tags. I think that's true only if you're running with write cache enabled; but the difference is that most ATA disks ship with it turned on by default. With an aggressive implementation of tag management on the host side, there should be no performance benefit from unconditionally enabling the write cache -- all the available cache should be used to stage writes for pending tags. Sometimes it works. -- Thor Lancelot Simon t...@panix.com "The dirtiest word in art is the C-word. I can't even say 'craft' without feeling dirty." -Chuck Close
Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)
On Fri, Sep 23, 2016 at 11:49:50AM +0200, Edgar Fuß wrote: > > The whole point of tagged queueing is to let you *not* set [the write > > cache] bit in the mode pages and still get good performance. > > I don't get that. My understanding was that TCQ allowed the drive > to re-order commands within the bounds described by the tags. With > the write cache disabled, all write commands must hit stable > storage before being reported completed. So what's the point of > tagging with caching disabled? You can have more than one in flight at a time. Typically the more you can manage to have pending at once, the better the performance, especially with SSDs. -- David A. Holland dholl...@netbsd.org
FUA and TCQ (was: Plan: journalling fixes for WAPBL)
> The whole point of tagged queueing is to let you *not* set [the write > cache] bit in the mode pages and still get good performance. I don't get that. My understanding was that TCQ allowed the drive to re-order commands within the bounds described by the tags. With the write cache disabled, all write commands must hit stable storage before being reported completed. So what's the point of tagging with caching disabled?
Re: Plan: journalling fixes for WAPBL
On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote: > > AFAIK ordered tags only guarantee that the write will happen in order, > > but not that the writes are actually done to stable storage. > > The target's not allowed to report the command complete unless the data > are on stable storage, except if you have write cache enable set in the > relevant mode page. > > If you run SCSI drives like that, you're playing with fire. Expect to get > burned. The whole point of tagged queueing is to let you *not* set that > bit in the mode pages and still get good performance. Now I remember that I did indeed disable disk write cache when I had scsi disks in production. It's been a while though. But anyway, from what I remember you still need the disk cache flush operation for SATA, even with NCQ. It's not equivalent to the SCSI tags. -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Plan: journalling fixes for WAPBL
On Thu, Sep 22, 2016 at 07:57:00AM +0800, Paul Goyette wrote: > While not particularly part of wapbl itself, I would like to see its > callers (ie, lfs) be more modular! lfs is not related to wapbl, or even (now) ufs. > Currently, ffs (whether built-in or modular) has to be built with OPTIONS > WAPBL enabled in order to use wapbl. And the ffs module has to "require" > the wapbl module. This is because there is allegedly-filesystem-independent wapbl code that was thought to maybe be reusable for additional block-journaling implementations, e.g. ext3. I have always had doubts about this and it hasn't panned out so far. -- David A. Holland dholl...@netbsd.org
Re: Plan: journalling fixes for WAPBL
On Thu, Sep 22, 2016 at 04:06:55PM +0200, Manuel Bouyer wrote: > On Thu, Sep 22, 2016 at 07:50:27AM -0400, Thor Lancelot Simon wrote: > > On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote: > > > > > > 3.2 use FUA (Force Unit Access) for commit record write > > > This avoids the need to issue even the second DIOCCACHESYNC, as flushing > > > the disk cache is not really all that useful, I like the thread over > > > at: > > > http://yarchive.net/comp/linux/drive_caches.html > > > Slightly less controversially, this would allow the rest of the > > > journal records to be written asynchronously, leaving them to execute > > > even after commit if so desired. It may be useful to have this > > > behaviour optional. I lean towards skipping the disk cache flush as > > > default behaviour however, if we implement write barrier for the > > > commit record (see below). > > > WAPBL would need to deal with drives without FUA, i.e. fall back to cache > > > flush. > > > > I have never understood this business about needing FUA to implement > > barriers. AFAICT, for any SCSI or SCSI-like disk device, all that is > > actually needed is to do standard writes with simple tags, and barrier > > writes with ordered tags. What am I missing? > > AFAIK ordered tags only guarantee that the write will happen in order, > but not that the writes are actually done to stable storage. The target's not allowed to report the command complete unless the data are on stable storage, except if you have write cache enable set in the relevant mode page. If you run SCSI drives like that, you're playing with fire. Expect to get burned. The whole point of tagged queueing is to let you *not* set that bit in the mode pages and still get good performance. Thor
Re: Plan: journalling fixes for WAPBL
On Thu, Sep 22, 2016 at 07:50:27AM -0400, Thor Lancelot Simon wrote: > On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote: > > > > 3.2 use FUA (Force Unit Access) for commit record write > > This avoids the need to issue even the second DIOCCACHESYNC, as flushing > > the disk cache is not really all that useful, I like the thread over > > at: > > http://yarchive.net/comp/linux/drive_caches.html > > Slightly less controversially, this would allow the rest of the > > journal records to be written asynchronously, leaving them to execute > > even after commit if so desired. It may be useful to have this > > behaviour optional. I lean towards skipping the disk cache flush as > > default behaviour however, if we implement write barrier for the > > commit record (see below). > > WAPBL would need to deal with drives without FUA, i.e. fall back to cache > > flush. > > I have never understood this business about needing FUA to implement > barriers. AFAICT, for any SCSI or SCSI-like disk device, all that is > actually needed is to do standard writes with simple tags, and barrier > writes with ordered tags. What am I missing? AFAIK ordered tags only guarantee that the write will happen in order, but not that the writes are actually done to stable storage. If you get a fsync() from userland, you have to do a cache flush (or FUA). -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Plan: journalling fixes for WAPBL
On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote: > > 3.2 use FUA (Force Unit Access) for commit record write > This avoids the need to issue even the second DIOCCACHESYNC, as flushing > the disk cache is not really all that useful, I like the thread over > at: > http://yarchive.net/comp/linux/drive_caches.html > Slightly less controversially, this would allow the rest of the > journal records to be written asynchronously, leaving them to execute > even after commit if so desired. It may be useful to have this > behaviour optional. I lean towards skipping the disk cache flush as > default behaviour however, if we implement write barrier for the > commit record (see below). > WAPBL would need to deal with drives without FUA, i.e. fall back to cache > flush. I have never understood this business about needing FUA to implement barriers. AFAICT, for any SCSI or SCSI-like disk device, all that is actually needed is to do standard writes with simple tags, and barrier writes with ordered tags. What am I missing? I must have proposed adding a B_BARRIER or B_ORDERED at least five times over the years. There are always objections... Thor
Re: Plan: journalling fixes for WAPBL
Date: Wed, 21 Sep 2016 17:06:18 -0700 From: Brian Buhrow hello. Does this discussion imply that the WAPBL log/journaling function is broken in NetBSD-current? Are we back to straight FFS as it was before the days of WAPBL or softdep? Please tell me I'm mistaken about this. If so, that's quite a regression, even from NetBSD-5 where both WAPBL log and softdep work quite well. It is no more broken than it was in netbsd-5.
Re: Plan: journalling fixes for WAPBL
hello. Does this discussion imply that the WAPBL log/journaling
function is broken in NetBSD-current? Are we back to straight FFS as it
was before the days of WAPBL or softdep? Please tell me I'm mistaken
about this. If so, that's quite a regression, even from NetBSD-5 where
both WAPBL log and softdep work quite well.

-thanks
-Brian
Re: Plan: journalling fixes for WAPBL
I think adding 2.2 (cg stuff) would also be important to include for
re-enabling by default.

Also consider: while not particularly part of wapbl itself, I would
like to see its callers (i.e., ffs) be more modular! Currently, ffs
(whether built-in or modular) has to be built with OPTIONS WAPBL
enabled in order to use wapbl. And the ffs module has to "require" the
wapbl module.

It would be desirable (at least for me) if ffs (and any future users of
wapbl) could auto-load the wapbl module whenever it is needed. I.e., if
an existing log-enabled file-system is mounted (or if a new log needs
to be created), and possibly also when an existing log needs to be
removed, after a 'tunefs -l 0'.

This is probably beyond what you expected to do, but I just thought to
"throw it out there" to get it on everyone's radar screens.  :)

On Thu, 22 Sep 2016, Jaromír Doleček wrote:

Hi,

I've been poking around in the WAPBL sources and some of the email
threads, and also read the doc/roadmaps comments, so I'm aware of some
of the sentiment. I think it would still be useful to get WAPBL safe to
enable by default again in NetBSD. Neither lfs64 nor the Harvard
journalling fs is currently in tree, so it's unknown when they would be
stable enough to replace ffs by default.

Also, I think that it is useful to keep some kind of generic[*]
journalling code, perhaps for use also for ext2fs or maybe xfs one day.
In either case, IMO it is good to also do some generic system
improvements usable by any journalling solution.

I see the following groups of useful changes. Reasonably for the -8
timeframe, IMO only group one really needs to be resolved to safely
enable wapbl journalling by default.

1. critical fixes for WAPBL
2. less critical fixes for WAPBL
3. performance improvements for WAPBL
4. disk subsystem and journalling-related improvements

1.
Critical fixes for WAPBL

1.1 kern/47146 kernel panic when many files are unlinked
1.2 kern/50725 discard handling
1.3 kern/49175 degenerate truncate() case - too embarrassing to leave in

2. Less critical fixes for WAPBL

2.1 kern/45676 flush semantics

2.2 (no PR) make group descriptor updates part of the change transaction
The transaction which changed the group descriptor should also contain
the cg block write. Now the group descriptor blocks are written to disk
during filesystem sync via a separate transaction, so it's quite
frequent that they do not survive a crash if it happens before the
sync. Normally fsck fixes these easily using inode metadata, but fsck
is skipped for journalled filesystems. This IMO can lead to incorrect
block allocation until fsck is actually run.

2.3 file data leaks on crashes
File data content blocks are written asynchronously; some of them can
make it to the disk before the journal is committed, hence blocks can
end up in a different file after a system crash. FFS always had this,
even with softdep, albeit more limited.

2.4 buffer blocks kept in memory until commit
Buffer cache bufs are kept in memory with the B_LOCKED flag by wapbl,
starving the buffer cache subsystem.

3. WAPBL performance fixes

3.1 checksum journal data for commit
Avoid one of the two DIOCCACHESYNCs by computing a checksum over the
data and storing it in the commit record; there is even a field for it
already, so it's a matter of implementation. There is however maybe a
CPU use concern. The crc32c hash is a good candidate; do we need to
have hash alternatives? This seems to be reasonably simple to
implement, it needs just some hooks into the journal writes and journal
replay logic.
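[Editor's note] The checksum proposed in 3.1 would presumably be CRC-32C (Castagnoli). A minimal bitwise sketch follows, using the reflected polynomial 0x82F63B78; a kernel version would be table-driven or hardware-assisted, and the signature here is made up for illustration, not a WAPBL hook.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow but tiny; shows what a commit-record checksum over the
 * journal data blocks would compute. Call with crc = 0 to start.
 */
uint32_t
crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return ~crc;
}
```

On replay, the journal blocks would be read back, checksummed the same way, and compared against the value in the commit record before the transaction is applied.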
3.2 use FUA (Force Unit Access) for commit record write
This avoids the need to issue even the second DIOCCACHESYNC, as
flushing the disk cache is not really all that useful; I like the
thread over at:
http://yarchive.net/comp/linux/drive_caches.html
Slightly less controversially, this would allow the rest of the journal
records to be written asynchronously, leaving them to execute even
after commit if so desired. It may be useful to have this behaviour
optional. I lean towards skipping the disk cache flush as the default
behaviour however, if we implement a write barrier for the commit
record (see below).
WAPBL would need to deal with drives without FUA, i.e. fall back to
cache flush.

3.3 async, or 'group sync' writes
Submit all the journal block writes to the drive at once, instead of
writing the blocks synchronously one by one. We could even have the
journal block writes completely async if we have the commit record
checksum. Implementing the 'group sync' write would be quite simple;
making it fully async is more difficult and actually not very useful
for journalling, since the commit would force those writes to the disk
drive anyway if it's a write barrier (see below).

4. disk subsystem and journalling-related improvements

4.1 write barriers
The current DIOCCACHESYNC has a problem in that it could quite easily
be I/O starved if the drive is very loaded. Normally, the drive
firmware flushes the disk buffer very soon (i.e. in the region of
milliseconds, i.e. when it has a full track of data),
Plan: journalling fixes for WAPBL
Hi,

I've been poking around in the WAPBL sources and some of the email
threads, and also read the doc/roadmaps comments, so I'm aware of some
of the sentiment. I think it would still be useful to get WAPBL safe to
enable by default again in NetBSD. Neither lfs64 nor the Harvard
journalling fs is currently in tree, so it's unknown when they would be
stable enough to replace ffs by default.

Also, I think that it is useful to keep some kind of generic[*]
journalling code, perhaps for use also for ext2fs or maybe xfs one day.
In either case, IMO it is good to also do some generic system
improvements usable by any journalling solution.

I see the following groups of useful changes. Reasonably for the -8
timeframe, IMO only group one really needs to be resolved to safely
enable wapbl journalling by default.

1. critical fixes for WAPBL
2. less critical fixes for WAPBL
3. performance improvements for WAPBL
4. disk subsystem and journalling-related improvements

1. Critical fixes for WAPBL

1.1 kern/47146 kernel panic when many files are unlinked
1.2 kern/50725 discard handling
1.3 kern/49175 degenerate truncate() case - too embarrassing to leave in

2. Less critical fixes for WAPBL

2.1 kern/45676 flush semantics

2.2 (no PR) make group descriptor updates part of the change transaction
The transaction which changed the group descriptor should also contain
the cg block write. Now the group descriptor blocks are written to disk
during filesystem sync via a separate transaction, so it's quite
frequent that they do not survive a crash if it happens before the
sync. Normally fsck fixes these easily using inode metadata, but fsck
is skipped for journalled filesystems. This IMO can lead to incorrect
block allocation until fsck is actually run.

2.3 file data leaks on crashes
File data content blocks are written asynchronously; some of them can
make it to the disk before the journal is committed, hence blocks can
end up in a different file after a system crash. FFS always had this,
even with softdep, albeit more limited.
2.4 buffer blocks kept in memory until commit
Buffer cache bufs are kept in memory with the B_LOCKED flag by wapbl,
starving the buffer cache subsystem.

3. WAPBL performance fixes

3.1 checksum journal data for commit
Avoid one of the two DIOCCACHESYNCs by computing a checksum over the
data and storing it in the commit record; there is even a field for it
already, so it's a matter of implementation. There is however maybe a
CPU use concern. The crc32c hash is a good candidate; do we need to
have hash alternatives? This seems to be reasonably simple to
implement, it needs just some hooks into the journal writes and journal
replay logic.

3.2 use FUA (Force Unit Access) for commit record write
This avoids the need to issue even the second DIOCCACHESYNC, as
flushing the disk cache is not really all that useful; I like the
thread over at:
http://yarchive.net/comp/linux/drive_caches.html
Slightly less controversially, this would allow the rest of the journal
records to be written asynchronously, leaving them to execute even
after commit if so desired. It may be useful to have this behaviour
optional. I lean towards skipping the disk cache flush as the default
behaviour however, if we implement a write barrier for the commit
record (see below).
WAPBL would need to deal with drives without FUA, i.e. fall back to
cache flush.

3.3 async, or 'group sync' writes
Submit all the journal block writes to the drive at once, instead of
writing the blocks synchronously one by one. We could even have the
journal block writes completely async if we have the commit record
checksum. Implementing the 'group sync' write would be quite simple;
making it fully async is more difficult and actually not very useful
for journalling, since the commit would force those writes to the disk
drive anyway if it's a write barrier (see below).

4. disk subsystem and journalling-related improvements

4.1 write barriers
The current DIOCCACHESYNC has a problem in that it could quite easily
be I/O starved if the drive is very loaded.
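[Editor's note] The 'group sync' write of 3.3 amounts to two passes (submit everything, then wait once) instead of a synchronous write-and-wait per block. A toy user-space model follows, where submit/await_io/commit merely record the order of operations; all names are stand-ins, not WAPBL functions.

```c
#include <string.h>

/* Trace of operations: 'S' = submit, 'W' = wait, 'C' = commit. */
static char trace[32];
static int tpos;

static void submit(int blk)   { (void)blk; trace[tpos++] = 'S'; }
static void await_io(int blk) { (void)blk; trace[tpos++] = 'W'; }
static void commit(void)      { trace[tpos++] = 'C'; }

/*
 * Group-sync journal write: issue every journal block write first
 * (asynchronously), then wait once for all of them, and only then
 * write the commit record. Contrast with the current scheme, which
 * interleaves submit/wait pairs: SWSWSW...C.
 */
void
group_sync_write(int nblk)
{
	for (int i = 0; i < nblk; i++)
		submit(i);		/* no wait here */
	for (int i = 0; i < nblk; i++)
		await_io(i);		/* single wait pass */
	commit();			/* only after all blocks are down */
}
```

The point of the model: the drive sees all nblk journal writes at once and can schedule them together, while the commit record still strictly follows them.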
Normally, the drive firmware flushes the disk buffer very soon (i.e. in
the region of milliseconds, i.e. when it has a full track of data), but
concurrent disk activity might prevent it from doing so soon enough. A
more serious NetBSD kernel problem however is that DIOCCACHESYNC
bypasses bufq, so if there are any queued writes, DIOCCACHESYNC sends
the command to the disk before those writes are sent to the drive.

In order to avoid both of these, it would be good to have a way to mark
a buf as a barrier. bufq and/or disk routines would be changed to drain
the write queue before the barrier write is sent to the drive, and any
later writes would wait until the barrier write completes. On sane
hardware like SCSI/SAS, this could be almost completely offloaded to
the controller by just using ORDERED tags, without the need to drain
the queue. This would be semi-hard to implement, especially if it would
require changes to disk drivers.

4.2 scsipi default to ORDERED tags, change to SIMPLE
From a quick scsipi_base.c inspection, it seems we use ordered tag
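[Editor's note] The draining behaviour described in 4.1 can be modelled in user space as planning dispatch "waves": everything queued ahead of a barrier must complete first, the barrier is issued alone, and writes queued after it form a later wave. B_BARRIER below is a hypothetical buf flag, and the whole function is a sketch of the ordering rule, not bufq code.

```c
#include <stddef.h>

#define B_BARRIER 0x1		/* hypothetical buf flag, per 4.1 */

struct qbuf { int flags; int id; };

/*
 * Partition a queue of writes into dispatch waves. Each wave must
 * complete before the next is issued: writes ahead of a barrier form
 * one wave, the barrier is a wave of its own, later writes follow.
 * Returns the number of waves; wave w holds wlen[w] ids in waves[w].
 */
int
plan_waves(const struct qbuf *q, int n, int waves[][8], int wlen[])
{
	int w = 0, k = 0;

	for (int i = 0; i < n; i++) {
		if (q[i].flags & B_BARRIER) {
			if (k > 0) {		/* drain everything queued so far */
				wlen[w++] = k;
				k = 0;
			}
			waves[w][0] = q[i].id;	/* barrier dispatched alone */
			wlen[w++] = 1;
		} else {
			waves[w][k++] = q[i].id;
		}
	}
	if (k > 0)
		wlen[w++] = k;
	return w;
}
```

On SCSI/SAS the middle wave would instead be tagged ORDERED and the drain offloaded to the target, which is exactly the simplification 4.1 points at.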