Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread Jaromír Doleček
2017-01-02 18:31 GMT+01:00 David Holland :
> Well, there's two things going on here. One is the parallelism limit,
> which you can work around by having more drives, e.g. in an array.
> The other is netbsd's 64k MAXPHYS issue, which is our own problem that
> we could put time into. (And in fact, it would be nice to get
> tls-maxphys into -8... anyone have time? I don't. sigh)

It would be very nice to have this integrated, yes. It won't have a
dramatic performance effect, but it's relatively low-hanging fruit,
it's already almost done, and we should just get rid of this arbitrary
system limit. I'd like to look into this, but I won't manage sooner
than autumn 2017 - I'd like to first work on FUA/DPO support, and then
SATA NCQ. I think those could have a bigger performance impact.

I hope I'll have some patches to make FUA available for I/O on SCSI
drives by the second half of February, plus the changes for WAPBL to use it.
It's slightly difficult to DTRT with FUA on nested drivers like
raidframe/cgd/vnd, but maybe we can ignore those for the first
iteration :) I'll send a proposal once I figure out the details.

Jaromir


Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread Thor Lancelot Simon
On Mon, Jan 02, 2017 at 06:08:04PM +, David Holland wrote:
> On Mon, Jan 02, 2017 at 01:01:34PM -0500, Thor Lancelot Simon wrote:
>  > On Mon, Jan 02, 2017 at 05:31:23PM +, David Holland wrote:
>  > > (from a while back)
>  > > 
>  > > However, I'm missing something. The I/O queue depths that you need to
>  > > get peak write performance from SSDs are larger than 31, and the test
>  > > labs appear to have been able to do this with SATA-attached SSDs...
>  > > what are/were they doing?
>  > 
>  > Aggressive prefetching, extreme efforts to reduce command latency at
>  > the drive end of the SATA link (and higher link speeds), plus much
>  > larger request sizes than we can issue.
> 
> Yes, but I mean testing with queue depths > 31, like ~100, which I'm
> sure I remember seeing. But maybe I'm wrong... obviously I should go
> rake up some links, maybe later.

The tests could have been run with RAID controllers that present a
SCSI interface to the host.  These often support very deep queues both
for the virtual targets and at the adapter (channel) itself, at which
point it's all about minimizing latency again on the controller's side
of the interaction, where it really _is_ SATA with a limited queue
depth.

If you want a large number of SATA targets in one box you are likely
using a RAID controller even if you're just using it in JBOD mode.  That
makes every SATA target look like a SCSI target.

Thor


Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread David Holland
On Mon, Jan 02, 2017 at 01:01:34PM -0500, Thor Lancelot Simon wrote:
 > On Mon, Jan 02, 2017 at 05:31:23PM +, David Holland wrote:
 > > (from a while back)
 > > 
 > > However, I'm missing something. The I/O queue depths that you need to
 > > get peak write performance from SSDs are larger than 31, and the test
 > > labs appear to have been able to do this with SATA-attached SSDs...
 > > what are/were they doing?
 > 
 > Aggressive prefetching, extreme efforts to reduce command latency at
 > the drive end of the SATA link (and higher link speeds), plus much
 > larger request sizes than we can issue.

Yes, but I mean testing with queue depths > 31, like ~100, which I'm
sure I remember seeing. But maybe I'm wrong... obviously I should go
rake up some links, maybe later.

-- 
David A. Holland
dholl...@netbsd.org


Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread Thor Lancelot Simon
On Mon, Jan 02, 2017 at 05:31:23PM +, David Holland wrote:
> (from a while back)
> 
> However, I'm missing something. The I/O queue depths that you need to
> get peak write performance from SSDs are larger than 31, and the test
> labs appear to have been able to do this with SATA-attached SSDs...
> what are/were they doing?

Aggressive prefetching, extreme efforts to reduce command latency at
the drive end of the SATA link (and higher link speeds), plus much
larger request sizes than we can issue.

-- 
 Thor Lancelot Simon  t...@panix.com

Ring the bells that still can ring.


Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread David Holland
On Fri, Sep 23, 2016 at 08:51:30AM -0600, Warner Losh wrote:
 > [*] There is an NCQ version of TRIM, but it requires the AUX register
 > to be sent and very few SATA host controllers support that (though
 > AHCI does, many of the LSI controllers don't in any performant way).

I (somewhat idly) wonder if this is why we currently have TRIM that
works on ahcisata but not on several other controllers...

(PR 51756 for siisata, PR 47455 for piixide, maybe more)

-- 
David A. Holland
dholl...@netbsd.org


Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread David Holland
(from a while back)

On Wed, Sep 28, 2016 at 02:27:39PM +, paul.kon...@dell.com wrote:
 > > On Sep 28, 2016, at 7:22 AM, Jaromír Doleček  
 > > wrote:
 > > I think it's a fair assessment to say that on SATA with NCQ/31 tags (max
 > > is actually 31, not 32 tags), it's pretty much impossible to have
 > > acceptable write performance without using the write cache. We could never
 > > saturate even a drive with a 16MB cache with just 31 tags and 64k maxphys.
 > > So it's IMO not useful to design for a world without disk drive write
 > > cache.
 > 
 > I think that depends on the software.  In a SAN storage array I
 > work on, we used to use SATA drives, always with cache disabled to
 > avoid data loss due to power failure.  We had them running just
 > fine with NCQ.  (For that matter, even without NCQ, though that
 > takes major effort.)

Well, there's two things going on here. One is the parallelism limit,
which you can work around by having more drives, e.g. in an array.
The other is netbsd's 64k MAXPHYS issue, which is our own problem that
we could put time into. (And in fact, it would be nice to get
tls-maxphys into -8... anyone have time? I don't. sigh)

However, I'm missing something. The I/O queue depths that you need to
get peak write performance from SSDs are larger than 31, and the test
labs appear to have been able to do this with SATA-attached SSDs...
what are/were they doing?

-- 
David A. Holland
dholl...@netbsd.org


Re: Plan: journalling fixes for WAPBL

2016-09-28 Thread Paul.Koning

> On Sep 28, 2016, at 7:22 AM, Jaromír Doleček  
> wrote:
> 
> I think it's a fair assessment to say that on SATA with NCQ/31 tags (max
> is actually 31, not 32 tags), it's pretty much impossible to have
> acceptable write performance without using the write cache. We could never
> saturate even a drive with a 16MB cache with just 31 tags and 64k maxphys.
> So it's IMO not useful to design for a world without disk drive write
> cache.

I think that depends on the software.  In a SAN storage array I work on, we 
used to use SATA drives, always with cache disabled to avoid data loss due to 
power failure.  We had them running just fine with NCQ.  (For that matter, even 
without NCQ, though that takes major effort.)

So perhaps an optimization effort is called for, if people view this 
performance issue as worth the trouble.  Or you might decide that for 
performance SAS is the answer, and SATA is only for non-critical applications.

paul



Re: Plan: journalling fixes for WAPBL

2016-09-28 Thread Jaromír Doleček
I think it's a fair assessment to say that on SATA with NCQ/31 tags (max
is actually 31, not 32 tags), it's pretty much impossible to have
acceptable write performance without using the write cache. We could never
saturate even a drive with a 16MB cache with just 31 tags and 64k maxphys.
So it's IMO not useful to design for a world without disk drive write
cache.

Back to discussion about B_ORDERED:

As was said before, the SCSI ORDERED tag does precisely what we want for
the journal commit record - it forces all previous commands sent to the
controller to be finished before the one with the ORDERED tag is
processed, and any commands sent after the ORDERED-tagged one are
executed only after the ordered command is finished. No need
for any bufq magic there, which is wonderful. Too bad that NCQ doesn't
provide this.

That said, we still need to be sure that all the previous commands
were sent prior to pushing the ORDERED command to the SCSI controller. Are
there any SCSI controllers with multiple submission queues (like
NVMe), regardless of our scsipi layer MP limitations?

FWIW, AHCI is single-threaded by design; every command submission has
to write to the same set of registers.
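
To illustrate, a toy model of a hypothetical B_ORDERED buf flag mapped
to a SCSI ORDERED tag; the flag, the struct and the helpers are all
invented for the example, none of this is existing NetBSD API:

#include <stdio.h>

/* Hypothetical flag: ask the driver to issue this write with a SCSI
 * ORDERED tag instead of a SIMPLE tag. */
#define B_ORDERED 0x01

struct buf {
    int  b_flags;
    long b_blkno;
};

/* Stand-in for a driver strategy routine: with an ORDERED tag the
 * controller itself guarantees that every earlier SIMPLE-tagged write
 * finishes first and every later one waits, so the host needs no
 * bufq draining at all. */
static void
strategy(struct buf *bp)
{
    printf("write blk %ld with %s tag\n", bp->b_blkno,
        (bp->b_flags & B_ORDERED) ? "ORDERED" : "SIMPLE");
}

/* The journal commit record is the only write that needs ordering. */
static void
write_commit_record(struct buf *bp)
{
    bp->b_flags |= B_ORDERED;
    strategy(bp);
}

int
main(void)
{
    struct buf data = { 0, 100 }, commit = { 0, 200 };

    strategy(&data);              /* journal data: SIMPLE tag */
    write_commit_record(&commit); /* commit: ORDERED tag */
    return 0;
}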

Jaromir

2016-09-23 19:51 GMT+02:00 Manuel Bouyer :
> On Fri, Sep 23, 2016 at 01:46:09PM -0400, Thor Lancelot Simon wrote:
>> > > This seems like the key thing needed to avoid FUA: to implement fsync() 
>> > > you just wait for notifications of completion to be received, and once 
>> > > you have those for all requests pending when fsync was called, or 
>> > > started as part of the fsync, then you're done.
>> >
>> > *if you have the write cache disabled*
>>
>> *Running with the write cache enabled is a bad idea*
>
> On ATA devices, you can't permanently disable the write cache. You have
> to do it on every power cycle.
>
> Well this really needs to be carefully evaluated. With only 32 tags I'm not
> sure you can efficiently use recent devices with the write cache
> disabled (most enterprise disks have a 64M cache these days)
>
> --
> Manuel Bouyer 
>  NetBSD: 26 ans d'experience feront toujours la difference
> --


Re: Plan: journalling fixes for WAPBL

2016-09-24 Thread Warner Losh
On Sat, Sep 24, 2016 at 2:01 AM, David Holland  wrote:
> On Fri, Sep 23, 2016 at 07:51:32PM +0200, Manuel Bouyer wrote:
>  > > > *if you have the write cache disabled*
>  > >
>  > > *Running with the write cache enabled is a bad idea*
>  >
>  > On ATA devices, you can't permanently disable the write cache. You have
 >  > to do it on every power cycle.
>
> There are also drives that ignore attempts to turn off write caching.

These drives lie to the host and say that caching is off, when it
really is still on, right?

Warner


Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-24 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 01:02:26PM +, paul.kon...@dell.com wrote:
> 
> > > On Sep 23, 2016, at 5:49 AM, Edgar Fuß  wrote:
> > 
> >> The whole point of tagged queueing is to let you *not* set [the write 
> >> cache] bit in the mode pages and still get good performance.
> > I don't get that. My understanding was that TCQ allowed the drive to 
> > re-order 
> > commands within the bounds described by the tags. With the write cache 
> > disabled, all write commands must hit stable storage before being reported 
> > completed. So what's the point of tagging with caching disabled?
> 
> I'm not sure.  But I have the impression that in the real world tagging is 
> rarely, if ever, used.

I'm not sure what you mean.  Do you mean that tagging is rarely, if ever,
used _to establish write barriers_, or do you mean that tagging is rarely,
if ever, used, period?

If the latter, you're way, way wrong.

Thor


Re: Plan: journalling fixes for WAPBL

2016-09-24 Thread David Holland
On Fri, Sep 23, 2016 at 07:51:32PM +0200, Manuel Bouyer wrote:
 > > > *if you have the write cache disabled*
 > > 
 > > *Running with the write cache enabled is a bad idea*
 > 
 > On ATA devices, you can't permanently disable the write cache. You have
 > to do it on every power cycle.

There are also drives that ignore attempts to turn off write caching.

-- 
David A. Holland
dholl...@netbsd.org


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Paul.Koning

> On Sep 23, 2016, at 10:51 AM, Warner Losh  wrote:
> 
> On Fri, Sep 23, 2016 at 7:38 AM, Thor Lancelot Simon  wrote:
>> On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote:
>>> On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote:
> AFAIK ordered tags only guarantee that the writes will happen in order,
> but not that the writes are actually done to stable storage.
 
 The target's not allowed to report the command complete unless the data
 are on stable storage, except if you have write cache enable set in the
 relevant mode page.
 
 If you run SCSI drives like that, you're playing with fire.  Expect to get
 burned.  The whole point of tagged queueing is to let you *not* set that
 bit in the mode pages and still get good performance.
>>> 
>>> Now I remember that I did indeed disable disk write cache when I had
>>> scsi disks in production. It's been a while though.
>>> 
>>> But anyway, from what I remember you still need the disk cache flush
>>> operation for SATA, even with NCQ. It's not equivalent to the SCSI tags

Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-23 Thread Paul.Koning

> On Sep 23, 2016, at 5:49 AM, Edgar Fuß  wrote:
> 
>> The whole point of tagged queueing is to let you *not* set [the write 
>> cache] bit in the mode pages and still get good performance.
> I don't get that. My understanding was that TCQ allowed the drive to re-order 
> commands within the bounds described by the tags. With the write cache 
> disabled, all write commands must hit stable storage before being reported 
> completed. So what's the point of tagging with caching disabled?

I'm not sure.  But I have the impression that in the real world tagging is 
rarely, if ever, used.

paul



Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Warner Losh
On Fri, Sep 23, 2016 at 11:54 AM, Warner Losh  wrote:
> On Fri, Sep 23, 2016 at 11:20 AM, Thor Lancelot Simon  wrote:
>> On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote:
>>> On September 23, 2016 10:51:30 AM EDT, Warner Losh  wrote:
>>> >All NCQ gives you is the ability to schedule multiple requests and
>>> >to get notification of their completion (perhaps out of order). There are
>>> >no coherency features at all in NCQ.
>>>
>>> This seems like the key thing needed to avoid FUA: to implement fsync() you 
>>> just wait for notifications of completion to be received, and once you have 
>>> those for all requests pending when fsync was called, or started as part of 
>>> the fsync, then you're done.
>>
>> The other key point is that -- unless SATA NCQ is radically different from
>> SCSI tagged queuing in a particularly stupid way -- the rules require all
>> "simple" tags to be completed before any "ordered" tag is completed.  That 
>> is,
>> ordered tags are barriers against all simple tags.
>
> SATA NCQ doesn't have ordered tags. There's just 32 slots to send
> requests into. Don't allow the word 'tag' to confuse you into thinking
> it is anything at all like SCSI tags. You get ordering by not
> scheduling anything until after the queue has drained when you send
> your "ordered" command. It is that stupid.

And it can be even worse, since if the 'ordered' item must complete
after everything before it, you have to drain the queue before you can
even send it to the drive. It depends on what ordering guarantees you
want...
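
To make the cost concrete, here is a toy model of that
drain-around-a-barrier dance (all names invented for illustration; a
real driver tracks the slots in hardware and sleeps on completion
interrupts):

#include <stdio.h>

#define NCQ_SLOTS 31        /* max outstanding NCQ commands */

static int active;          /* commands currently in flight */

static void
wait_all_complete(void)
{
    printf("drain: waiting for %d command(s)\n", active);
    active = 0;             /* toy stand-in for completion handling */
}

static void
issue_ncq(long blkno)
{
    if (active == NCQ_SLOTS)    /* all 31 tags busy: must wait */
        wait_all_complete();
    active++;
    printf("issue blk %ld (%d in flight)\n", blkno, active);
}

/* NCQ has no ordered tags, so a write that must not be reordered can
 * only go out alone: drain the queue, send it, and (if it must also
 * complete before later I/O) drain again before resuming. */
static void
issue_barrier(long blkno)
{
    wait_all_complete();
    issue_ncq(blkno);
    wait_all_complete();
}

int
main(void)
{
    issue_ncq(10);
    issue_ncq(20);
    issue_barrier(99);      /* e.g. a journal commit record */
    issue_ncq(30);
    return 0;
}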

Warner


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Warner Losh
On Fri, Sep 23, 2016 at 11:20 AM, Thor Lancelot Simon  wrote:
> On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote:
>> On September 23, 2016 10:51:30 AM EDT, Warner Losh  wrote:
>> >All NCQ gives you is the ability to schedule multiple requests and
>> >to get notification of their completion (perhaps out of order). There are
>> >no coherency features at all in NCQ.
>>
>> This seems like the key thing needed to avoid FUA: to implement fsync() you 
>> just wait for notifications of completion to be received, and once you have 
>> those for all requests pending when fsync was called, or started as part of 
>> the fsync, then you're done.
>
> The other key point is that -- unless SATA NCQ is radically different from
> SCSI tagged queuing in a particularly stupid way -- the rules require all
> "simple" tags to be completed before any "ordered" tag is completed.  That is,
> ordered tags are barriers against all simple tags.

SATA NCQ doesn't have ordered tags. There's just 32 slots to send
requests into. Don't allow the word 'tag' to confuse you into thinking
it is anything at all like SCSI tags. You get ordering by not
scheduling anything until after the queue has drained when you send
your "ordered" command. It is that stupid.

Warner


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Manuel Bouyer
On Fri, Sep 23, 2016 at 01:46:09PM -0400, Thor Lancelot Simon wrote:
> > > This seems like the key thing needed to avoid FUA: to implement fsync() 
> > > you just wait for notifications of completion to be received, and once 
> > > you have those for all requests pending when fsync was called, or started 
> > > as part of the fsync, then you're done.
> > 
> > *if you have the write cache disabled*
> 
> *Running with the write cache enabled is a bad idea*

On ATA devices, you can't permanently disable the write cache. You have
to do it on every power cycle.

Well this really needs to be carefully evaluated. With only 32 tags I'm not
sure you can efficiently use recent devices with the write cache
disabled (most enterprise disks have a 64M cache these days)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 07:45:00PM +0200, Manuel Bouyer wrote:
> On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote:
> > On September 23, 2016 10:51:30 AM EDT, Warner Losh  wrote:
> > >All NCQ gives you is the ability to schedule multiple requests and
> > >to get notification of their completion (perhaps out of order). There are
> > >no coherency features at all in NCQ.
> > 
> > This seems like the key thing needed to avoid FUA: to implement fsync() you 
> > just wait for notifications of completion to be received, and once you have 
> > those for all requests pending when fsync was called, or started as part of 
> > the fsync, then you're done.
> 
> *if you have the write cache disabled*

*Running with the write cache enabled is a bad idea*



Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Manuel Bouyer
On Fri, Sep 23, 2016 at 01:20:09PM -0400, Thor Lancelot Simon wrote:
> On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote:
> > On September 23, 2016 10:51:30 AM EDT, Warner Losh  wrote:
> > >All NCQ gives you is the ability to schedule multiple requests and
> > >to get notification of their completion (perhaps out of order). There are
> > >no coherency features at all in NCQ.
> > 
> > This seems like the key thing needed to avoid FUA: to implement fsync() you 
> > just wait for notifications of completion to be received, and once you have 
> > those for all requests pending when fsync was called, or started as part of 
> > the fsync, then you're done.
> 
> The other key point is that -- unless SATA NCQ is radically different from
> SCSI tagged queuing in a particularly stupid way -- the rules require all
> "simple" tags to be completed before any "ordered" tag is completed.  That is,
> ordered tags are barriers against all simple tags.

If I remember properly, there are only simple tags in ATA.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Manuel Bouyer
On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote:
> On September 23, 2016 10:51:30 AM EDT, Warner Losh  wrote:
> >All NCQ gives you is the ability to schedule multiple requests and
> >to get notification of their completion (perhaps out of order). There are
> >no coherency features at all in NCQ.
> 
> This seems like the key thing needed to avoid FUA: to implement fsync() you 
> just wait for notifications of completion to be received, and once you have 
> those for all requests pending when fsync was called, or started as part of 
> the fsync, then you're done.

*if you have the write cache disabled*

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Eric Haszlakiewicz
On September 23, 2016 10:51:30 AM EDT, Warner Losh  wrote:
>All NCQ gives you is the ability to schedule multiple requests and
>to get notification of their completion (perhaps out of order). There are
>no coherency features at all in NCQ.

This seems like the key thing needed to avoid FUA: to implement fsync() you 
just wait for notifications of completion to be received, and once you have 
those for all requests pending when fsync was called, or started as part of the 
fsync, then you're done.
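
A minimal sketch of that bookkeeping, using a submitted/completed
counter pair (names and structure invented for the example, with
userland pthreads for brevity). As noted in the follow-ups, this is a
durability guarantee only with the drive's write cache disabled:

#include <pthread.h>

struct ioq {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    unsigned long   submitted;  /* writes issued so far */
    unsigned long   completed;  /* completions received so far */
};

void
ioq_submit(struct ioq *q)
{
    pthread_mutex_lock(&q->lock);
    q->submitted++;
    pthread_mutex_unlock(&q->lock);
}

/* Called when the device reports a completion. */
void
ioq_complete(struct ioq *q)
{
    pthread_mutex_lock(&q->lock);
    q->completed++;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->lock);
}

/* fsync(): snapshot what was pending at entry and wait it out. */
void
ioq_fsync(struct ioq *q)
{
    unsigned long target;

    pthread_mutex_lock(&q->lock);
    target = q->submitted;
    while (q->completed < target)
        pthread_cond_wait(&q->cv, &q->lock);
    pthread_mutex_unlock(&q->lock);
}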

Eric



Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Manuel Bouyer
On Fri, Sep 23, 2016 at 09:38:44AM -0400, Thor Lancelot Simon wrote:
> > But anyway, from what I remember you still need the disk cache flush
> > operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.
> 
> I think that's true only if you're running with write cache enabled; but
> the difference is that most ATA disks ship with it turned on by default.

all of them have it turned on by default, and you can't permanently
disable it (you have to turn it off after each reset)

> 
> With an aggressive implementation of tag management on the host side,
> there should be no performance benefit from unconditionally enabling
> the write cache -- all the available cache should be used to stage
> writes for pending tags.  Sometimes it works.

With ATA you have only 32 tags ...

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Warner Losh
On Fri, Sep 23, 2016 at 7:38 AM, Thor Lancelot Simon  wrote:
> On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote:
>> On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote:
>> > > AFAIK ordered tags only guarantee that the writes will happen in order,
>> > > but not that the writes are actually done to stable storage.
>> >
>> > The target's not allowed to report the command complete unless the data
>> > are on stable storage, except if you have write cache enable set in the
>> > relevant mode page.
>> >
>> > If you run SCSI drives like that, you're playing with fire.  Expect to get
>> > burned.  The whole point of tagged queueing is to let you *not* set that
>> > bit in the mode pages and still get good performance.
>>
>> Now I remember that I did indeed disable disk write cache when I had
>> scsi disks in production. It's been a while though.
>>
>> But anyway, from what I remember you still need the disk cache flush
>> operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.

All NCQ gives you is the ability to schedule multiple requests and
to get notification of their completion (perhaps out of order). There are
no coherency features at all in NCQ.

> I think that's true only if you're running with write cache enabled; but
> the difference is that most ATA disks ship with it turned on by default.
>
> With an aggressive implementation of tag management on the host side,
> there should be no performance benefit from unconditionally enabling
> the write cache -- all the available cache should be used to stage
> writes for pending tags.  Sometimes it works.

You don't need to flush all the writes, but do need to take special care
if you need more coherent semantics, which often is a small minority
of the writes, so I would agree the effect can be mostly mitigated. Not
completely since any coherency point has to drain the queue completely.
The cache drain ops are non-NCQ, and to send non-NCQ requests
no NCQ requests can be pending. TRIM[*] commands are the same way.

Warner

[*] There is an NCQ version of TRIM, but it requires the AUX register
to be sent and very few SATA host controllers support that (though
AHCI does, many of the LSI controllers don't in any performant way).


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote:
> On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote:
> > > AFAIK ordered tags only guarantee that the writes will happen in order,
> > > but not that the writes are actually done to stable storage.
> > 
> > The target's not allowed to report the command complete unless the data
> > are on stable storage, except if you have write cache enable set in the
> > relevant mode page.
> > 
> > If you run SCSI drives like that, you're playing with fire.  Expect to get
> > burned.  The whole point of tagged queueing is to let you *not* set that
> > bit in the mode pages and still get good performance.
> 
> Now I remember that I did indeed disable disk write cache when I had
> scsi disks in production. It's been a while though.
> 
> But anyway, from what I remember you still need the disk cache flush
> operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.

I think that's true only if you're running with write cache enabled; but
the difference is that most ATA disks ship with it turned on by default.

With an aggressive implementation of tag management on the host side,
there should be no performance benefit from unconditionally enabling
the write cache -- all the available cache should be used to stage
writes for pending tags.  Sometimes it works.

-- 
  Thor Lancelot Simon  t...@panix.com

"The dirtiest word in art is the C-word.  I can't even say 'craft'
 without feeling dirty."-Chuck Close


Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-23 Thread David Holland
On Fri, Sep 23, 2016 at 11:49:50AM +0200, Edgar Fuß wrote:
 > > The whole point of tagged queueing is to let you *not* set [the write 
 > > cache] bit in the mode pages and still get good performance.
 >
 > I don't get that. My understanding was that TCQ allowed the drive
 > to re-order commands within the bounds described by the tags. With
 > the write cache disabled, all write commands must hit stable
 > storage before being reported completed. So what's the point of
 > tagging with caching disabled?

You can have more than one in flight at a time. Typically the more you
can manage to have pending at once, the better the performance,
especially with SSDs.

-- 
David A. Holland
dholl...@netbsd.org


FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-23 Thread Edgar Fuß
> The whole point of tagged queueing is to let you *not* set [the write 
> cache] bit in the mode pages and still get good performance.
I don't get that. My understanding was that TCQ allowed the drive to re-order 
commands within the bounds described by the tags. With the write cache 
disabled, all write commands must hit stable storage before being reported 
completed. So what's the point of tagging with caching disabled?


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Manuel Bouyer
On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote:
> > AFAIK ordered tags only guarantee that the writes will happen in order,
> > but not that the writes are actually done to stable storage.
> 
> The target's not allowed to report the command complete unless the data
> are on stable storage, except if you have write cache enable set in the
> relevant mode page.
> 
> If you run SCSI drives like that, you're playing with fire.  Expect to get
> burned.  The whole point of tagged queueing is to let you *not* set that
> bit in the mode pages and still get good performance.

Now I remember that I did indeed disable disk write cache when I had
scsi disks in production. It's been a while though.

But anyway, from what I remember you still need the disk cache flush
operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: Plan: journalling fixes for WAPBL

2016-09-22 Thread David Holland
On Thu, Sep 22, 2016 at 07:57:00AM +0800, Paul Goyette wrote:
 > While not particularly part of wapbl itself, I would like to see its
 > callers (ie, lfs) be more modular!

lfs is not related to wapbl, or even (now) ufs.

 > Currently, ffs (whether built-in or modular) has to be built with OPTIONS
 > WAPBL enabled in order to use wapbl.  And the ffs module has to "require"
 > the wapbl module.

This is because there is allegedly-filesystem-independent wapbl code
that was thought to maybe be reusable for additional block-journaling
implementations, e.g. ext3. I have always had doubts about this and it
hasn't panned out so far.

-- 
David A. Holland
dholl...@netbsd.org


Re: Plan: journalling fixes for WAPBL

2016-09-22 Thread Thor Lancelot Simon
On Thu, Sep 22, 2016 at 04:06:55PM +0200, Manuel Bouyer wrote:
> On Thu, Sep 22, 2016 at 07:50:27AM -0400, Thor Lancelot Simon wrote:
> > On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote:
> > > 
> > > 3.2 use FUA (Force Unit Access) for commit record write
> > > This avoids the need to issue even the second DIOCCACHESYNC, as flushing
> > > the disk cache is not really all that useful, I like the thread over
> > > at:
> > > http://yarchive.net/comp/linux/drive_caches.html
> > > Slightly less controversially, this would allow the rest of the
> > > journal records to be written asynchronously, leaving them to execute
> > > even after commit if so desired. It may be useful to have this
> > > behaviour optional. I lean towards skipping the disk cache flush as
> > > default behaviour however, if we implement write barrier for the
> > > commit record (see below).
> > > WAPBL would need to deal with drives without FUA, i.e. fall back to a cache 
> > > flush.
> > 
> > I have never understood this business about needing FUA to implement
> > barriers.  AFAICT, for any SCSI or SCSI-like disk device, all that is
> > actually needed is to do standard writes with simple tags, and barrier
> > writes with ordered tags.  What am I missing?
> 
 > AFAIK ordered tags only guarantee that the writes will happen in order,
> but not that the writes are actually done to stable storage.

The target's not allowed to report the command complete unless the data
are on stable storage, except if you have write cache enable set in the
relevant mode page.

If you run SCSI drives like that, you're playing with fire.  Expect to get
burned.  The whole point of tagged queueing is to let you *not* set that
bit in the mode pages and still get good performance.

Thor


Re: Plan: journalling fixes for WAPBL

2016-09-22 Thread Manuel Bouyer
On Thu, Sep 22, 2016 at 07:50:27AM -0400, Thor Lancelot Simon wrote:
> On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote:
> > 
> > 3.2 use FUA (Force Unit Access) for commit record write
> > This avoids the need to issue even the second DIOCCACHESYNC, as flushing
> > the disk cache is not really all that useful, I like the thread over
> > at:
> > http://yarchive.net/comp/linux/drive_caches.html
> > Slightly less controversially, this would allow the rest of the
> > journal records to be written asynchronously, leaving them to execute
> > even after commit if so desired. It may be useful to have this
> > behaviour optional. I lean towards skipping the disk cache flush as
> > default behaviour however, if we implement write barrier for the
> > commit record (see below).
> > WAPBL would need to deal with drives without FUA, i.e. fall back to a cache 
> > flush.
> 
> I have never understood this business about needing FUA to implement
> barriers.  AFAICT, for any SCSI or SCSI-like disk device, all that is
> actually needed is to do standard writes with simple tags, and barrier
> writes with ordered tags.  What am I missing?

AFAIK ordered tags only guarantee that the writes will happen in order,
but not that the writes are actually done to stable storage.
If you get a fsync() from userland, you have to do a cache flush (or FUA).

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: Plan: journalling fixes for WAPBL

2016-09-22 Thread Thor Lancelot Simon
On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote:
> 
> 3.2 use FUA (Force Unit Access) for commit record write
> This avoids the need to issue even the second DIOCCACHESYNC, as flushing
> the disk cache is not really all that useful, I like the thread over
> at:
> http://yarchive.net/comp/linux/drive_caches.html
> Slightly less controversially, this would allow the rest of the
> journal records to be written asynchronously, leaving them to execute
> even after commit if so desired. It may be useful to have this
> behaviour optional. I lean towards skipping the disk cache flush as
> default behaviour however, if we implement write barrier for the
> commit record (see below).
> WAPBL would need to deal with drives without FUA, i.e. fall back to a cache 
> flush.

I have never understood this business about needing FUA to implement
barriers.  AFAICT, for any SCSI or SCSI-like disk device, all that is
actually needed is to do standard writes with simple tags, and barrier
writes with ordered tags.  What am I missing?

I must have proposed adding a B_BARRIER or B_ORDERED at least five times
over the years.  There are always objections...

Thor


Re: Plan: journalling fixes for WAPBL

2016-09-22 Thread Taylor R Campbell
   Date: Wed, 21 Sep 2016 17:06:18 -0700
   From: Brian Buhrow 

   hello.  Does this discussion imply that the WAPBL log/journaling
   function is broken in NetBSD-current?  Are we back to straight FFS as it
   was before the days of WAPBL or softdep?  Please tell me I'm mistaken about
   this.  If so, that's quite a regression, even from NetBSD-5 where both
   WAPBL log and softdep work quite well.

It is no more broken than it was in netbsd-5.


Re: Plan: journalling fixes for WAPBL

2016-09-21 Thread Brian Buhrow
hello.  Does this discussion imply that the WAPBL log/journaling
function is broken in NetBSD-current?  Are we back to straight FFS as it
was before the days of WAPBL or softdep?  Please tell me I'm mistaken about
this.  If so, that's quite a regression, even from NetBSD-5 where both
WAPBL log and softdep work quite well.
-thanks
-Brian



Re: Plan: journalling fixes for WAPBL

2016-09-21 Thread Paul Goyette
I think adding 2.2 (cg stuff) would also be important to include for 
re-enabling by default.




Also consider:

While not particularly part of wapbl itself, I would like to see its 
callers (ie, lfs) be more modular!


Currently, ffs (whether built-in or modular) has to be built with 
OPTIONS WAPBL enabled in order to use wapbl.  And the ffs module has to 
"require" the wapbl module.


It would be desirable (at least for me) if ffs (and any future users of 
wapbl) could auto-load the wapbl module whenever it is needed.  IE, if 
an existing log-enabled file-system is mounted (or if a new log needs to 
be created), and possibly also when an existing log needs to be removed, 
after a 'tunefs -l 0'.
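
For illustration, the hook could be as small as the following;
module_autoload() is the existing kernel interface, but the wrapper,
its call site and the class filter are assumptions:

#include <sys/module.h>

/* Called from the ffs mount path when a log-enabled filesystem is
 * detected: triggers loading of the wapbl module if it isn't built in
 * or already present. */
static int
ffs_wapbl_ensure_loaded(void)
{
    return module_autoload("wapbl", MODULE_CLASS_ANY);
}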


This is probably beyond what you expected to do, but I just thought to 
"throw it out there" to get it in everyone's radar screens.  :)




On Thu, 22 Sep 2016, Jaromír Doleček wrote:


Hi,

I've been poking around in the WAPBL sources and some of the email
threads, also read the doc/roadmaps comments, so I'm aware of some of
the sentiment.

I think it would still be useful to get WAPBL safe to enable by
default again in NetBSD. Neither lfs64 nor the Harvard journalling fs is
currently in tree, so it's unknown when they would be stable enough to
replace ffs by default. Also, I think that it is useful to keep some
kind of generic[*] journalling code, perhaps for use also with ext2fs
or maybe xfs one day.

In either case, IMO it is good to also do some generic system
improvements usable by any journalling solution.

I see the following groups of useful changes. Reasonably for the -8
timeframe, IMO only group one really needs to be resolved to safely
enable wapbl journalling by default.

1. critical fixes for WAPBL
2. less critical fixes for WAPBL
3. performance improvements for WAPBL
4. disk subsystem and journalling-related improvements

1. Critical fixes for WAPBL
1.1 kern/47146 kernel panic when many files are unlinked
1.2 kern/50725 discard handling
1.3 kern/49175 degenerate truncate() case - too embarrassing to leave in

2. Less critical fixes for WAPBL
2.1 kern/45676 flush semantics

2.2 (no PR) make group descriptor updates part of the change transaction
The transaction that changed the group descriptor should also contain
the cg block write. Right now the group descriptor blocks are written
to disk during filesystem sync via a separate transaction, so quite
frequently they do not survive a crash that happens before the sync. Normally
fsck fixes these easily using inode metadata, but fsck is skipped for
journalled filesystems. This IMO can lead to incorrect block
allocation until fsck is actually run.

2.3 file data leaks on crashes
File data content blocks are written asynchronously, so some of them can
make it to the disk before the journal is committed; hence blocks can end
up in a different file after a system crash. FFS always had this, even
with softdep, albeit in more limited form.

2.4 buffer blocks kept in memory until commit
Buffer cache bufs are kept in memory with the B_LOCKED flag by wapbl,
starving the buffer cache subsystem.

3. WAPBL performance fixes
3.1 checksum journal data for commit
Avoid one of the two DIOCCACHESYNCs by computing a checksum over the data
and storing it in the commit record; there is even a field for it already,
so it's just a matter of implementation. There is, however, a possible CPU
use concern: crc32c is a good candidate hash, but do we need to have hash
alternatives?
This seems to be reasonably simple to implement; it needs just some hooks
into the journal write and journal replay logic.

3.2 use FUA (Force Unit Access) for commit record write
This avoids the need to issue even the second DIOCCACHESYNC, as flushing
the disk cache is not really all that useful, I like the thread over
at:
http://yarchive.net/comp/linux/drive_caches.html
Slightly less controversially, this would allow the rest of the
journal records to be written asynchronously, leaving them to execute
even after commit if so desired. It may be useful to have this
behaviour optional. I lean towards skipping the disk cache flush as
default behaviour however, if we implement write barrier for the
commit record (see below).
WAPBL would need to deal with drives without FUA, i.e. fall back to a cache flush.

3.3 async, or 'group sync' writes
Submit all the journal block writes to the drive at once, instead of
writing the blocks synchronously one by one. We could even make the
journal block writes completely async if we have the commit record
checksum.
Implementing the 'group sync' write would be quite simple; making it fully
async is more difficult and actually not very useful for journalling,
since the commit would force those writes to the disk drive anyway if it's
a write barrier (see below)

4. disk subsystem and journalling-related improvements
4.1 write barriers
The current DIOCCACHESYNC has a problem in that it can be quite
easily I/O starved if the drive is very loaded. Normally, the drive
firmware flushes the disk buffer very soon (i.e. in the region of
milliseconds, when it has a full track of data), 

Plan: journalling fixes for WAPBL

2016-09-21 Thread Jaromír Doleček
Hi,

I've been poking around in the WAPBL sources and some of the email
threads, also read the doc/roadmaps comments, so I'm aware of some of
the sentiment.

I think it would still be useful to get WAPBL safe to enable by
default again in NetBSD. Neither lfs64 nor the Harvard journalling fs is
currently in tree, so it's unknown when they would be stable enough to
replace ffs by default. Also, I think that it is useful to keep some
kind of generic[*] journalling code, perhaps for use also with ext2fs
or maybe xfs one day.

In either case, IMO it is good to also do some generic system
improvements usable by any journalling solution.

I see the following groups of useful changes. Reasonably for the -8
timeframe, IMO only group one really needs to be resolved to safely
enable wapbl journalling by default.

1. critical fixes for WAPBL
2. less critical fixes for WAPBL
3. performance improvements for WAPBL
4. disk subsystem and journalling-related improvements

1. Critical fixes for WAPBL
1.1 kern/47146 kernel panic when many files are unlinked
1.2 kern/50725 discard handling
1.3 kern/49175 degenerate truncate() case - too embarrassing to leave in

2. Less critical fixes for WAPBL
2.1 kern/45676 flush semantics

2.2 (no PR) make group descriptor updates part of the change transaction
The transaction that changed the group descriptor should also contain
the cg block write. Right now the group descriptor blocks are written
to disk during filesystem sync via a separate transaction, so quite
frequently they do not survive a crash that happens before the sync. Normally
fsck fixes these easily using inode metadata, but fsck is skipped for
journalled filesystems. This IMO can lead to incorrect block
allocation until fsck is actually run.

2.3 file data leaks on crashes
File data content blocks are written asynchronously, so some of them can
make it to the disk before the journal is committed; hence blocks can end
up in a different file after a system crash. FFS always had this, even
with softdep, albeit in more limited form.

2.4 buffer blocks kept in memory until commit
Buffer cache bufs are kept in memory with the B_LOCKED flag by wapbl,
starving the buffer cache subsystem.

3. WAPBL performance fixes
3.1 checksum journal data for commit
Avoid one of the two DIOCCACHESYNCs by computing a checksum over the data
and storing it in the commit record; there is even a field for it already,
so it's just a matter of implementation. There is, however, a possible CPU
use concern: crc32c is a good candidate hash, but do we need to have hash
alternatives?
This seems to be reasonably simple to implement; it needs just some hooks
into the journal write and journal replay logic.
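
A rough sketch of what 3.1 boils down to; the record layout is
simplified, the helper names are invented, and the bitwise CRC is only
there to keep the example self-contained (a kernel would use a
table-driven or hardware-assisted crc32c):

#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32C (Castagnoli), small and slow but self-contained. */
static uint32_t
crc32c(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82f63b78 & -(crc & 1));
    }
    return ~crc;
}

struct commit_record {      /* simplified: the real record has more fields */
    uint64_t wc_head;
    uint64_t wc_tail;
    uint32_t wc_checksum;   /* the already-reserved checksum field */
};

/* At commit time: checksum the journal data the record covers; the
 * cache flush that used to order the data before the commit becomes
 * unnecessary. */
void
commit_seal(struct commit_record *wc, const void *jdata, size_t jlen)
{
    wc->wc_checksum = crc32c(0, jdata, jlen);
}

/* At replay time: a commit that reached disk before its data simply
 * fails the check and the transaction is discarded. */
int
commit_valid(const struct commit_record *wc, const void *jdata, size_t jlen)
{
    return wc->wc_checksum == crc32c(0, jdata, jlen);
}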

3.2 use FUA (Force Unit Access) for commit record write
This avoids the need to issue even the second DIOCCACHESYNC, as flushing
the disk cache is not really all that useful, I like the thread over
at:
http://yarchive.net/comp/linux/drive_caches.html
Slightly less controversially, this would allow the rest of the
journal records to be written asynchronously, leaving them to execute
even after commit if so desired. It may be useful to have this
behaviour optional. I lean towards skipping the disk cache flush as
default behaviour however, if we implement write barrier for the
commit record (see below).
WAPBL would need to deal with drives without FUA, i.e. fall back to a cache flush.
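
A rough sketch of that commit path; B_MEDIA_FUA, bwrite_sync() and
disk_cache_sync() are invented names for the example, with the latter
standing in for the existing DIOCCACHESYNC ioctl:

#include <stdbool.h>

struct buf;                         /* kernel buffer, opaque here */

#define B_MEDIA_FUA 0x01            /* assumed per-write FUA flag */

int bwrite_sync(struct buf *, int); /* assumed: write and wait */
int disk_cache_sync(void *);        /* assumed wrapper for DIOCCACHESYNC */

/* Commit-record write: one FUA write when the drive supports it,
 * otherwise an ordinary write followed by a full cache flush. */
int
write_commit(void *dev, struct buf *bp, bool drive_has_fua)
{
    int error;

    if (drive_has_fua)              /* data is stable on completion */
        return bwrite_sync(bp, B_MEDIA_FUA);

    error = bwrite_sync(bp, 0);     /* plain write... */
    if (error == 0)                 /* ...then flush the whole cache */
        error = disk_cache_sync(dev);
    return error;
}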

3.3 async, or 'group sync' writes
Submit all the journal block writes to the drive at once, instead of
writing the blocks synchronously one by one. We could even make the
journal block writes completely async if we have the commit record
checksum.
Implementing the 'group sync' write would be quite simple; making it fully
async is more difficult and actually not very useful for journalling,
since the commit would force those writes to the disk drive anyway if it's
a write barrier (see below)
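
A rough sketch of the 'group sync' write; bawrite_start() and
biowait_one() are invented stand-ins for starting one async journal
block write and waiting for one buf to finish:

struct buf;

void bawrite_start(struct buf *);   /* assumed: start one async write */
int  biowait_one(struct buf *);     /* assumed: wait for that buf */

int
journal_write_blocks(struct buf **bufs, int nbufs)
{
    int i, e, error = 0;

    for (i = 0; i < nbufs; i++)     /* queue everything at once... */
        bawrite_start(bufs[i]);

    for (i = 0; i < nbufs; i++) {   /* ...then collect completions */
        e = biowait_one(bufs[i]);
        if (e != 0 && error == 0)
            error = e;              /* report the first failure */
    }
    return error;
}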

4. disk subsystem and journalling-related improvements
4.1 write barriers
The current DIOCCACHESYNC has a problem in that it can be quite
easily I/O starved if the drive is very loaded. Normally, the drive
firmware flushes the disk buffer very soon (i.e. in the region of
milliseconds, when it has a full track of data), but concurrent
disk activity might prevent it from doing so soon enough.
A more serious NetBSD kernel problem, however, is that DIOCCACHESYNC
bypasses bufq, so if there are any queued writes, DIOCCACHESYNC sends
the command to the disk before those writes are sent to the drive.
In order to avoid both problems, it would be good to have a way to mark
a buf as a barrier. bufq and/or disk routines would be changed to drain
the write queue before a barrier write is sent to the drive, and any later
writes would wait until the barrier write completes. On sane hardware like
SCSI/SAS, this could be almost completely offloaded to the controller
by just using ORDERED tags, without any need to drain the queue.
This would be semi-hard to implement, especially if it requires
changes to disk drivers.
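
A rough sketch of that barrier logic; B_BARRIER and the queue helpers
are invented names, not existing API:

#include <stdbool.h>

struct buf;                         /* kernel buffer, opaque here */

#define B_BARRIER 0x02              /* assumed new buf flag */

bool dev_has_ordered_tags(void);    /* SCSI/SAS: controller can order */
void start_io(struct buf *, bool ordered_tag);
void wait_queue_empty(void);        /* drain the bufq write queue */

void
strategy_barrier(struct buf *bp, int flags)
{
    if ((flags & B_BARRIER) == 0) {
        start_io(bp, false);
        return;
    }
    if (dev_has_ordered_tags()) {
        /* Offload the ordering to the controller. */
        start_io(bp, true);
        return;
    }
    wait_queue_empty();     /* everything before the barrier first */
    start_io(bp, false);
    wait_queue_empty();     /* barrier completes before later I/O */
}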

4.2 scsipi default to ORDERED tags, change to SIMPLE
From a quick scsipi_base.c inspection, it seems we use ordered tag