Re: FUA and TCQ

2016-09-26 Thread Michael van Elst
i...@bsdimp.com (Warner Losh) writes:

>I've not used any m.2 devices. These tests were raw dd's of 128k I/Os
>with one thread of execution, so no effective queueing at all.

gossam: {4} dd if=/dev/rdk0 bs=128k of=/dev/null count=100000
100000+0 records in
100000+0 records out
13107200000 bytes transferred in 8.766 secs (1495231576 bytes/sec)

That's about 40% below the nominal speed, due to syscall overhead
and no queuing. With bs=1024k the overhead is smaller; the device
is rated at 2.5GB/s for reading.

gossam: {7} dd if=/dev/rdk0 bs=1024k of=/dev/null count=10000 &
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 4.371 secs (2398938458 bytes/sec)


>Just ran a couple of tests and found dd of 4k blocks gave me 160MB/s,
>128k blocks gave me 600MB/s, 1M blocks gave me 636MB/s. Random
>read/write with 64 jobs and an I/O depth of 128 with 128k random reads
>with fio gave me 3.5GB/s. This particular drive is rated at 3.6GB/s.

Yes, those are similar results. With multiple dd's the numbers almost
add up until the CPUs become the bottleneck.
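
(For illustration, one crude way to get concurrent I/O out of nothing
but dd is to run several instances against disjoint regions of the raw
device; the offsets and counts here are arbitrary:

  dd if=/dev/rdk0 of=/dev/null bs=1024k count=1000 skip=0 &
  dd if=/dev/rdk0 of=/dev/null bs=1024k count=1000 skip=1000 &
  dd if=/dev/rdk0 of=/dev/null bs=1024k count=1000 skip=2000 &
  dd if=/dev/rdk0 of=/dev/null bs=1024k count=1000 skip=3000 &
  wait

Each instance reads its own 1GB window, so the device sees up to four
requests outstanding at any time instead of one.)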

However, I was looking for devices that even fail the dd test with
large buffers. Apparently there are devices where you must use
concurrent I/O operations to reach their nominal speed, otherwise
you only get a fraction (maybe 20-30%).

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: FUA and TCQ

2016-09-26 Thread Warner Losh
On Mon, Sep 26, 2016 at 8:27 AM, Michael van Elst  wrote:
> i...@bsdimp.com (Warner Losh) writes:
>
>>NVMe is even worse. There's one drive that w/o queueing I can barely
>>get 1GB/s out of. With queueing and multiple requests I can get the
>>spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get to
>>90-93Gbps that our 100Gbps boxes can do (though it is but one of
>>many things).
>
> Luckily the Samsung 950pro isn't of that type. Can you tell what
> NVMe devices (in particular in M.2 form factor) have that problem?

I've not used any m.2 devices. These tests were raw dd's of 128k I/Os
with one thread of execution, so no effective queueing at all. As
queueing gets involved, the performance increases dramatically as the
drive idle time drops substantially. I'd imagine most drives are like
this for the workload I was testing since you had to make a full
round-trip from the kernel to userland after the completion to get the
next I/O rather than having it already in the hardware... Unless
NetBSD's context switching is substantially faster than FreeBSD's, I'd
expect to see similar results there as well. Some cards do a little
better, but not by much... All cards do significantly better when
multiple transactions are scheduled simultaneously.

Just ran a couple of tests and found dd of 4k blocks gave me 160MB/s,
128k blocks gave me 600MB/s, 1M blocks gave me 636MB/s. Random
read/write with 64 jobs and an I/O depth of 128 with 128k random reads
with fio gave me 3.5GB/s. This particular drive is rated at 3.6GB/s.
This is for a HGST Ultrastar SN100. All numbers from FreeBSD. In
production, for unencrypted traffic, we see a similar number to the
deep queue fio test. While I've not tried it on NetBSD, I'd be surprised
if you got significantly more than these numbers, due to the round trip
to userland vs. having the next request already present in the drive...
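
(For reference, a fio invocation roughly matching that description might
look like this; the device path and the posixaio ioengine are illustrative
choices, not necessarily the exact job file used here:

  fio --name=randread --filename=/dev/nvd0 --direct=1 --ioengine=posixaio \
      --rw=randread --bs=128k --iodepth=128 --numjobs=64 \
      --runtime=60 --time_based --group_reporting

With 64 jobs at an I/O depth of 128 the drive sees thousands of commands
outstanding, which is what lets it approach the spec-sheet rate.)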

Warner


Re: FUA and TCQ

2016-09-26 Thread Michael van Elst
b...@softjar.se (Johnny Billquist) writes:

>Good point. In which case (if I read you right), it's not the reordering 
>that matters, but the simple case of being able to queue up several 
>operations, to keep the disk busy.

For sequential reading we are currently limited to 8 operations in
flight (uvm readahead). This is less of an issue for local disks, but
it has a big impact on iSCSI. It also makes reading through the
filesystem faster than reading from the raw disk device.
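
(To put rough numbers on why iSCSI suffers: assuming, purely for
illustration, 64k readahead chunks and a 1ms round trip to the target,
8 requests in flight cap sequential throughput at about
8 * 64k / 1ms = 512MB/s; at a 10ms round trip the ceiling drops to
roughly 51MB/s, no matter how fast the target's disks are.)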

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: FUA and TCQ

2016-09-26 Thread Michael van Elst
i...@bsdimp.com (Warner Losh) writes:

>NVMe is even worse. There's one drive that w/o queueing I can barely
>get 1GB/s out of. With queueing and multiple requests I can get the
>spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get to
>90-93Gbps that our 100Gbps boxes can do (though it is but one of
>many things).

Luckily the Samsung 950pro isn't of that type. Can you tell what
NVMe devices (in particular in M.2 form factor) have that problem?

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-24 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 01:02:26PM +, paul.kon...@dell.com wrote:
> 
> > On Sep 23, 2016, at 5:49 AM, Edgar Fuß wrote:
> > 
> >> The whole point of tagged queueing is to let you *not* set [the write 
> >> cache] bit in the mode pages and still get good performance.
> > I don't get that. My understanding was that TCQ allowed the drive to 
> > re-order 
> > commands within the bounds described by the tags. With the write cache 
> > disabled, all write commands must hit stable storage before being reported 
> > completed. So what's the point of tagging with cacheing disabled?
> 
> I'm not sure.  But I have the impression that in the real world tagging is 
> rarely, if ever, used.

I'm not sure what you mean.  Do you mean that tagging is rarely, if ever,
used _to establish write barriers_, or do you mean that tagging is rarely,
if ever, used, period?

If the latter, you're way, way wrong.

Thor


Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-23 Thread Paul.Koning

> On Sep 23, 2016, at 5:49 AM, Edgar Fuß  wrote:
> 
>> The whole point of tagged queueing is to let you *not* set [the write 
>> cache] bit in the mode pages and still get good performance.
> I don't get that. My understanding was that TCQ allowed the drive to re-order 
> commands within the bounds described by the tags. With the write cache 
> disabled, all write commands must hit stable storage before being reported 
> completed. So what's the point of tagging with cacheing disabled?

I'm not sure.  But I have the impression that in the real world tagging is 
rarely, if ever, used.

paul



Re: FUA and TCQ

2016-09-23 Thread Warner Losh
On Fri, Sep 23, 2016 at 8:05 AM, Thor Lancelot Simon  wrote:
> Our storage stack's inability to use tags with SATA targets is a huge
> gating factor for performance with real workloads (the residual use of
> the kernel lock at and below the bufq layer is another).

FreeBSD's storage stack does support NCQ. When that's artificially
turned off, performance drops on a certain brand of SSDs from about
500-550MB/s for large reads down to 200-300MB/s, depending on
too many factors to go into here. It helps a lot with real workloads and
is critical for Netflix to get 36-38Gbps out of our 40Gbps systems.
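
(If you want to reproduce that on FreeBSD, one way to approximate
"artificially turned off" is to shrink the tag count with camcontrol;
the device name is illustrative:

  camcontrol tags ada0 -N 1    # queue depth of 1, effectively no NCQ
  camcontrol tags ada0 -N 32   # restore a typical NCQ depth

This isn't necessarily how the numbers above were produced, just a way
to see the effect for yourself.)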

> Starting de
> novo with NVMe, where it's perverse and structurally difficult to not
> support multiple commands in flight simultaneously, will help some, but
> SATA SSDs are going to be around for a long time still and it'd be
> great if this limitation went away.

NVMe is even worse. There's one drive that w/o queueing I can barely
get 1GB/s out of. With queueing and multiple requests I can get the
spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get to
90-93Gbps that our 100Gbps boxes can do (though it is but one of
many things).

> That said, I am not going to fix it myself so all I can do is sit here
> and pontificate -- which is worth about what you paid for it, and no
> more.

Yea, I'm just a FreeBSD guy lurking here.

Warner


Re: FUA and TCQ

2016-09-23 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 09:38:08AM -0400, Greg Troxel wrote:
> 
> Johnny Billquist  writes:
> 
> > With rotating rust, the order of operations can make a huge difference
> > in speed. With SSDs you don't have those seek times to begin with, so
> > I would expect the gains to be marginal.
> 
> For reordering, I agree with you, but the SSD speeds are so high that
> pipeling is probably necessary to keep the SSD from stalling due to not
> having enough data to write.  So this could help move from 300 MB/s
> (that I am seeing) to 550 MB/s.

The iSCSI case is illustrative, too.  Now you can have a "SCSI bus" with
a huge bandwidth delay product.  It doesn't matter how quickly the target
says it finished one command (which is all enabling the write-cache can get
you) if you are working in lockstep such that the initiator cannot send
more commands until it receives the target's ack.

This is why on iSCSI you really do see hundreds of tags in flight at
once.  You can pump up the request size, but that causes fairness
problems.  Keeping many commands active at the same time helps much more.
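
(Rough numbers, purely illustrative: with 128k commands and a 1ms round
trip, strict lockstep tops out at 128k / 1ms = 128MB/s no matter how
fast the link or the target is; with 32 commands in flight the ceiling
rises to about 4GB/s.)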

Now think about that SSD again.  The SSD's write latency is so low that
_relative to the time it takes the host to issue the next command_ you
have the same problem.  It's clear that enabling the write cache can't
really help, or at least can't help much: you need to have many commands
pending at the same time.

Our storage stack's inability to use tags with SATA targets is a huge
gating factor for performance with real workloads (the residual use of
the kernel lock at and below the bufq layer is another).  Starting de
novo with NVMe, where it's perverse and structurally difficult to not
support multiple commands in flight simultaneously, will help some, but
SATA SSDs are going to be around for a long time still and it'd be
great if this limitation went away.

That said, I am not going to fix it myself so all I can do is sit here
and pontificate -- which is worth about what you paid for it, and no
more.

Thor


Re: FUA and TCQ

2016-09-23 Thread Johnny Billquist

On 2016-09-23 15:38, Greg Troxel wrote:


> Johnny Billquist  writes:
>
>> With rotating rust, the order of operations can make a huge difference
>> in speed. With SSDs you don't have those seek times to begin with, so
>> I would expect the gains to be marginal.
>
> For reordering, I agree with you, but the SSD speeds are so high that
> pipelining is probably necessary to keep the SSD from stalling due to not
> having enough data to write.  So this could help move from 300 MB/s
> (that I am seeing) to 550 MB/s.


Good point. In which case (if I read you right), it's not the reordering
that matters, but simply being able to queue up several operations to
keep the disk busy, and potentially to run several disks in parallel,
keeping them all busy. And of course the pre-processing work done before
a command is queued can happen while the controller is busy. There are
many potential gains here.


Johnny

--
Johnny Billquist  || "I'm on a bus
  ||  on a psychedelic trip
email: b...@softjar.se ||  Reading murder books
pdp is alive! ||  tryin' to stay hip" - B. Idol


Re: FUA and TCQ

2016-09-23 Thread Greg Troxel

Johnny Billquist  writes:

> With rotating rust, the order of operations can make a huge difference
> in speed. With SSDs you don't have those seek times to begin with, so
> I would expect the gains to be marginal.

For reordering, I agree with you, but the SSD speeds are so high that
pipelining is probably necessary to keep the SSD from stalling due to not
having enough data to write.  So this could help move from 300 MB/s
(that I am seeing) to 550 MB/s.




Re: FUA and TCQ

2016-09-23 Thread Johnny Billquist

On 2016-09-23 13:05, David Holland wrote:

> On Fri, Sep 23, 2016 at 11:49:50AM +0200, Edgar Fuß wrote:
>  > > The whole point of tagged queueing is to let you *not* set [the write
>  > > cache] bit in the mode pages and still get good performance.
>  >
>  > I don't get that. My understanding was that TCQ allowed the drive
>  > to re-order commands within the bounds described by the tags. With
>  > the write cache disabled, all write commands must hit stable
>  > storage before being reported completed. So what's the point of
>  > tagging with cacheing disabled?
>
> You can have more than one in flight at a time. Typically the more you
> can manage to have pending at once, the better the performance,
> especially with SSDs.


I'd say especially with rotating rust, but either way... :-)
Yes, that's the whole point of tagged queuing: issue many operations,
and let the disk and controller sort out the order in which to do them
most efficiently.


With rotating rust, the order of operations can make a huge difference 
in speed. With SSDs you don't have those seek times to begin with, so I 
would expect the gains to be marginal.


Johnny

--
Johnny Billquist  || "I'm on a bus
  ||  on a psychedelic trip
email: b...@softjar.se ||  Reading murder books
pdp is alive! ||  tryin' to stay hip" - B. Idol


Re: FUA and TCQ

2016-09-23 Thread Manuel Bouyer
On Fri, Sep 23, 2016 at 01:13:09PM +0200, Edgar Fuß wrote:
> > You can have more than one in flight at a time.
> My SCSI knowledge is probably out-dated. How can I have several commands 
> in flight concurrently?

This is what tagged queueing is for.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: FUA and TCQ

2016-09-23 Thread Edgar Fuß
> You can have more than one in flight at a time.
My SCSI knowledge is probably out-dated. How can I have several commands 
in flight concurrently?


Re: FUA and TCQ

2016-09-23 Thread Johnny Billquist

On 2016-09-23 11:49, Edgar Fuß wrote:

>> The whole point of tagged queueing is to let you *not* set [the write
>> cache] bit in the mode pages and still get good performance.
>
> I don't get that. My understanding was that TCQ allowed the drive to re-order
> commands within the bounds described by the tags. With the write cache
> disabled, all write commands must hit stable storage before being reported
> completed. So what's the point of tagging with cacheing disabled?


Totally independent of any caching, disk I/O performance can be greatly
improved by reordering operations to minimize disk head movement. Most
of a disk I/O's service time is head movement; I'd guess that makes up
about 90% of the time.
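
(Back-of-the-envelope, with typical figures rather than measurements:
about 8ms average seek plus about 4ms half-rotation at 7200rpm is
roughly 12ms of positioning, versus roughly 0.4ms to transfer 64k at
150MB/s, so positioning is well over 90% of a small random I/O's
service time.)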


Johnny

--
Johnny Billquist  || "I'm on a bus
  ||  on a psychedelic trip
email: b...@softjar.se ||  Reading murder books
pdp is alive! ||  tryin' to stay hip" - B. Idol


Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-23 Thread David Holland
On Fri, Sep 23, 2016 at 11:49:50AM +0200, Edgar Fuß wrote:
 > > The whole point of tagged queueing is to let you *not* set [the write 
 > > cache] bit in the mode pages and still get good performance.
 >
 > I don't get that. My understanding was that TCQ allowed the drive
 > to re-order commands within the bounds described by the tags. With
 > the write cache disabled, all write commands must hit stable
 > storage before being reported completed. So what's the point of
 > tagging with cacheing disabled?

You can have more than one in flight at a time. Typically the more you
can manage to have pending at once, the better the performance,
especially with SSDs.

-- 
David A. Holland
dholl...@netbsd.org


FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-23 Thread Edgar Fuß
> The whole point of tagged queueing is to let you *not* set [the write 
> cache] bit in the mode pages and still get good performance.
I don't get that. My understanding was that TCQ allowed the drive to re-order 
commands within the bounds described by the tags. With the write cache 
disabled, all write commands must hit stable storage before being reported 
completed. So what's the point of tagging with cacheing disabled?