Re: FUA and TCQ
i...@bsdimp.com (Warner Losh) writes:

>I've not used any M.2 devices. These tests were raw dd's of 128k I/Os
>with one thread of execution, so no effective queueing at all.

gossam: {4} dd if=/dev/rdk0 bs=128k of=/dev/null count=100000
100000+0 records in
100000+0 records out
13107200000 bytes transferred in 8.766 secs (1495231576 bytes/sec)

That's about 40% below the nominal speed due to syscall overhead and no
queueing. With bs=1024k the overhead is smaller; the device is rated at
2.5GB/s for reading.

gossam: {7} dd if=/dev/rdk0 bs=1024k of=/dev/null count=10000 &
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 4.371 secs (2398938458 bytes/sec)

>I just ran a couple of tests and found that dd with 4k blocks gave me
>160MB/s, 128k blocks gave me 600MB/s, and 1M blocks gave me 636MB/s.
>Random reads of 128k blocks with fio, using 64 jobs and an I/O depth
>of 128, gave me 3.5GB/s. This particular drive is rated at 3.6GB/s.

Yes, those are similar results. With multiple dd's the numbers almost
add up, until the CPUs become the bottleneck.

However, I was looking for devices that fail even the dd test with
large buffers. Apparently there are devices where you must use
concurrent I/O operations to reach their nominal speed, otherwise you
only get a fraction (maybe 20-30%).

-- 
-- 
                                Michael van Elst
Internet: mlel...@serpens.de
                                "A potential Snark may lurk in every tree."
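For what it's worth, the multiple-dd variant of this test can be
approximated from the shell like so; the device path, counts, and
offsets below are assumptions for illustration, not figures from the
post:

    # Four concurrent sequential readers, started 16GB apart so they do
    # not overlap; together they approximate a queue depth of four.
    for i in 0 1 2 3; do
        dd if=/dev/rdk0 bs=1024k count=10000 skip=$((i * 16384)) of=/dev/null &
    done
    wait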
Re: FUA and TCQ
On Mon, Sep 26, 2016 at 8:27 AM, Michael van Elst wrote:
> i...@bsdimp.com (Warner Losh) writes:
>
>>NVMe is even worse. There's one drive that w/o queueing I can barely
>>get 1GB/s out of. With queueing and multiple requests I can get the
>>spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get
>>to the 90-93Gbps that our 100Gbps boxes can do (though it is but one
>>of many things).
>
> Luckily the Samsung 950pro isn't of that type. Can you tell what
> NVMe devices (in particular in M.2 form factor) have that problem?

I've not used any M.2 devices. These tests were raw dd's of 128k I/Os
with one thread of execution, so no effective queueing at all. As
queueing gets involved, the performance increases dramatically, since
the drive's idle time drops substantially. I'd imagine most drives are
like this for the workload I was testing, because you have to make a
full round trip from the kernel to userland after each completion to
get the next I/O, rather than having it already queued in the hardware.
Unless NetBSD's context switching is substantially faster than
FreeBSD's, I'd expect to see similar results there as well. Some cards
do a little better, but not by much. All cards do significantly better
when multiple transactions are scheduled simultaneously.

I just ran a couple of tests and found that dd with 4k blocks gave me
160MB/s, 128k blocks gave me 600MB/s, and 1M blocks gave me 636MB/s.
Random reads of 128k blocks with fio, using 64 jobs and an I/O depth
of 128, gave me 3.5GB/s. This particular drive is rated at 3.6GB/s.
This is for an HGST Ultrastar SN100. All numbers are from FreeBSD. In
production, for unencrypted traffic, we see a number similar to the
deep-queue fio test. While I've not tried it on NetBSD, I'd be
surprised if you got significantly more than these numbers, due to the
round trip to userland versus having the next request already present
in the drive.

Warner
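A hedged reconstruction of the fio job described above; the device
path, ioengine, and runtime are assumptions, while the block size, job
count, and I/O depth match the post:

    fio --name=randread --filename=/dev/nvd0 --direct=1 \
        --rw=randread --bs=128k --numjobs=64 --iodepth=128 \
        --ioengine=posixaio --runtime=60 --time_based --group_reporting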
Re: FUA and TCQ
b...@softjar.se (Johnny Billquist) writes:

>Good point. In which case (if I read you right), it's not the reordering
>that matters, but the simple case of being able to queue up several
>operations, to keep the disk busy.

For sequential reading we are currently limited to 8 operations in
flight (uvm readahead). This is less of an issue for local disks, but
it has a big impact on iSCSI. It also makes reading through the
filesystem faster than reading from the raw disk device.

-- 
-- 
                                Michael van Elst
Internet: mlel...@serpens.de
                                "A potential Snark may lurk in every tree."
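One way to observe the readahead effect described above; the paths are
assumptions, and the file must be large enough (and the cache cold) for
the comparison to mean anything:

    # Raw device: one synchronous 1MB read outstanding at a time.
    dd if=/dev/rdk0 of=/dev/null bs=1024k count=4096

    # Through the filesystem: uvm readahead keeps up to 8 reads in
    # flight, which can beat the raw device despite the extra layer.
    dd if=/mnt/bigfile of=/dev/null bs=1024k count=4096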
Re: FUA and TCQ
i...@bsdimp.com (Warner Losh) writes:

>NVMe is even worse. There's one drive that w/o queueing I can barely
>get 1GB/s out of. With queueing and multiple requests I can get the
>spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get
>to the 90-93Gbps that our 100Gbps boxes can do (though it is but one
>of many things).

Luckily the Samsung 950pro isn't of that type. Can you tell what
NVMe devices (in particular in M.2 form factor) have that problem?

-- 
-- 
                                Michael van Elst
Internet: mlel...@serpens.de
                                "A potential Snark may lurk in every tree."
Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)
On Fri, Sep 23, 2016 at 01:02:26PM +0000, paul.kon...@dell.com wrote:
> 
> > On Sep 23, 2016, at 5:49 AM, Edgar Fuß wrote:
> > 
> >> The whole point of tagged queueing is to let you *not* set [the write
> >> cache] bit in the mode pages and still get good performance.
> > I don't get that. My understanding was that TCQ allowed the drive to
> > re-order commands within the bounds described by the tags. With the
> > write cache disabled, all write commands must hit stable storage
> > before being reported completed. So what's the point of tagging with
> > cacheing disabled?
> 
> I'm not sure. But I have the impression that in the real world tagging
> is rarely, if ever, used.

I'm not sure what you mean. Do you mean that tagging is rarely, if
ever, used _to establish write barriers_, or do you mean that tagging
is rarely, if ever, used, period?

If the latter, you're way, way wrong.

Thor
Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)
> On Sep 23, 2016, at 5:49 AM, Edgar Fuß wrote:
> 
>> The whole point of tagged queueing is to let you *not* set [the write
>> cache] bit in the mode pages and still get good performance.
> I don't get that. My understanding was that TCQ allowed the drive to
> re-order commands within the bounds described by the tags. With the
> write cache disabled, all write commands must hit stable storage
> before being reported completed. So what's the point of tagging with
> cacheing disabled?

I'm not sure. But I have the impression that in the real world tagging
is rarely, if ever, used.

	paul
Re: FUA and TCQ
On Fri, Sep 23, 2016 at 8:05 AM, Thor Lancelot Simon wrote:
> Our storage stack's inability to use tags with SATA targets is a huge
> gating factor for performance with real workloads (the residual use of
> the kernel lock at and below the bufq layer is another).

FreeBSD's storage stack does support NCQ. When that's artificially
turned off, performance drops on a certain brand of SSDs from about
500-550MB/s for large reads down to 200-300MB/s, depending on too many
factors to go into here. It helps a lot for real workloads and is
critical for Netflix to get a 36-38Gbps rate from our 40Gbps systems.

> Starting de
> novo with NVMe, where it's perverse and structurally difficult to not
> support multiple commands in flight simultaneously, will help some, but
> SATA SSDs are going to be around for a long time still and it'd be
> great if this limitation went away.

NVMe is even worse. There's one drive that w/o queueing I can barely
get 1GB/s out of. With queueing and multiple requests I can get the
spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get
to the 90-93Gbps that our 100Gbps boxes can do (though it is but one
of many things).

> That said, I am not going to fix it myself, so all I can do is sit here
> and pontificate -- which is worth about what you paid for it, and no
> more.

Yeah, I'm just a FreeBSD guy lurking here.

Warner
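On FreeBSD, turning NCQ "off" for a benchmark can be approximated by
pinning the tag depth with camcontrol(8); a sketch, with the device
name and restored depth as assumptions:

    # Show the current number of tag openings for the device.
    camcontrol tags ada0 -v

    # Pin the queue depth to 1 to approximate no NCQ for a test run.
    camcontrol tags ada0 -N 1

    # Restore a deeper queue afterwards.
    camcontrol tags ada0 -N 32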
Re: FUA and TCQ
On Fri, Sep 23, 2016 at 09:38:08AM -0400, Greg Troxel wrote:
> 
> Johnny Billquist writes:
> 
> > With rotating rust, the order of operations can make a huge difference
> > in speed. With SSDs you don't have those seek times to begin with, so
> > I would expect the gains to be marginal.
> 
> For reordering, I agree with you, but the SSD speeds are so high that
> pipelining is probably necessary to keep the SSD from stalling due to
> not having enough data to write. So this could help move from 300 MB/s
> (that I am seeing) to 550 MB/s.

The iSCSI case is illustrative, too. Now you can have a "SCSI bus" with
a huge bandwidth-delay product. It doesn't matter how quickly the
target says it finished one command (which is all enabling the write
cache can get you) if you are working in lockstep, such that the
initiator cannot send more commands until it receives the target's ack.
This is why on iSCSI you really do see hundreds of tags in flight at
once. You can pump up the request size, but that causes fairness
problems. Keeping many commands active at the same time helps much
more.

Now think about that SSD again. The SSD's write latency is so low that
_relative to the time it takes the host to issue a new command_ you
have the same problem. It's clear that enabling the write cache can't
really help, or at least can't help much: you need to have many
commands pending at the same time.

Our storage stack's inability to use tags with SATA targets is a huge
gating factor for performance with real workloads (the residual use of
the kernel lock at and below the bufq layer is another). Starting de
novo with NVMe, where it's perverse and structurally difficult to not
support multiple commands in flight simultaneously, will help some, but
SATA SSDs are going to be around for a long time still and it'd be
great if this limitation went away.

That said, I am not going to fix it myself, so all I can do is sit here
and pontificate -- which is worth about what you paid for it, and no
more.

Thor
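A back-of-the-envelope version of the lockstep argument; the 1 ms round
trip and 128k request size are illustrative assumptions, not figures
from the thread:

    # Ceiling with one command in flight = request_size / round_trip_time.
    echo 'scale=1; (128 * 1024) / 0.001 / 1000000' | bc
    # => ~131.0 MB/s, no matter how fast the target's media is.

    # With 64 commands in flight the same link supports:
    echo 'scale=1; 64 * (128 * 1024) / 0.001 / 1000000' | bc
    # => ~8388.6 MB/s, i.e. the wire or the media becomes the limit instead.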
Re: FUA and TCQ
On 2016-09-23 15:38, Greg Troxel wrote:
> Johnny Billquist writes:
>
>> With rotating rust, the order of operations can make a huge difference
>> in speed. With SSDs you don't have those seek times to begin with, so
>> I would expect the gains to be marginal.
>
> For reordering, I agree with you, but the SSD speeds are so high that
> pipelining is probably necessary to keep the SSD from stalling due to
> not having enough data to write. So this could help move from 300 MB/s
> (that I am seeing) to 550 MB/s.

Good point. In which case (if I read you right), it's not the
reordering that matters, but the simple case of being able to queue up
several operations to keep the disk busy. And potentially running
several disks in parallel, keeping them all busy. We also have the
pre-processing work done before a command is queued, which can happen
while the controller is busy. There are many potential gains here.

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: b...@softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol
Re: FUA and TCQ
Johnny Billquist writes:

> With rotating rust, the order of operations can make a huge difference
> in speed. With SSDs you don't have those seek times to begin with, so
> I would expect the gains to be marginal.

For reordering, I agree with you, but the SSD speeds are so high that
pipelining is probably necessary to keep the SSD from stalling due to
not having enough data to write. So this could help move from 300 MB/s
(that I am seeing) to 550 MB/s.
Re: FUA and TCQ
On 2016-09-23 13:05, David Holland wrote:
> On Fri, Sep 23, 2016 at 11:49:50AM +0200, Edgar Fuß wrote:
>  > > The whole point of tagged queueing is to let you *not* set [the write
>  > > cache] bit in the mode pages and still get good performance.
>  >
>  > I don't get that. My understanding was that TCQ allowed the drive
>  > to re-order commands within the bounds described by the tags. With
>  > the write cache disabled, all write commands must hit stable
>  > storage before being reported completed. So what's the point of
>  > tagging with cacheing disabled?
>
> You can have more than one in flight at a time. Typically the more you
> can manage to have pending at once, the better the performance,
> especially with SSDs.

I'd say especially with rotating rust, but either way... :-)

Yes, that's the whole point of tagged queueing. Issue many operations,
and let the disk and controller sort out in which order to do them to
make it the most efficient.

With rotating rust, the order of operations can make a huge difference
in speed. With SSDs you don't have those seek times to begin with, so I
would expect the gains to be marginal.

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: b...@softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol
Re: FUA and TCQ
On Fri, Sep 23, 2016 at 01:13:09PM +0200, Edgar Fuß wrote:
> > You can have more than one in flight at a time.
> My SCSI knowledge is probably out-dated. How can I have several
> commands in flight concurrently?

This is what tagged queueing is for.

-- 
Manuel Bouyer
     NetBSD: 26 years of experience will always make the difference
--
Re: FUA and TCQ
> You can have more than one in flight at a time.

My SCSI knowledge is probably out-dated. How can I have several
commands in flight concurrently?
Re: FUA and TCQ
On 2016-09-23 11:49, Edgar Fuß wrote:
>> The whole point of tagged queueing is to let you *not* set [the write
>> cache] bit in the mode pages and still get good performance.
>
> I don't get that. My understanding was that TCQ allowed the drive to
> re-order commands within the bounds described by the tags. With the
> write cache disabled, all write commands must hit stable storage
> before being reported completed. So what's the point of tagging with
> cacheing disabled?

Totally independent of any caching, disk I/O performance can be greatly
improved by reordering operations to minimize disk head movement. Head
movement accounts for most of disk I/O time; I'd guess it makes up
about 90% of it.

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: b...@softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol
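Rough numbers behind the head-movement claim; the seek, rotation, and
media-rate figures are typical datasheet values, assumed for
illustration:

    # Transfer time for a 64KB request at a 150MB/s media rate, in ms:
    echo 'scale=3; (64 * 1024 * 1000) / (150 * 1000000)' | bc
    # => ~0.437 ms, versus roughly 8.5 ms average seek plus 4.2 ms
    # rotational latency (half a turn at 7200 rpm), so positioning is
    # well over 90% of a random I/O.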
Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)
On Fri, Sep 23, 2016 at 11:49:50AM +0200, Edgar Fuß wrote:
> > The whole point of tagged queueing is to let you *not* set [the write
> > cache] bit in the mode pages and still get good performance.
> 
> I don't get that. My understanding was that TCQ allowed the drive
> to re-order commands within the bounds described by the tags. With
> the write cache disabled, all write commands must hit stable
> storage before being reported completed. So what's the point of
> tagging with cacheing disabled?

You can have more than one in flight at a time. Typically the more you
can manage to have pending at once, the better the performance,
especially with SSDs.

-- 
David A. Holland
dholl...@netbsd.org
FUA and TCQ (was: Plan: journalling fixes for WAPBL)
> The whole point of tagged queueing is to let you *not* set [the write
> cache] bit in the mode pages and still get good performance.

I don't get that. My understanding was that TCQ allowed the drive to
re-order commands within the bounds described by the tags. With the
write cache disabled, all write commands must hit stable storage
before being reported completed. So what's the point of tagging with
cacheing disabled?
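For reference, the write-cache bit under discussion can be inspected
and toggled on NetBSD with dkctl(8); a sketch, assuming the disk is
sd0:

    # Show whether the read and write caches are enabled.
    dkctl sd0 getcache

    # Enable only the read cache, i.e. disable the write cache, so
    # writes must reach stable storage before they are reported complete.
    dkctl sd0 setcache r

    # Re-enable both caches.
    dkctl sd0 setcache rw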