On Mon, 30 Dec 2019, Warner Losh wrote:
On Mon, Dec 30, 2019 at 12:55 PM Alexander Motin <[email protected]> wrote:
On 30.12.2019 12:02, Alexey Dokuchaev wrote:
> On Mon, Dec 30, 2019 at 08:55:14AM -0700, Warner Losh wrote:
>> On Mon, Dec 30, 2019, 5:32 AM Alexey Dokuchaev wrote:
>>> On Sun, Dec 29, 2019 at 09:16:04PM +0000, Alexander Motin wrote:
>>>> New Revision: 356185
>>>> URL: https://svnweb.freebsd.org/changeset/base/356185
>>>>
>>>> Log:
>>>>   Remove GEOM_SCHED class and gsched tool.
>>>>   [...]
>>>
>>> Wow, that was unexpected, I use it on all my machines' HDD drives.
>>> Is there a planned replacement, or should I create a port for the
>>> GEOM_SCHED class and gsched(8) tool?
>>
>> How much of a performance improvement do you see with it?
>>
>> There have been no tweaks to this geom in years and years.  It was tuned
>> for 10-year-old hard drives and never retuned for anything newer.
>
> Well, hard drives essentially haven't changed since then, still being the
> same rotating media. :)
At least some papers about gsched I read mention adX devices, which means
the old ATA stack and no NCQ.  It can be quite a significant change to let
the HDD do its own scheduling.  Also, about a year ago in r335066 Warner
added the sysctl debug.bioq_batchsize, which, if set to a non-zero value,
may, I think, improve fairness between several processes; I'm just not
sure why it was never enabled.
I never enabled it because I never had a good batch size as the default.
I'm guessing it's somewhere on the order of 2 times the queue size in
hardware, but with modern drives I think phk might be right and that
disabling disksort entirely might be optimal, or close to optimal.
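
The batching idea looks roughly like the sketch below (names are made up;
this is not the subr_disk.c code): sort at most a batch's worth of
consecutive insertions, then force one FIFO insertion so a distant request
cannot be starved forever.

/*
 * Hypothetical sketch of batched disksort.  Not the in-tree
 * implementation; names and structure are illustrative only.
 */
struct sort_queue {
        int batchsize;          /* 0 = always sort (classic disksort) */
        int sorted_in_batch;    /* sorted inserts since last tail insert */
};

/* Return 1 to insertion-sort this request, 0 to append it FIFO-style. */
static int
want_sorted_insert(struct sort_queue *q)
{
        if (q->batchsize == 0)
                return (1);
        if (q->sorted_in_batch >= q->batchsize) {
                q->sorted_in_batch = 0;
                return (0);
        }
        q->sorted_in_batch++;
        return (1);
}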
>> And when I played with it a few years ago, I saw no improvements...
>
> Admittedly, I've only done some tests, back in the 8.4 times when I first
> started using it.  Fair point, though; I should redo them again.
I'm sorry to create a regression for you, if there really is one.  As I
have written, I don't have that much against the scheduler part itself as
against the accumulated technical debt and the way the integration is done,
such as the mechanism of live insertion, etc.  Without unmapped I/O and
direct dispatch I bet it must be quite slow on bigger systems, which is why
I doubted anybody really used it.
> Is there a planned replacement, or should I create a port for the
> GEOM_SCHED class and gsched(8) tool?
I wasn't planning a replacement.  And moving it to ports would be a
problem, since in the process I removed a few capabilities critical for it:
nstart/nend for live insertion and BIO classification for scheduling.  I
don't mind returning the latter if there appears to be a need; it is only
the former I am strongly against.  But if somebody would like to
reimplement it, maybe it would be better to consider merging it with
Warner's CAM I/O scheduler?  That one at least knows about device queue
depth, etc.  We could bring back the BIO classification to be used by the
CAM scheduler instead, if needed.
I'd be keen on helping anybody that wants to experiment with hard disk
drive optimizations in iosched.  My doodles to make it better showed no
early improvements, so I've not tried to bring them into the tree.
However, our workload is basically 'large block random', which isn't the
same as others', and others might have a workload that could benefit.
I've found a marginal improvement from the read-over-write bias in our
workload, and another marginal improvement from favoring metadata reads
over normal reads (because, for us, sendfile blocks on some of these
reads, but others may see no improvement).  I'm working to clean up the
metadata read stuff to get it into the tree.  I've not tested it on ZFS,
though, so there will be no ZFS metadata labeling in the initial commit.

So I like the idea, and would love to work with someone that needs it
and/or whose workloads can be improved by it.
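
To make the bias concrete, here is a minimal, hypothetical sketch (not the
cam_iosched.c code; every name below is made up) of picking the next
request type while bounding how long writes can be passed over:

/* Illustrative only: read-over-write and metadata-read bias. */
enum qsel { Q_META_READ, Q_READ, Q_WRITE, Q_NONE };

struct sched_state {
        int meta_reads_pending; /* reads flagged as metadata */
        int reads_pending;      /* normal reads */
        int writes_pending;     /* writes */
        int writes_starved;     /* times writes were passed over */
        int max_write_starve;   /* e.g. 4: bound write starvation */
};

static enum qsel
pick_next(struct sched_state *s)
{
        /* Bound the bias: never starve writes indefinitely. */
        if (s->writes_pending > 0 &&
            s->writes_starved >= s->max_write_starve) {
                s->writes_starved = 0;
                return (Q_WRITE);
        }
        /* Favor metadata reads, then normal reads, over writes. */
        if (s->meta_reads_pending > 0) {
                s->writes_starved++;
                return (Q_META_READ);
        }
        if (s->reads_pending > 0) {
                s->writes_starved++;
                return (Q_READ);
        }
        if (s->writes_pending > 0) {
                s->writes_starved = 0;
                return (Q_WRITE);
        }
        return (Q_NONE);
}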
The biggest issue I have found with drive sorting and traditional elevator
algorithms is that it is not latency limiting.  We have other problems at
higher layers, where we schedule too many writes simultaneously, that
contribute substantially to I/O latency.  Also, read-after-writes are
blocked in the buffer cache while a senseless number of buffers are queued
and locked.
An algorithm I have found effective, and have implemented at least twice,
is to estimate I/O time and then give a maximum sort latency.  For many
drives you have to go further and starve them for I/O until they complete
a particularly long-running operation, or they can keep deciding to sort
something else out indefinitely if the I/O you add to the queue is
preferable.  The basic notion is to give a boundary, perhaps 100-200ms,
for reads and usually twice that or more for writes.  You can sort I/O
within batches of that size.  You might violate the batch if a directly
adjacent block is scheduled and you can concatenate them into a single
transfer.
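
As a rough illustration of that batching (all names and the service-time
model below are made up, and the adjacent-block concatenation is omitted):
collect requests into a batch until the estimated service time reaches the
bound, sort only within that batch, and dispatch it before starting the
next one.

/*
 * Illustrative sketch of latency-bounded sorting: sort only within a
 * batch whose estimated total service time stays under a deadline.
 */
#include <stdlib.h>

struct req {
        unsigned long long offset;      /* LBA or byte offset */
        unsigned int length;            /* bytes */
        int is_write;
};

/* Crude per-request service time estimate, in microseconds. */
static unsigned int
est_service_us(const struct req *r)
{
        /* ~8 ms average seek + rotation, plus ~100 MB/s transfer. */
        return (8000 + r->length / 100);
}

static int
cmp_offset(const void *a, const void *b)
{
        const struct req *ra = a, *rb = b;

        return ((ra->offset > rb->offset) - (ra->offset < rb->offset));
}

/*
 * Take requests from the incoming FIFO until the batch's estimated
 * time hits the bound (e.g. 100-200 ms for reads, 2x for writes),
 * then sort just that batch.  Returns the number of requests batched.
 */
static int
build_batch(const struct req *fifo, int nfifo, struct req *batch,
    unsigned int bound_us)
{
        unsigned int total_us = 0;
        int n = 0;

        while (n < nfifo) {
                unsigned int t = est_service_us(&fifo[n]);

                if (n > 0 && total_us + t > bound_us)
                        break;
                total_us += t;
                batch[n] = fifo[n];
                n++;
        }
        qsort(batch, n, sizeof(batch[0]), cmp_offset);
        return (n);
}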
You also have to consider whether the drive has a write cache enabled or
not and whether the filesystem or application is going to sync the disk.
Many SATA drives want an idle queue when they sync for best behavior.  You
probably also want a larger write queue for uncached writes, but preferably
not the entire drive queue.  Cached writes eventually cause stalls on
flush, and too many in the queue just hold up queue space, while they
normally complete so quickly that a deep queue depth is not important.
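
A small, hypothetical sketch of that split (placeholder numbers, not tuned
values, and not code from any tree): cap cached writes at a shallow depth
and cap uncached writes below the full device queue so reads can still get
in.

/* Illustrative only: per-type in-flight caps on top of a device queue. */
struct devq_limits {
        int dev_queue_depth;    /* e.g. 32 NCQ tags */
        int inflight_reads;
        int inflight_writes;
        int write_cache_enabled;
};

/* May another write be dispatched right now? */
static int
may_dispatch_write(const struct devq_limits *q)
{
        int cap;

        if (q->write_cache_enabled) {
                /*
                 * Cached writes complete almost immediately; a deep
                 * queue of them only ties up tags and makes the next
                 * flush stall longer.  Keep them shallow.
                 */
                cap = 4;
        } else {
                /*
                 * Uncached writes benefit from a larger queue, but not
                 * the whole device queue, so reads can still get in.
                 */
                cap = (3 * q->dev_queue_depth) / 4;
        }
        return (q->inflight_reads + q->inflight_writes < q->dev_queue_depth &&
            q->inflight_writes < cap);
}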
Elements of this are also useful on SSDs where you want to manage latency
and queue depth.
I suspect the drive queue is indeed preferable to the simple
implementations we've had in tree.
Thanks,
Jeff