Re: Blocking I/O, async I/O and io_uring
On Tue, 8 Dec 2020 at 15:04, Andres Freund wrote:

> Hi,
>
> On 2020-12-08 04:24:44 +, tsunakawa.ta...@fujitsu.com wrote:
> > I'm looking forward to this from the async+direct I/O, since the
> > throughput of some write-heavy workload decreased by half or more
> > during checkpointing (due to fsync?)
>
> Depends on why that is. The most common cause, I think, is that your WAL
> volume increases drastically just after a checkpoint starts, because
> initially all page modifications will trigger full-page writes. There's
> a significant slowdown even if you prevent the checkpointer from doing
> *any* writes at that point. I got the WAL AIO stuff to the point that I
> see a good bit of speedup at high WAL volumes, and I see it helping in
> this scenario.
>
> There's of course also the issue that checkpoint writes cause other IO
> (including WAL writes) to slow down and, importantly, cause a lot of
> jitter leading to unpredictable latencies. I've seen some good and some
> bad results around this with the patch, but there's a bunch of TODOs to
> resolve before delving deeper really makes sense (the IO depth control
> is not good enough right now).
>
> A third issue is that sometimes the checkpointer can't really keep up -
> and that I think I've seen pretty clearly addressed by the patch. I have
> managed to get to ~80% of my NVMe disk's top write speed (> 2.5GB/s) by
> the checkpointer, and I think I know what to do for the remainder.

Thanks for explaining this. I'm really glad you're looking into it.

If I get the chance I'd like to try to apply some wait-analysis and
blocking-stats tooling to it. I'll report back if I make any progress
there.
Re: Blocking I/O, async I/O and io_uring
On 2020/12/08 11:55, Craig Ringer wrote:
> Hi all
>
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.
>
> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined
> manner, with or without O_DIRECT. Basically async I/O with usable
> buffered I/O and fsync support. It has ordering support which is really
> important for us. This should be on our radar.
>
> The main barriers to benefiting from linux-aio based async I/O in
> postgres in the past have been its reliance on direct I/O, the various
> kernel-version quirks, platform portability, and its
> maybe-async-except-when-it's-randomly-not nature. The kernel version and
> portability remain an issue with io_uring, so it's not like this is
> something we can pivot over to completely. But we should probably take a
> closer look at it.
>
> PostgreSQL spends a huge amount of time waiting, doing nothing, for
> blocking I/O. If we can improve that then we could potentially realize
> some major increases in I/O utilization, especially for bigger, less
> concurrent workloads. The most obvious candidates to benefit would be
> redo, logical apply, and bulk loading.
>
> But I have no idea how to even begin to fit this into PostgreSQL's
> executor pipeline. Almost all PostgreSQL's code is
> synchronous-blocking-imperative in nature, with a push/pull executor
> pipeline. It seems to have been recognised for some time that this is
> increasingly hurting our performance and scalability as platforms become
> more and more parallel.
>
> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc)
> we have to be able to dispatch I/O and do something else while we wait
> for the results. So we need the ability to pipeline the executor and
> pipeline redo.
>
> I thought I'd start the discussion on this and see where we can go with
> it. What incremental steps can be done to move us toward parallelisable
> I/O without having to redesign everything?
>
> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.
>
> Thoughts anybody?

I was wondering if async I/O might be helpful for the performance
improvement of walreceiver.

In physical replication, walreceiver receives, writes and fsyncs WAL data.
It also does tasks like keepalives. Since walreceiver is a single process,
currently it cannot do other tasks while fsyncing WAL to the disk, for
example. OTOH, if walreceiver could do other tasks even while fsyncing WAL
by using async I/O, ISTM that it might improve the performance of
walreceiver.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Blocking I/O, async I/O and io_uring
Hi,

On 2020-12-08 04:24:44 +, tsunakawa.ta...@fujitsu.com wrote:
> I'm looking forward to this from the async+direct I/O, since the
> throughput of some write-heavy workload decreased by half or more
> during checkpointing (due to fsync?)

Depends on why that is. The most common cause, I think, is that your WAL
volume increases drastically just after a checkpoint starts, because
initially all page modifications will trigger full-page writes. There's a
significant slowdown even if you prevent the checkpointer from doing *any*
writes at that point. I got the WAL AIO stuff to the point that I see a
good bit of speedup at high WAL volumes, and I see it helping in this
scenario.

There's of course also the issue that checkpoint writes cause other IO
(including WAL writes) to slow down and, importantly, cause a lot of
jitter leading to unpredictable latencies. I've seen some good and some
bad results around this with the patch, but there's a bunch of TODOs to
resolve before delving deeper really makes sense (the IO depth control is
not good enough right now).

A third issue is that sometimes the checkpointer can't really keep up -
and that I think I've seen pretty clearly addressed by the patch. I have
managed to get to ~80% of my NVMe disk's top write speed (> 2.5GB/s) by
the checkpointer, and I think I know what to do for the remainder.

> Would you mind sharing any preliminary results on this if you have
> something?

I ran numbers at some point, but since then enough has changed (including
many correctness issues fixed) that they don't seem really relevant
anymore. I'll try to include some in the post I'm planning to do in a few
weeks.

Greetings,

Andres Freund
Re: Blocking I/O, async I/O and io_uring
Hi,

On 2020-12-08 13:01:38 +0800, Craig Ringer wrote:
> Have you done much bpf / systemtap / perf based work on measurement and
> tracing of latencies etc? If not that's something I'd be keen to help
> with. I've mostly been using systemtap so far but I'm trying to pivot
> over to bpf.

Not much - there are still so many low-hanging fruits and architectural
things to finish that it didn't yet seem pressing.

> > I've got asynchronous writing of WAL mostly working, but need to
> > redesign the locking a bit further. Right now it's a win in some
> > cases, but not others. The latter to a significant degree due to
> > unnecessary blocking
>
> That's where io_uring's I/O ordering operations looked interesting. But
> I haven't looked closely enough to see if they're going to help us with
> I/O ordering in a multiprocessing architecture like postgres.

The ordering ops aren't quite powerful enough to be a huge boon
performance-wise (yet). They can cut down on syscall and intra-process
context-switch overhead to some degree, but otherwise it's not different
from userspace submitting another request upon receiving a completion.

> In an ideal world we could tell the kernel about WAL-to-heap I/O
> dependencies and even let it apply WAL then heap changes out-of-order so
> long as they didn't violate any ordering constraints we specify between
> particular WAL records or between WAL writes and their corresponding
> heap blocks. But I don't know if the io_uring interface is that capable.

It's not. And that kind of dependency inference wouldn't be cheap on the
PG side either. I don't think it'd help that much for WAL apply anyway.
You need read-ahead of the WAL to avoid unnecessary waits for a lot of
records anyway. And the writes during WAL apply are mostly pretty
asynchronous (mainly writeback during buffer replacement).

An imo considerably more interesting case is avoiding blocking on a WAL
flush when needing to write a page out in an OLTPish workload. But I can
think of more efficient ways there too.

> How feasible do you think it'd be to take it a step further and
> structure redo as a pipelined queue, where redo calls enqueue I/O
> operations and completion handlers then return immediately? Everything
> still goes to disk in the order it's enqueued, and the callbacks will be
> invoked in order, so they can update appropriate shmem state etc. Since
> there's no concurrency during redo, it should be *much* simpler than
> normal user backend operations where we have all the tight coordination
> of buffer management, WAL write ordering, PGXACT and PGPROC, the clog,
> etc.

I think it'd be a fairly massive increase in complexity. And I don't see
a really large payoff: once you have real read-ahead in the WAL there's
really not much synchronous IO left. What am I missing?

Greetings,

Andres Freund
Re: Blocking I/O, async I/O and io_uring
On Tue, 8 Dec 2020 at 12:02, Andres Freund wrote:

> Hi,
>
> On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> > A new kernel API called io_uring has recently come to my attention. I
> > assume some of you (Andres?) have been following it for a while.
>
> Yea, I've spent a *lot* of time working on AIO support, utilizing
> io_uring. Recently Thomas also joined in the fun. I've given two talks
> referencing it (last pgcon, last pgday brussels), but otherwise I've not
> yet written much about it. Things aren't *quite* right yet
> architecturally, but I think we're getting there.

That's wonderful. Thank you. I'm badly behind on the conference circuit
due to geographic isolation and small children. I'll hunt up your talks.

> The current state is at https://github.com/anarazel/postgres/tree/aio
> (but it's not a very clean history at the moment).

Fantastic!

Have you done much bpf / systemtap / perf based work on measurement and
tracing of latencies etc? If not that's something I'd be keen to help
with. I've mostly been using systemtap so far but I'm trying to pivot over
to bpf.

I hope to submit a big tracepoints patch set for PostgreSQL soon to better
expose our wait points and latencies, improve visibility of blocking, and
help make activity traceable through all the stages of processing. I'll
Cc you when I do.

> > io_uring appears to offer a way to make system calls including reads,
> > writes, fsync()s, and more in a non-blocking, batched and pipelined
> > manner, with or without O_DIRECT. Basically async I/O with usable
> > buffered I/O and fsync support. It has ordering support which is
> > really important for us.
>
> My results indicate that we really want to have, optional & not enabled
> by default of course, O_DIRECT support. We just can't benefit fully from
> modern SSDs otherwise. Buffered is also important, of course.

Even more so for NVDRAM, Optane and all that, where zero-copy and low
context-switch counts become important too. We're a long way from that
being a priority but it's still not to be dismissed.

> I'm pretty sure that I've got the basics of this working pretty well. I
> don't think the executor architecture is as big an issue as you seem to
> think. There are further benefits that could be unlocked if we had a
> more flexible executor model (imagine switching between different parts
> of the query whenever blocked on IO - can't do that due to the stack
> right now).

Yep, that's what I'm talking about being an issue. Blocked on an index
read? Move on to the next tuple and come back when the index read is done.

I really like what I see of the io_uring architecture so far. It's ideal
for callback-based event-driven flow control. But that doesn't fit
postgres well for the executor. It's better for redo etc.

> The way it currently works is that things like sequential scans, vacuum,
> etc use a prefetching helper which will try to use AIO to read ahead of
> the next needed block. That helper uses callbacks to determine the next
> needed block, which e.g. vacuum uses to skip over all-visible/frozen
> blocks. There's plenty of other places that should use that helper, but
> we already can get considerably higher throughput for seqscans and
> vacuum on both very fast local storage and high-latency cloud storage.
>
> Similarly, for writes there's a small helper to manage a write-queue of
> configurable depth, which currently is used by checkpointer and bgwriter
> (but should be used in more places). Especially with direct IO,
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.

Sure sounds like a useful interim step. That's great.

> I've got asynchronous writing of WAL mostly working, but need to
> redesign the locking a bit further. Right now it's a win in some cases,
> but not others. The latter to a significant degree due to unnecessary
> blocking

That's where io_uring's I/O ordering operations looked interesting. But I
haven't looked closely enough to see if they're going to help us with I/O
ordering in a multiprocessing architecture like postgres.

In an ideal world we could tell the kernel about WAL-to-heap I/O
dependencies and even let it apply WAL then heap changes out-of-order so
long as they didn't violate any ordering constraints we specify between
particular WAL records or between WAL writes and their corresponding heap
blocks. But I don't know if the io_uring interface is that capable.

I did some basic experiments a while ago with using write barriers between
WAL records and heap writes instead of fsync()ing, but as you note, the
increased blocking and reduction in the kernel's ability to do I/O
reordering is generally worse than the costs of the fsync()s we do now.

> > I'm thinking that redo is probably a good first candidate. It doesn't
> > depend on the guts of the executor. It is much less sensitive to
> > ordering between operations in shmem and on disk since it runs in the
> > startup process. And it hurts REALLY BADLY from its single-threaded
> > blocking approach to I/O - as shown by an extension written by
> > 2ndQuadrant that can double redo performance by doing read-ahead on
> > btree pages that will soon be needed.
RE: Blocking I/O, async I/O and io_uring
From: Andres Freund
> Especially with direct IO, checkpointing can be a lot faster *and* less
> impactful on the "regular" load.

I'm looking forward to this from the async+direct I/O, since the
throughput of some write-heavy workload decreased by half or more during
checkpointing (due to fsync?)

Would you mind sharing any preliminary results on this if you have
something?

Regards
Takayuki Tsunakawa
Re: Blocking I/O, async I/O and io_uring
Hi,

On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Yea, I've spent a *lot* of time working on AIO support, utilizing
io_uring. Recently Thomas also joined in the fun. I've given two talks
referencing it (last pgcon, last pgday brussels), but otherwise I've not
yet written much about it. Things aren't *quite* right yet
architecturally, but I think we're getting there.

Thomas is working on making the AIO infrastructure portable (a worker
based fallback, posix AIO support for freebsd & OSX). Once that's done,
and some of the architectural things are resolved, I plan to write a long
email about what I think the right design is, and where I am at.

The current state is at https://github.com/anarazel/postgres/tree/aio
(but it's not a very clean history at the moment).

There's currently no windows AIO support, but it shouldn't be too hard to
add. My preliminary look indicates that we'd likely have to use overlapped
IO with WaitForMultipleObjects(), not IOCP, since we need to be able to
handle latches etc, which seems harder with IOCP. But perhaps we can do
something using the signal handling emulation posting events onto IOCP
instead.

> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined
> manner, with or without O_DIRECT. Basically async I/O with usable
> buffered I/O and fsync support. It has ordering support which is really
> important for us.

My results indicate that we really want to have, optional & not enabled by
default of course, O_DIRECT support. We just can't benefit fully from
modern SSDs otherwise. Buffered is also important, of course.

> But I have no idea how to even begin to fit this into PostgreSQL's
> executor pipeline. Almost all PostgreSQL's code is
> synchronous-blocking-imperative in nature, with a push/pull executor
> pipeline. It seems to have been recognised for some time that this is
> increasingly hurting our performance and scalability as platforms become
> more and more parallel.
>
> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc)
> we have to be able to dispatch I/O and do something else while we wait
> for the results. So we need the ability to pipeline the executor and
> pipeline redo.
>
> I thought I'd start the discussion on this and see where we can go with
> it. What incremental steps can be done to move us toward parallelisable
> I/O without having to redesign everything?

I'm pretty sure that I've got the basics of this working pretty well. I
don't think the executor architecture is as big an issue as you seem to
think. There are further benefits that could be unlocked if we had a more
flexible executor model (imagine switching between different parts of the
query whenever blocked on IO - can't do that due to the stack right now).

The way it currently works is that things like sequential scans, vacuum,
etc use a prefetching helper which will try to use AIO to read ahead of
the next needed block. That helper uses callbacks to determine the next
needed block, which e.g. vacuum uses to skip over all-visible/frozen
blocks. There's plenty of other places that should use that helper, but we
already can get considerably higher throughput for seqscans and vacuum on
both very fast local storage and high-latency cloud storage.

Similarly, for writes there's a small helper to manage a write-queue of
configurable depth, which currently is used by checkpointer and bgwriter
(but should be used in more places). Especially with direct IO,
checkpointing can be a lot faster *and* less impactful on the "regular"
load.

I've got asynchronous writing of WAL mostly working, but need to redesign
the locking a bit further. Right now it's a win in some cases, but not
others. The latter to a significant degree due to unnecessary blocking.

> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

Thomas has a patch for prefetching during WAL apply. It currently uses
posix_fadvise(), but he took care that it'd be fairly easy to rebase it
onto "real" AIO. Most of the changes necessary are pretty independent of
posix_fadvise vs aio.

Greetings,

Andres Freund
Re: Blocking I/O, async I/O and io_uring
On 12/8/20 3:55 AM, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Andres did a talk on this at FOSDEM PGDay earlier this year. You can see
his slides below, but since they are from January things might have
changed since then.

https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

Andreas
Re: Blocking I/O, async I/O and io_uring
On Tue, Dec 8, 2020 at 3:56 PM Craig Ringer wrote:
> I thought I'd start the discussion on this and see where we can go with
> it. What incremental steps can be done to move us toward parallelisable
> I/O without having to redesign everything?
>
> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

About the redo suggestion: https://commitfest.postgresql.org/31/2410/
does exactly that! It currently uses POSIX_FADV_WILLNEED because that's
what PrefetchSharedBuffer() does, but when combined with a "real AIO"
patch set (see earlier threads and conference talks on this by Andres)
and a few small tweaks to control batching of I/O submissions, it does
exactly what you're describing. I tried to keep the WAL prefetcher
project entirely disentangled from the core AIO work, though, hence the
"poor man's AIO" for now.
Re: Blocking I/O, async I/O and io_uring
References to get things started:

* https://lwn.net/Articles/810414/
* https://unixism.net/loti/what_is_io_uring.html
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

You'll probably notice how this parallels my sporadic activities around
pipelining in other areas, and the PoC libpq pipelining patch I sent in a
few years ago.
Blocking I/O, async I/O and io_uring
Hi all

A new kernel API called io_uring has recently come to my attention. I
assume some of you (Andres?) have been following it for a while.

io_uring appears to offer a way to make system calls including reads,
writes, fsync()s, and more in a non-blocking, batched and pipelined
manner, with or without O_DIRECT. Basically async I/O with usable buffered
I/O and fsync support. It has ordering support which is really important
for us. This should be on our radar.

The main barriers to benefiting from linux-aio based async I/O in postgres
in the past have been its reliance on direct I/O, the various
kernel-version quirks, platform portability, and its
maybe-async-except-when-it's-randomly-not nature. The kernel version and
portability remain an issue with io_uring, so it's not like this is
something we can pivot over to completely. But we should probably take a
closer look at it.

PostgreSQL spends a huge amount of time waiting, doing nothing, for
blocking I/O. If we can improve that then we could potentially realize
some major increases in I/O utilization, especially for bigger, less
concurrent workloads. The most obvious candidates to benefit would be
redo, logical apply, and bulk loading.

But I have no idea how to even begin to fit this into PostgreSQL's
executor pipeline. Almost all PostgreSQL's code is
synchronous-blocking-imperative in nature, with a push/pull executor
pipeline. It seems to have been recognised for some time that this is
increasingly hurting our performance and scalability as platforms become
more and more parallel.

To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc)
we have to be able to dispatch I/O and do something else while we wait for
the results. So we need the ability to pipeline the executor and pipeline
redo.

I thought I'd start the discussion on this and see where we can go with
it. What incremental steps can be done to move us toward parallelisable
I/O without having to redesign everything?

I'm thinking that redo is probably a good first candidate. It doesn't
depend on the guts of the executor. It is much less sensitive to ordering
between operations in shmem and on disk since it runs in the startup
process. And it hurts REALLY BADLY from its single-threaded blocking
approach to I/O - as shown by an extension written by 2ndQuadrant that can
double redo performance by doing read-ahead on btree pages that will soon
be needed.

Thoughts anybody?