Re: Blocking I/O, async I/O and io_uring
On Tue, 8 Dec 2020 at 15:04, Andres Freund wrote:

> Hi,
>
> On 2020-12-08 04:24:44 +, tsunakawa.ta...@fujitsu.com wrote:
> > I'm looking forward to this from the async+direct I/O, since the
> > throughput of some write-heavy workload decreased by half or more
> > during checkpointing (due to fsync?)
>
> Depends on why that is. The most common cause, I think, is that your WAL
> volume increases drastically just after a checkpoint starts, because
> initially all page modifications will trigger full-page writes. There's
> a significant slowdown even if you prevent the checkpointer from doing
> *any* writes at that point. I got the WAL AIO stuff to the point that I
> see a good bit of speedup at high WAL volumes, and I see it helping in
> this scenario.
>
> There's of course also the issue that checkpoint writes cause other IO
> (including WAL writes) to slow down and, importantly, cause a lot of
> jitter leading to unpredictable latencies. I've seen some good and some
> bad results around this with the patch, but there's a bunch of TODOs to
> resolve before delving deeper really makes sense (the IO depth control
> is not good enough right now).
>
> A third issue is that sometimes the checkpointer can't really keep up -
> and that I think I've seen pretty clearly addressed by the patch. I have
> managed to get to ~80% of my NVMe disk's top write speed (> 2.5GB/s) by
> the checkpointer, and I think I know what to do for the remainder.

Thanks for explaining this. I'm really glad you're looking into it.

If I get the chance I'd like to try to apply some wait-analysis and
blocking-stats tooling to it. I'll report back if I make any progress
there.
Re: Blocking I/O, async I/O and io_uring
On 2020/12/08 11:55, Craig Ringer wrote:
> Hi all
>
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.
>
> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined
> manner, with or without O_DIRECT. Basically async I/O with usable
> buffered I/O and fsync support. It has ordering support which is really
> important for us. This should be on our radar.
>
> The main barriers to benefiting from linux-aio based async I/O in
> postgres in the past have been its reliance on direct I/O, the various
> kernel-version quirks, platform portability, and its
> maybe-async-except-when-it's-randomly-not nature. The kernel version and
> portability remain an issue with io_uring, so it's not like this is
> something we can pivot over to completely. But we should probably take a
> closer look at it.
>
> PostgreSQL spends a huge amount of time waiting, doing nothing, for
> blocking I/O. If we can improve that then we could potentially realize
> some major increases in I/O utilization, especially for bigger, less
> concurrent workloads. The most obvious candidates to benefit would be
> redo, logical apply, and bulk loading.
>
> But I have no idea how to even begin to fit this into PostgreSQL's
> executor pipeline. Almost all PostgreSQL's code is
> synchronous-blocking-imperative in nature, with a push/pull executor
> pipeline. It seems to have been recognised for some time that this is
> increasingly hurting our performance and scalability as platforms become
> more and more parallel.
>
> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc)
> we have to be able to dispatch I/O and do something else while we wait
> for the results. So we need the ability to pipeline the executor and
> pipeline redo.
>
> I thought I'd start the discussion on this and see where we can go with
> it. What incremental steps can be done to move us toward parallelisable
> I/O without having to redesign everything?
>
> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.
>
> Thoughts anybody?

I was wondering if async I/O might be helpful for the performance
improvement of walreceiver.

In physical replication, walreceiver receives, writes and fsyncs WAL data.
It also does tasks like keepalives. Since walreceiver is a single process,
currently it cannot do other tasks while fsyncing WAL to the disk, for
example. OTOH, if walreceiver could do other tasks even while fsyncing WAL
by using async I/O, ISTM that it might improve the performance of
walreceiver.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Blocking I/O, async I/O and io_uring
Hi,

On 2020-12-08 04:24:44 +, tsunakawa.ta...@fujitsu.com wrote:
> I'm looking forward to this from the async+direct I/O, since the
> throughput of some write-heavy workload decreased by half or more
> during checkpointing (due to fsync?)

Depends on why that is. The most common cause, I think, is that your WAL
volume increases drastically just after a checkpoint starts, because
initially all page modifications will trigger full-page writes. There's a
significant slowdown even if you prevent the checkpointer from doing *any*
writes at that point. I got the WAL AIO stuff to the point that I see a
good bit of speedup at high WAL volumes, and I see it helping in this
scenario.

There's of course also the issue that checkpoint writes cause other IO
(including WAL writes) to slow down and, importantly, cause a lot of
jitter leading to unpredictable latencies. I've seen some good and some
bad results around this with the patch, but there's a bunch of TODOs to
resolve before delving deeper really makes sense (the IO depth control is
not good enough right now).

A third issue is that sometimes the checkpointer can't really keep up -
and that I think I've seen pretty clearly addressed by the patch. I have
managed to get to ~80% of my NVMe disk's top write speed (> 2.5GB/s) by
the checkpointer, and I think I know what to do for the remainder.

> Would you mind sharing any preliminary results on this if you have
> something?

I ran numbers at some point, but since then enough has changed (including
many correctness issues fixed) that they don't seem really relevant
anymore. I'll try to include some in the post I'm planning to do in a few
weeks.

Greetings,

Andres Freund
Re: Blocking I/O, async I/O and io_uring
Hi,

On 2020-12-08 13:01:38 +0800, Craig Ringer wrote:
> Have you done much bpf / systemtap / perf based work on measurement and
> tracing of latencies etc? If not that's something I'd be keen to help
> with. I've mostly been using systemtap so far but I'm trying to pivot
> over to bpf.

Not much - there are still so many low-hanging fruits and architectural
things to finish that it didn't yet seem pressing.

> > I've got asynchronous writing of WAL mostly working, but need to
> > redesign the locking a bit further. Right now it's a win in some
> > cases, but not others. The latter to a significant degree due to
> > unnecessary blocking
>
> That's where io_uring's I/O ordering operations looked interesting. But
> I haven't looked closely enough to see if they're going to help us with
> I/O ordering in a multiprocessing architecture like postgres.

The ordering ops aren't quite powerful enough to be a huge boon
performance-wise (yet). They can cut down on syscall and intra-process
context-switch overhead to some degree, but otherwise it's not different
from userspace submitting another request upon receiving a completion.

> In an ideal world we could tell the kernel about WAL-to-heap I/O
> dependencies and even let it apply WAL then heap changes out-of-order so
> long as they didn't violate any ordering constraints we specify between
> particular WAL records or between WAL writes and their corresponding
> heap blocks. But I don't know if the io_uring interface is that capable.

It's not. And that kind of dependency inference wouldn't be cheap on the
PG side either. I don't think it'd help that much for WAL apply anyway.
You need read-ahead of the WAL to avoid unnecessary waits for a lot of
records anyway. And the writes during WAL apply are mostly pretty
asynchronous (mainly writeback during buffer replacement).

An imo considerably more interesting case is avoiding blocking on a WAL
flush when needing to write a page out in an OLTPish workload. But I can
think of more efficient ways there too.

> How feasible do you think it'd be to take it a step further and
> structure redo as a pipelined queue, where redo calls enqueue I/O
> operations and completion handlers then return immediately? Everything
> still goes to disk in the order it's enqueued, and the callbacks will be
> invoked in order, so they can update appropriate shmem state etc. Since
> there's no concurrency during redo, it should be *much* simpler than
> normal user backend operations where we have all the tight coordination
> of buffer management, WAL write ordering, PGXACT and PGPROC, the clog,
> etc.

I think it'd be a fairly massive increase in complexity. And I don't see
a really large payoff: once you have real read-ahead in the WAL there's
really not much synchronous IO left. What am I missing?

Greetings,

Andres Freund
Re: Blocking I/O, async I/O and io_uring
On Tue, 8 Dec 2020 at 12:02, Andres Freund wrote:

> Hi,
>
> On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> > A new kernel API called io_uring has recently come to my attention. I
> > assume some of you (Andres?) have been following it for a while.
>
> Yea, I've spent a *lot* of time working on AIO support, utilizing
> io_uring. Recently Thomas also joined in the fun. I've given two talks
> referencing it (last pgcon, last pgday brussels), but otherwise I've not
> yet written much about it. Things aren't *quite* right yet
> architecturally, but I think we're getting there.

That's wonderful. Thank you. I'm badly behind on the conference circuit
due to geographic isolation and small children. I'll hunt up your talks.

> The current state is at https://github.com/anarazel/postgres/tree/aio
> (but it's not a very clean history at the moment).

Fantastic!

Have you done much bpf / systemtap / perf based work on measurement and
tracing of latencies etc? If not that's something I'd be keen to help
with. I've mostly been using systemtap so far but I'm trying to pivot over
to bpf.

I hope to submit a big tracepoints patch set for PostgreSQL soon to better
expose our wait points and latencies, improve visibility of blocking, and
help make activity traceable through all the stages of processing. I'll
Cc you when I do.

> > io_uring appears to offer a way to make system calls including reads,
> > writes, fsync()s, and more in a non-blocking, batched and pipelined
> > manner, with or without O_DIRECT. Basically async I/O with usable
> > buffered I/O and fsync support. It has ordering support which is
> > really important for us.
>
> My results indicate that we really want to have, optional & not enabled
> by default of course, O_DIRECT support. We just can't benefit fully from
> modern SSDs otherwise. Buffered is also important, of course.

Even more so for NVDRAM, Optane and all that, where zero-copy and low
context-switch counts become important too. We're a long way from that
being a priority but it's still not to be dismissed.

> I'm pretty sure that I've got the basics of this working pretty well. I
> don't think the executor architecture is as big an issue as you seem to
> think. There are further benefits that could be unlocked if we had a
> more flexible executor model (imagine switching between different parts
> of the query whenever blocked on IO - can't do that due to the stack
> right now).

Yep, that's what I'm talking about being an issue. Blocked on an index
read? Move on to the next tuple and come back when the index read is done.

I really like what I see of the io_uring architecture so far. It's ideal
for callback-based event-driven flow control. But that doesn't fit
postgres well for the executor. It's better for redo etc.

> The way it currently works is that things like sequential scans, vacuum,
> etc use a prefetching helper which will try to use AIO to read ahead of
> the next needed block. That helper uses callbacks to determine the next
> needed block, which e.g. vacuum uses to skip over all-visible/frozen
> blocks. There's plenty of other places that should use that helper, but
> we already can get considerably higher throughput for seqscans and
> vacuum on both very fast local storage and high-latency cloud storage.
>
> Similarly, for writes there's a small helper to manage a write-queue of
> configurable depth, which currently is used by checkpointer and bgwriter
> (but should be used in more places). Especially with direct IO,
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.

Sure sounds like a useful interim step. That's great.

> I've got asynchronous writing of WAL mostly working, but need to
> redesign the locking a bit further. Right now it's a win in some cases,
> but not others. The latter to a significant degree due to unnecessary
> blocking

That's where io_uring's I/O ordering operations looked interesting. But I
haven't looked closely enough to see if they're going to help us with I/O
ordering in a multiprocessing architecture like postgres.

In an ideal world we could tell the kernel about WAL-to-heap I/O
dependencies and even let it apply WAL then heap changes out-of-order so
long as they didn't violate any ordering constraints we specify between
particular WAL records or between WAL writes and their corresponding heap
blocks. But I don't know if the io_uring interface is that capable.

I did some basic experiments a while ago with using write barriers between
WAL records and heap writes instead of fsync()ing, but as you note, the
increased blocking and reduction in the kernel's ability to do I/O
reordering is generally worse than the costs of the fsync()s we do now.

> > I'm thinking that redo is probably a good first candidate. It doesn't
> > depend on the guts of the executor. It is much less sensitive to
> > ordering between operations in shmem and on disk since it runs in the
> > startup process. And it hurts REALLY BADLY from its single-threaded
> > blocking approach to I/O - as shown by an extension written by
> > 2ndQuadrant that can double redo performance by doing read-ahead on
> > btree pages that will soon be needed.
RE: Blocking I/O, async I/O and io_uring
From: Andres Freund
> Especially with direct IO, checkpointing can be a lot faster *and* less
> impactful on the "regular" load.

I'm looking forward to this from the async+direct I/O, since the
throughput of some write-heavy workload decreased by half or more during
checkpointing (due to fsync?)

Would you mind sharing any preliminary results on this if you have
something?

Regards
Takayuki Tsunakawa
Re: Blocking I/O, async I/O and io_uring
Hi,

On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Yea, I've spent a *lot* of time working on AIO support, utilizing
io_uring. Recently Thomas also joined in the fun. I've given two talks
referencing it (last pgcon, last pgday brussels), but otherwise I've not
yet written much about it. Things aren't *quite* right yet
architecturally, but I think we're getting there.

Thomas is working on making the AIO infrastructure portable (a worker
based fallback, posix AIO support for freebsd & OSX). Once that's done,
and some of the architectural things are resolved, I plan to write a long
email about what I think the right design is, and where I am at.

The current state is at https://github.com/anarazel/postgres/tree/aio
(but it's not a very clean history at the moment).

There's currently no windows AIO support, but it shouldn't be too hard to
add. My preliminary look indicates that we'd likely have to use overlapped
IO with WaitForMultipleObjects(), not IOCP, since we need to be able to
handle latches etc, which seems harder with IOCP. But perhaps we can do
something using the signal handling emulation posting events onto IOCP
instead.

> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined
> manner, with or without O_DIRECT. Basically async I/O with usable
> buffered I/O and fsync support. It has ordering support which is really
> important for us.

My results indicate that we really want to have, optional & not enabled by
default of course, O_DIRECT support. We just can't benefit fully from
modern SSDs otherwise. Buffered is also important, of course.

> But I have no idea how to even begin to fit this into PostgreSQL's
> executor pipeline. Almost all PostgreSQL's code is
> synchronous-blocking-imperative in nature, with a push/pull executor
> pipeline. It seems to have been recognised for some time that this is
> increasingly hurting our performance and scalability as platforms become
> more and more parallel.
>
> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc)
> we have to be able to dispatch I/O and do something else while we wait
> for the results. So we need the ability to pipeline the executor and
> pipeline redo.
>
> I thought I'd start the discussion on this and see where we can go with
> it. What incremental steps can be done to move us toward parallelisable
> I/O without having to redesign everything?

I'm pretty sure that I've got the basics of this working pretty well. I
don't think the executor architecture is as big an issue as you seem to
think. There are further benefits that could be unlocked if we had a more
flexible executor model (imagine switching between different parts of the
query whenever blocked on IO - can't do that due to the stack right now).

The way it currently works is that things like sequential scans, vacuum,
etc use a prefetching helper which will try to use AIO to read ahead of
the next needed block. That helper uses callbacks to determine the next
needed block, which e.g. vacuum uses to skip over all-visible/frozen
blocks. There's plenty of other places that should use that helper, but we
already can get considerably higher throughput for seqscans and vacuum on
both very fast local storage and high-latency cloud storage.

Similarly, for writes there's a small helper to manage a write-queue of
configurable depth, which currently is used by checkpointer and bgwriter
(but should be used in more places). Especially with direct IO,
checkpointing can be a lot faster *and* less impactful on the "regular"
load.

I've got asynchronous writing of WAL mostly working, but need to redesign
the locking a bit further. Right now it's a win in some cases, but not
others. The latter to a significant degree due to unnecessary blocking.

> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

Thomas has a patch for prefetching during WAL apply. It currently uses
posix_fadvise(), but he took care that it'd be fairly easy to rebase it
onto "real" AIO. Most of the changes necessary are pretty independent of
posix_fadvise vs aio.

Greetings,

Andres Freund
Re: Blocking I/O, async I/O and io_uring
On 12/8/20 3:55 AM, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Andres did a talk on this at FOSDEM PGDay earlier this year. You can see
his slides below, but since they are from January things might have
changed since then.

https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

Andreas
Re: Blocking I/O, async I/O and io_uring
On Tue, Dec 8, 2020 at 3:56 PM Craig Ringer wrote:
> I thought I'd start the discussion on this and see where we can go with
> it. What incremental steps can be done to move us toward parallelisable
> I/O without having to redesign everything?
>
> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

About the redo suggestion: https://commitfest.postgresql.org/31/2410/
does exactly that! It currently uses POSIX_FADV_WILLNEED because that's
what PrefetchSharedBuffer() does, but when combined with a "real AIO"
patch set (see earlier threads and conference talks on this by Andres)
and a few small tweaks to control batching of I/O submissions, it does
exactly what you're describing. I tried to keep the WAL prefetcher
project entirely disentangled from the core AIO work, though, hence the
"poor man's AIO" for now.
Re: Blocking I/O, async I/O and io_uring
References to get things started:

* https://lwn.net/Articles/810414/
* https://unixism.net/loti/what_is_io_uring.html
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

You'll probably notice how this parallels my sporadic activities around
pipelining in other areas, and the PoC libpq pipelining patch I sent in a
few years ago.
Blocking I/O, async I/O and io_uring
Hi all

A new kernel API called io_uring has recently come to my attention. I
assume some of you (Andres?) have been following it for a while.

io_uring appears to offer a way to make system calls including reads,
writes, fsync()s, and more in a non-blocking, batched and pipelined
manner, with or without O_DIRECT. Basically async I/O with usable buffered
I/O and fsync support. It has ordering support which is really important
for us. This should be on our radar.

The main barriers to benefiting from linux-aio based async I/O in postgres
in the past have been its reliance on direct I/O, the various
kernel-version quirks, platform portability, and its
maybe-async-except-when-it's-randomly-not nature. The kernel version and
portability remain an issue with io_uring, so it's not like this is
something we can pivot over to completely. But we should probably take a
closer look at it.

PostgreSQL spends a huge amount of time waiting, doing nothing, for
blocking I/O. If we can improve that then we could potentially realize
some major increases in I/O utilization, especially for bigger, less
concurrent workloads. The most obvious candidates to benefit would be
redo, logical apply, and bulk loading.

But I have no idea how to even begin to fit this into PostgreSQL's
executor pipeline. Almost all PostgreSQL's code is
synchronous-blocking-imperative in nature, with a push/pull executor
pipeline. It seems to have been recognised for some time that this is
increasingly hurting our performance and scalability as platforms become
more and more parallel.

To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc)
we have to be able to dispatch I/O and do something else while we wait for
the results. So we need the ability to pipeline the executor and pipeline
redo.

I thought I'd start the discussion on this and see where we can go with
it. What incremental steps can be done to move us toward parallelisable
I/O without having to redesign everything?

I'm thinking that redo is probably a good first candidate. It doesn't
depend on the guts of the executor. It is much less sensitive to ordering
between operations in shmem and on disk since it runs in the startup
process. And it hurts REALLY BADLY from its single-threaded blocking
approach to I/O - as shown by an extension written by 2ndQuadrant that can
double redo performance by doing read-ahead on btree pages that will soon
be needed.

Thoughts anybody?