Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?
On 1 November 2017 at 11:49, Andres Freund wrote:

> Right. It'd probably be good to be a bit more adaptive here. But it's
> hard to do with posix_fadvise - we'd need an operation that actually
> notifies us of IO completion. If we were using, say, asynchronous
> direct IO, we could initiate the request and regularly check how many
> blocks ahead of the current window are already completed and adjust the
> queue based on that, rather than just firing off fadvises and hoping for
> the best.

In case it's of interest, I did some looking into using Linux's AIO
support in Pg a while ago, when chasing some issues around fsync retries
and handling of I/O errors.

It was a pretty serious dead end; it was clear that fsync support in AIO
is not only incomplete but inconsistent across kernel versions, let
alone other platforms.

But I see your name in the relevant threads, so you know that. To save
others the time, see:

* https://lwn.net/Articles/724198/
* https://lwn.net/Articles/671649/

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
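The adaptive scheme in the quoted paragraph might look roughly like the
sketch below, assuming some I/O interface that reports completions. This is
not PostgreSQL code: the function name adjust_prefetch_distance, the 25% and
75% thresholds, the doubling/halving policy and the distance bounds are all
assumptions invented for the illustration.

```python
def adjust_prefetch_distance(distance, issued_ahead, completed_ahead,
                             min_distance=1, max_distance=1024):
    """Return a new prefetch distance (in blocks).

    issued_ahead    -- requests issued for blocks ahead of the read position
    completed_ahead -- how many of those requests have already completed

    Hypothetical policy: thresholds and bounds are illustrative assumptions.
    """
    if issued_ahead <= 0:
        # Nothing in flight to judge by; just keep the distance in bounds.
        return max(min_distance, min(distance, max_distance))

    done_fraction = completed_ahead / issued_ahead
    if done_fraction < 0.25:
        # Few requests have completed: the reader is likely to wait on I/O,
        # so look further ahead.
        return min(max_distance, distance * 2)
    if done_fraction > 0.75:
        # Nearly everything has completed already: the queue is deeper than
        # this reader needs, so back off.
        return max(min_distance, distance // 2)
    return distance


if __name__ == "__main__":
    print(adjust_prefetch_distance(16, issued_ahead=16, completed_ahead=1))   # 32
    print(adjust_prefetch_distance(16, issued_ahead=16, completed_ahead=15))  # 8
    print(adjust_prefetch_distance(16, issued_ahead=16, completed_ahead=8))   # 16
```

Backing off when requests complete early is a design choice, not a given: a
deeper queue can still help the device reorder and coalesce requests even
when this particular reader gains nothing from it.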
Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?
Hi,

On 2017-10-31 18:47:06 +0100, Tomas Vondra wrote:
> On 10/31/2017 04:48 PM, Greg Stark wrote:
> > On 31 October 2017 at 07:05, Chris Travers wrote:
> >> Hi;
> >>
> >> After Andres's excellent talk at PGConf we tried benchmarking
> >> effective_io_concurrency on some of our servers and found that those
> >> which have a number of NVME storage volumes could not fill the I/O
> >> queue even at the maximum setting (1000).
> >
> > And was the system still i/o bound? If the cpu was 100% busy then
> > perhaps Postgres just can't keep up with the I/O system. It would
> > depend on workload though, if you start many very large sequential
> > scans you may be able to push the i/o system harder.
> >
> > Keep in mind effective_io_concurrency only really affects bitmap
> > index scans (and to a small degree index scans). It works by issuing
> > posix_fadvise() calls for upcoming buffers one by one. That gets
> > multiple spindles active but it's not really going to scale to many
> > thousands of prefetches (and effective_io_concurrency of 1000
> > actually means 7485 prefetches). At some point those i/o are going
> > to start completing before Postgres even has a chance to start
> > processing the data.

Note that even if they finish well before postgres gets around to
looking at the block, you might still be seeing benefits. SSDs benefit
from larger reads, and a deeper queue gives more chances for reordering
/ coalescing of requests. Won't benefit the individual reader, but might
help the overall capacity of the system.

> Yeah, initiating the prefetches is not expensive, but it's not free
> either. So there's a trade-off between time spent on prefetching and
> processing the data.

Right. It'd probably be good to be a bit more adaptive here. But it's
hard to do with posix_fadvise - we'd need an operation that actually
notifies us of IO completion. If we were using, say, asynchronous
direct IO, we could initiate the request and regularly check how many
blocks ahead of the current window are already completed and adjust the
queue based on that, rather than just firing off fadvises and hoping for
the best.

> I believe this may actually be illustrated using Amdahl's law - the I/O
> is the parallel part, and processing the data is the serial part. And no
> matter what you do, the device only has so much bandwidth, which defines
> the maximum possible speedup (compared to "no prefetch" case).

Right.

> Furthermore, the device does not wait for all the I/O requests to be
> submitted - it won't wait for 1000 requests and then go "OMG! There's a
> lot of work to do!" It starts processing the requests as they arrive,
> and some of them will complete before you're done with submitting the
> rest, so you'll never see all the requests in the queue at once.

It'd be interesting to see how much it helps to scale the size of
readahead requests with the distance from the current read iterator.
E.g. if you're less than 16 blocks away from the current head, issue
size 1, up to 32 issue a 2 block request for consecutive blocks.

I suspect it won't help because at least linux's block layer / io
elevator seems quite successful at merging. E.g. for the query:

EXPLAIN ANALYZE SELECT sum(l_quantity) FROM lineitem where l_receiptdate between '1993-05-03' and '1993-08-03';

on a tpc-h scale dataset on my laptop, I see:

Device       r/s   w/s   rMB/s  wMB/s    rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
sda     25702.00  0.00  495.27   0.00  37687.00    0.00  59.45   0.00     5.13     0.00  132.09     19.73      0.00   0.04  100.00

but it'd be worthwhile to see whether doing the merging ourselves allows
for deeper queues.

I think we really should start incorporating explicit prefetching in
more places. Ordered indexscans might actually be one case that's not
too hard to do in a simple manner - whenever at an inner node, prefetch
the leaf nodes below it. We obviously could do better, but that might be
a decent starting point to get some numbers.

Greetings,

Andres Freund
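The request-size idea in this message (1-block requests when fewer than 16
blocks ahead of the read position, 2-block requests up to 32) could be
sketched as below. Only those two thresholds come from the message. The
continued doubling beyond 32 blocks, the max_size cap, and the names
readahead_request_size and plan_requests are assumptions for illustration,
and the sketch only plans requests rather than issuing any I/O.

```python
def readahead_request_size(distance, max_size=8):
    """Blocks per request for a block `distance` blocks ahead of the reader.

    From the message: fewer than 16 blocks ahead -> 1 block, up to 32 -> 2.
    Assumption: the size keeps doubling each time the distance doubles,
    capped at max_size.
    """
    size, threshold = 1, 16
    while distance >= threshold and size < max_size:
        size *= 2
        threshold *= 2
    return size


def plan_requests(blocks, current, max_size=8):
    """Group sorted block numbers into (start_block, n_blocks) requests.

    Only consecutive blocks are merged, and never beyond the size allowed
    at the request's distance from `current`.
    """
    requests = []
    i = 0
    while i < len(blocks):
        start = blocks[i]
        limit = readahead_request_size(start - current, max_size)
        n = 1
        while n < limit and i + n < len(blocks) and blocks[i + n] == start + n:
            n += 1
        requests.append((start, n))
        i += n
    return requests


if __name__ == "__main__":
    wanted = [1, 2, 3, 20, 21, 22, 40, 41, 42, 43, 44]
    print(plan_requests(wanted, current=0))
    # [(1, 1), (2, 1), (3, 1), (20, 2), (22, 1), (40, 4), (44, 1)]
```

Whether merging like this beats what the kernel's block layer already does
is exactly the open question raised above; the sketch only shows what "doing
the merging ourselves" could mean.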
Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?
Hi,

On 10/31/2017 04:48 PM, Greg Stark wrote:
> On 31 October 2017 at 07:05, Chris Travers wrote:
>> Hi;
>>
>> After Andres's excellent talk at PGConf we tried benchmarking
>> effective_io_concurrency on some of our servers and found that those which
>> have a number of NVME storage volumes could not fill the I/O queue even at
>> the maximum setting (1000).
>
> And was the system still i/o bound? If the cpu was 100% busy then
> perhaps Postgres just can't keep up with the I/O system. It would
> depend on workload though, if you start many very large sequential
> scans you may be able to push the i/o system harder.
>
> Keep in mind effective_io_concurrency only really affects bitmap
> index scans (and to a small degree index scans). It works by issuing
> posix_fadvise() calls for upcoming buffers one by one. That gets
> multiple spindles active but it's not really going to scale to many
> thousands of prefetches (and effective_io_concurrency of 1000
> actually means 7485 prefetches). At some point those i/o are going
> to start completing before Postgres even has a chance to start
> processing the data.

Yeah, initiating the prefetches is not expensive, but it's not free
either. So there's a trade-off between time spent on prefetching and
processing the data.

I believe this may actually be illustrated using Amdahl's law - the I/O
is the parallel part, and processing the data is the serial part. And no
matter what you do, the device only has so much bandwidth, which defines
the maximum possible speedup (compared to "no prefetch" case).

Furthermore, the device does not wait for all the I/O requests to be
submitted - it won't wait for 1000 requests and then go "OMG! There's a
lot of work to do!" It starts processing the requests as they arrive,
and some of them will complete before you're done with submitting the
rest, so you'll never see all the requests in the queue at once. And of
course, iostat and other tools only give you "average queue length",
which is mostly determined by the average throughput.

In my experience (on all types of storage, including SSDs and NVMe), the
performance quickly and significantly improves once you start increasing
the value (say, to 8 or 16, maybe 64). And then the gains become much
more modest - not because the device could not handle more, but because
the prefetch/processing ratio reached its optimal value.

But all this is actually per-process; if you can run multiple backends
(particularly when doing bitmap index scans), I'm sure you'll see the
queues being more full.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
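Two numbers in this message can be checked with a little arithmetic. To my
knowledge, PostgreSQL releases of this period (before version 13) turned
effective_io_concurrency = n into a prefetch target of n * H(n), where H(n)
is the n-th harmonic number, which is where 7485 comes from for n = 1000.
The second function below is a plain Amdahl's law bound with I/O as the
parallel part. Both function names are made up here, and the 90% I/O share
is an invented example, not a measurement.

```python
def prefetch_target(io_concurrency):
    """n * H(n), computed as the sum of n/i for i = 1..n."""
    return sum(io_concurrency / i for i in range(1, io_concurrency + 1))


def amdahl_speedup(io_fraction, concurrency):
    """Speedup over the no-prefetch case.

    io_fraction -- share of the runtime spent waiting on I/O (0..1);
                   assumed to be the only part that overlaps
    concurrency -- how many I/O requests are in flight at once
    """
    serial = 1.0 - io_fraction
    return 1.0 / (serial + io_fraction / concurrency)


if __name__ == "__main__":
    print(int(prefetch_target(1000)))  # 7485

    # Hypothetical workload: 90% of the time in I/O, 10% in processing.
    for depth in (1, 8, 16, 64, 1000):
        print(depth, round(amdahl_speedup(0.9, depth), 2))
```

With a 90/10 split the speedup can never reach 10x however deep the queue
is, and most of the achievable gain arrives within the first few dozen
concurrent requests. That matches the pattern described above: large gains
up to 8, 16 or maybe 64, then much more modest ones.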