Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?

2017-10-31 Thread Craig Ringer
On 1 November 2017 at 11:49, Andres Freund wrote:

> Right. It'd probably be good to be a bit more adaptive here. But it's
> hard to do with posix_fadvise - we'd need an operation that actually
> notifies us of IO completion.  If we were using, say, asynchronous
> direct IO, we could initiate the request and regularly check how many
> blocks ahead of the current window are already completed and adjust the
> queue based on that, rather than just firing off fadvises and hoping for
> the best.

In case it's of interest, I did some looking into using Linux's AIO
support in Pg a while ago, when chasing some issues around fsync
retries and handling of I/O errors.

It was a pretty serious dead end; it was clear that fsync support in
AIO is not only incomplete but inconsistent across kernel versions,
let alone other platforms.

But I see your name in the relevant threads, so you know that. To save
others the time, see:

* https://lwn.net/Articles/724198/
* https://lwn.net/Articles/671649/
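
To make the gap concrete, here's a minimal probe I'd sketch it with
(assumes libaio; the temp path and output format are arbitrary). On most
kernels, io_submit() rejects the fsync opcode with EINVAL instead of
queueing it:

/*
 * Hypothetical probe: does this kernel accept IOCB_CMD_FSYNC via AIO?
 * Build with: gcc probe.c -laio
 */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    io_context_t ctx = 0;
    struct iocb  cb;
    struct iocb *cbs[1] = {&cb};
    int          fd = open("/tmp/aio-fsync-probe", O_CREAT | O_RDWR, 0600);
    int          rc;

    if (fd < 0 || io_setup(1, &ctx) < 0)
        return 1;

    io_prep_fsync(&cb, fd);         /* sets IOCB_CMD_FSYNC */
    rc = io_submit(ctx, 1, cbs);    /* -EINVAL here means "unsupported" */
    printf("aio fsync: %s\n", rc == 1 ? "accepted" : strerror(-rc));
    return 0;
}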

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?

2017-10-31 Thread Andres Freund
Hi,

On 2017-10-31 18:47:06 +0100, Tomas Vondra wrote:
> On 10/31/2017 04:48 PM, Greg Stark wrote:
> > On 31 October 2017 at 07:05, Chris Travers wrote:
> >> Hi;
> >>
> >> After Andres's excellent talk at PGConf we tried benchmarking
> >> effective_io_concurrency on some of our servers and found that those
> >> which have a number of NVME storage volumes could not fill the I/O
> >> queue even at the maximum setting (1000).
> >
> > And was the system still i/o bound? If the cpu was 100% busy then
> > perhaps Postgres just can't keep up with the I/O system. It would
> > depend on workload though, if you start many very large sequential
> > scans you may be able to push the i/o system harder.
> >
> > Keep in mind effective_io_concurrency only really affects bitmap
> > index scans (and to a small degree index scans). It works by issuing
> > posix_fadvise() calls for upcoming buffers one by one. That gets
> > multiple spindles active but it's not really going to scale to many
> > thousands of prefetches (and effective_io_concurrency of 1000
> > actually means 7485 prefetches). At some point those I/Os are going
> > to start completing before Postgres even has a chance to start
> > processing the data.

Note that even if they finish well before postgres gets around to
looking at the block, you might still be seeing benefits. SSDs benefit
from larger reads, and a deeper queue gives more chances for reordering
/ coalescing of requests. It won't benefit the individual reader, but
might help the overall capacity of the system.


> Yeah, initiating the prefetches is not expensive, but it's not free
> either. So there's a trade-off between time spent on prefetching and
> processing the data.

Right. It'd probably be good to be a bit more adaptive here. But it's
hard to do with posix_fadvise - we'd need an operation that actually
notifies us of IO completion.  If we were using, say, asynchronous
direct IO, we could initiate the request and regularly check how many
blocks ahead of the current window are already completed and adjust the
queue based on that, rather than just firing off fadvises and hoping for
the best.
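
To sketch the idea (hypothetical, libaio-flavoured code with made-up
names and a deliberately crude double-or-halve policy - not proposed
patch code): reap completions with a zero timeout, check how far ahead
of the scan they landed, and resize the window accordingly:

#include <libaio.h>
#include <time.h>

#define BLCKSZ     8192
#define MAX_EVENTS 64

static int
adapt_prefetch_distance(io_context_t ctx, long cur_block, int distance)
{
    struct io_event events[MAX_EVENTS];
    struct timespec nowait = {0, 0};    /* zero timeout: never block */
    int             done_ahead = 0;
    int             n = io_getevents(ctx, 0, MAX_EVENTS, events, &nowait);

    for (int i = 0; i < n; i++)
    {
        long    blk = events[i].obj->u.c.offset / BLCKSZ;

        if (blk > cur_block)
            done_ahead++;       /* completed before the scan needed it */
    }

    /*
     * Crude policy: if everything reaped completed ahead of the scan,
     * the device is keeping up and the window can grow; if nothing has
     * completed, halve it rather than blindly firing more requests.
     */
    if (n > 0 && done_ahead == n)
        return distance * 2;
    if (n == 0 && distance > 1)
        return distance / 2;
    return distance;
}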


> I believe this may actually be illustrated using Amdahl's law - the I/O
> is the parallel part, and processing the data is the serial part. And no
> matter what you do, the device only has so much bandwidth, which defines
> the maximum possible speedup (compared to the "no prefetch" case).

Right.


> Furthermore, the device does not wait for all the I/O requests to be
> submitted - it won't wait for 1000 requests and then go "OMG! There's a
> lot of work to do!" It starts processing the requests as they arrive,
> and some of them will complete before you're done with submitting the
> rest, so you'll never see all the requests in the queue at once.

It'd be interesting to see how much it helps to scale the size of
readahead requests with the distance from the current read iterator.
E.g. if a block is less than 16 blocks ahead of the current head, issue
a 1-block request; up to 32 blocks ahead, a 2-block request for
consecutive blocks. I suspect it won't help, because at least linux's
block layer / io elevator seems quite successful at merging. E.g. for
the query:
EXPLAIN ANALYZE SELECT sum(l_quantity) FROM lineitem where l_receiptdate 
between '1993-05-03' and '1993-08-03';
on a tpc-h scale dataset on my laptop, I see:
Device         r/s    w/s   rMB/s  wMB/s    rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
sda       25702.00   0.00  495.27   0.00  37687.00    0.00  59.45   0.00     5.13     0.00  132.09     19.73      0.00   0.04  100.00

but it'd be worthwhile to see whether doing the merging ourselves allows
for deeper queues.
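
For illustration, the scaled-request idea above might look like this
(a hypothetical helper; the 16/32 thresholds and the growth policy are
arbitrary):

#include <fcntl.h>

#define BLCKSZ 8192

static void
prefetch_scaled(int fd, long head, long target)
{
    long    dist = target - head;
    int     nblocks;

    if (dist < 16)
        nblocks = 1;    /* near the head: plain 1-block advice */
    else if (dist < 32)
        nblocks = 2;    /* a bit further out: 2 consecutive blocks */
    else
        nblocks = 4;    /* and so on, growing with distance */

    (void) posix_fadvise(fd, (off_t) target * BLCKSZ,
                         (off_t) nblocks * BLCKSZ,
                         POSIX_FADV_WILLNEED);
}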


I think we really should start incorporating explicit prefetching in
more places. Ordered indexscans might actually be one case that's not
too hard to do in a simple manner - whenever we're at an inner node, prefetch
the leaf nodes below it. We obviously could do better, but that might be
a decent starting point to get some numbers.
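
In rough form (InnerPage and its fields are hypothetical stand-ins, not
the real nbtree structures):

#include <fcntl.h>

#define BLCKSZ 8192

typedef struct InnerPage    /* hypothetical stand-in for a btree inner node */
{
    int     nitems;
    long   *child_blkno;    /* block numbers of the child (leaf) pages */
} InnerPage;

static void
prefetch_children(int fd, const InnerPage *inner)
{
    /* hint every leaf this inner node points to before descending */
    for (int off = 0; off < inner->nitems; off++)
        (void) posix_fadvise(fd,
                             (off_t) inner->child_blkno[off] * BLCKSZ,
                             BLCKSZ, POSIX_FADV_WILLNEED);
}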

Greetings,

Andres Freund




Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?

2017-10-31 Thread Tomas Vondra
Hi,

On 10/31/2017 04:48 PM, Greg Stark wrote:
> On 31 October 2017 at 07:05, Chris Travers wrote:
>> Hi;
>>
>> After Andres's excellent talk at PGConf we tried benchmarking
>> effective_io_concurrency on some of our servers and found that those
>> which have a number of NVME storage volumes could not fill the I/O
>> queue even at the maximum setting (1000).
>
> And was the system still i/o bound? If the cpu was 100% busy then
> perhaps Postgres just can't keep up with the I/O system. It would
> depend on workload though, if you start many very large sequential
> scans you may be able to push the i/o system harder.
>
> Keep in mind effective_io_concurrency only really affects bitmap
> index scans (and to a small degree index scans). It works by issuing
> posix_fadvise() calls for upcoming buffers one by one. That gets
> multiple spindles active but it's not really going to scale to many
> thousands of prefetches (and effective_io_concurrency of 1000
> actually means 7485 prefetches). At some point those I/Os are going
> to start completing before Postgres even has a chance to start
> processing the data.
>
Yeah, initiating the prefetches is not expensive, but it's not free
either. So there's a trade-off between time spent on prefetching and
processing the data.

I believe this may actually be illustrated using Amdahl's law - the I/O
is the parallel part, and processing the data is the serial part. And no
matter what you do, the device only has so much bandwidth, which defines
the maximum possible speedup (compared to the "no prefetch" case).
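
(Spelled out: if p is the fraction of runtime spent waiting on I/O and s
is the effective I/O parallelism, Amdahl's law bounds the speedup at

    speedup = 1 / ((1 - p) + p / s)

so even as s grows without limit, the serial processing part caps it at
1 / (1 - p) - a scan that is 80% I/O wait can never get more than 5x
faster from prefetching alone.)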

Furthermore, the device does not wait for all the I/O requests to be
submitted - it won't wait for 1000 requests and then go "OMG! There's a
lot of work to do!" It starts processing the requests as they arrive,
and some of them will complete before you're done with submitting the
rest, so you'll never see all the requests in the queue at once.

And of course, iostat and other tools only give you "average queue
length", which is mostly determined by the average throughput.

In my experience (on all types of storage, including SSDs and NVMe),
performance improves quickly and significantly once you start increasing
the value (say, to 8 or 16, maybe 64). Beyond that the gains become much
more modest - not because the device could not handle more, but because
the prefetch/processing ratio has reached its optimal value.

But all this is actually per-process: if you run multiple backends
(particularly when doing bitmap index scans), I'm sure you'll see the
queues getting fuller.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?

2017-10-31 Thread Greg Stark
On 31 October 2017 at 07:05, Chris Travers wrote:
> Hi;
>
> After Andres's excellent talk at PGConf we tried benchmarking
> effective_io_concurrency on some of our servers and found that those which
> have a number of NVME storage volumes could not fill the I/O queue even at
> the maximum setting (1000).

And was the system still i/o bound? If the cpu was 100% busy then
perhaps Postgres just can't keep up with the I/O system. It would
depend on workload though, if you start many very large sequential
scans you may be able to push the i/o system harder.

Keep in mind effective_io_concurrency only really affects bitmap index
scans (and to a small degree index scans). It works by issuing
posix_fadvise() calls for upcoming buffers one by one. That gets
multiple spindles active but it's not really going to scale to many
thousands of prefetches (and effective_io_concurrency of 1000 actually
means 7485 prefetches). At some point those I/Os are going to start
completing before Postgres even has a chance to start processing the
data.
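
For the record, that 7485 comes from the harmonic-series depth
computation in bufmgr.c: the prefetch target for a setting of n is
n/1 + n/2 + ... + n/n, i.e. n times the harmonic number H(n), and
1000 * H(1000) is about 7485. A standalone sketch of that arithmetic
(mine, not the actual server code):

#include <stdio.h>

static double
prefetch_pages(int eic)
{
    /* n/1 + n/2 + ... + n/n, i.e. n * H(n) */
    double  pages = 0.0;

    for (int i = 1; i <= eic; i++)
        pages += (double) eic / i;
    return pages;
}

int
main(void)
{
    /* effective_io_concurrency = 1000 -> ~7485 pages in flight */
    printf("%d\n", (int) prefetch_pages(1000));     /* prints 7485 */
    return 0;
}

Each of those target pages then gets a posix_fadvise(...,
POSIX_FADV_WILLNEED) hint as the scan approaches it.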

-- 
greg

