Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

2018-05-24 Thread Thomas Munro
On Thu, Apr 26, 2018 at 8:33 AM, Andres Freund wrote:
> On 2018-04-25 14:41:44 -0400, Robert Haas wrote:
>> On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth wrote:
>> > The code that detects sequential behavior cannot distinguish between
>> > pread() and lseek+read; it looks only at the actual offset of the
>> > current request compared to the previous one for the same fp.
>> >
>> >  Thomas> +1 for adopting pread()/pwrite() in PG12.
>> >
>> > ditto
>>
>> Likewise.
>
> +1 as well. Medium term I foresee usage of at least pwritev(), and
> possibly also preadv(). Being able to write out multiple buffers at once
> is pretty crucial if we ever want to do direct IO.

Also if we ever use threads and want to share file descriptors we'd
have to use it.
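
A minimal sketch of why that is, assuming a POSIX system: pread() takes
the offset as an argument and leaves the descriptor's file position
alone, so concurrent users of one descriptor cannot disturb each other's
position the way an lseek()/read() pair can.  The helper and demo names
below are hypothetical, for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes at offset with pread() and report whether the file
 * position of fd was left untouched.
 * Returns 1 if unchanged, 0 if it moved, -1 on error.
 */
static int
pread_keeps_position(int fd, void *buf, size_t len, off_t offset)
{
    off_t before;

    assert(buf != NULL);
    before = lseek(fd, 0, SEEK_CUR);
    if (before < 0 || pread(fd, buf, len, offset) < 0)
        return -1;
    return lseek(fd, 0, SEEK_CUR) == before;
}

/* Hypothetical demo on a scratch file; returns 0 on success. */
static int
demo_shared_position(void)
{
    char path[] = "/tmp/pread_demoXXXXXX";
    char buf[5];
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */

    /* After this write the file position is 11. */
    ok = write(fd, "hello world", 11) == 11
        && pread_keeps_position(fd, buf, 5, 6) == 1
        && memcmp(buf, "world", 5) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

With lseek() followed by read(), another thread could move the shared
position between the two calls.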

CC'ing Oskari Saarenmaa who proposed a patch for this a couple of years back[1].

Oskari, would you like to update your patch and post it for the
September commitfest?  At first glance, it probably needs autoconf-fu
to check whether pread()/pwrite() are supported, plus fallback code for
platforms that lack them, so someone should update the patch to do that
or explain why it's not needed based on the standards we require.  At
least Windows apparently
needs special handling (ReadFile() and WriteFile() with an OVERLAPPED
object).
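
For illustration, the fallback could look roughly like the sketch
below.  HAVE_PREAD is an assumed name for a macro that a configure check
would define, and demo_pread() is a hypothetical wrapper, not anything
taken from the patch.  The Windows case is not covered here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: use pread() where configure found it,
 * otherwise emulate it with lseek() + read().
 */
static ssize_t
demo_pread(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
#ifdef HAVE_PREAD
    return pread(fd, buf, len, offset);
#else
    /*
     * Fallback: two syscalls instead of one, and unlike the real
     * pread() this moves the file position as a side effect.
     */
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
#endif
}

/* Hypothetical demo on a scratch file; returns 0 on success. */
static int
demo_fallback_read(void)
{
    char path[] = "/tmp/pread_fallbackXXXXXX";
    char buf[5];
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */

    ok = write(fd, "hello world", 11) == 11
        && demo_pread(fd, buf, 5, 6) == 5
        && memcmp(buf, "world", 5) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

Callers that rely on the file position staying put would see different
behavior in the fallback branch, which is one reason to keep such a
wrapper in a single place.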

Practically speaking, are there any Unix-like systems outside museums
that don't have it?  According to the man pages I looked at, this
stuff is from System V R4 (1988) and appeared in ancient BSD strains
too.  Hmm, I suppose it's possible that pademelon and gaur don't: they
apparently run HP-UX 10.20 (1996) which Wikipedia tells me is derived
from System V R3!  I can see that current HP-UX does have them... but
unfortunately their man pages don't have a HISTORY section.

FWIW these functions just showed up in the latest POSIX standard[2]
(issue 7, 2017/2018?), having moved from "XSI option" to "base".

[1] https://www.postgresql.org/message-id/flat/7fdcb664-4f8a-8626-75df-ffde85005829%40ohmu.fi
[2] http://pubs.opengroup.org/onlinepubs/9699919799/functions/pread.html

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

2018-04-25 Thread Andres Freund
On 2018-04-25 14:41:44 -0400, Robert Haas wrote:
> On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth wrote:
> > The code that detects sequential behavior cannot distinguish between
> > pread() and lseek+read; it looks only at the actual offset of the
> > current request compared to the previous one for the same fp.
> >
> >  Thomas> +1 for adopting pread()/pwrite() in PG12.
> >
> > ditto
> 
> Likewise.

+1 as well. Medium term I foresee usage of at least pwritev(), and
possibly also preadv(). Being able to write out multiple buffers at once
is pretty crucial if we ever want to do direct IO.
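
As a rough sketch of what writing out multiple buffers at once looks
like: pwritev() takes an array of buffers and an explicit offset.  It is
not part of POSIX; the code assumes a platform such as Linux or a BSD
that provides it.  DEMO_BLCKSZ and both function names are hypothetical,
and no direct IO is involved here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1       /* glibc: expose pwritev() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define DEMO_BLCKSZ 8192        /* illustrative block size */

/*
 * Write two separate in-memory blocks to consecutive file positions,
 * starting at offset, with a single syscall.
 */
static ssize_t
write_two_blocks(int fd, const void *a, const void *b, off_t offset)
{
    struct iovec iov[2];

    assert(a != NULL && b != NULL);
    iov[0].iov_base = (void *) a;
    iov[0].iov_len = DEMO_BLCKSZ;
    iov[1].iov_base = (void *) b;
    iov[1].iov_len = DEMO_BLCKSZ;
    return pwritev(fd, iov, 2, offset);
}

/* Hypothetical demo on a scratch file; returns 0 on success. */
static int
demo_vectored_write(void)
{
    static char a[DEMO_BLCKSZ], b[DEMO_BLCKSZ];
    char path[] = "/tmp/pwritev_demoXXXXXX";
    char first, second;
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */

    memset(a, 'A', sizeof(a));
    memset(b, 'B', sizeof(b));
    ok = write_two_blocks(fd, a, b, 0) == 2 * DEMO_BLCKSZ
        && pread(fd, &first, 1, 0) == 1 && first == 'A'
        && pread(fd, &second, 1, DEMO_BLCKSZ) == 1 && second == 'B';
    close(fd);
    return ok ? 0 : -1;
}
```

The two buffers need not be adjacent in memory, which is the point:
neighbouring file blocks can be flushed together without first copying
them into one contiguous area.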

Greetings,

Andres Freund



Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

2018-04-25 Thread Robert Haas
On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth wrote:
> The code that detects sequential behavior cannot distinguish between
> pread() and lseek+read; it looks only at the actual offset of the
> current request compared to the previous one for the same fp.
>
>  Thomas> +1 for adopting pread()/pwrite() in PG12.
>
> ditto

Likewise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

2018-04-15 Thread Andrew Gierth
> "Thomas" == Thomas Munro writes:

 Thomas> * it's also been claimed that readahead heuristics are not
 Thomas> defeated on Linux or FreeBSD, which isn't too surprising
 Thomas> because you'd expect it to be about blocks being faulted in,
 Thomas> not syscalls

I don't know about Linux, but on FreeBSD, readahead/writebehind is
tracked at the level of open files but implemented at the level of
read/write clustering. I have patched kernels in the past to improve the
performance in mixed read/write cases; pg would benefit on unpatched
kernels from using separate file opens for backend reads and writes.
(The typical bad scenario is doing a create index, or other seqscan that
updates hint bits, on a freshly-restored table; the alternation of
reading block N and writing block N-x destroys the readahead/writebehind
since they use a common offset.)

The code that detects sequential behavior cannot distinguish between
pread() and lseek+read; it looks only at the actual offset of the
current request compared to the previous one for the same fp.
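
A toy model of that kind of per-open-file detection, purely for
illustration: it is not the FreeBSD code, and every name in it is made
up.  It compares each request's offset with where the previous request
on the same open file ended, then replays the "read block N, write
block N-x" pattern either through one shared open file or through
separate opens for reads and writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define DEMO_BLCKSZ 8192        /* illustrative block size */

/* Hypothetical per-open-file state: where the last request ended. */
typedef struct
{
    off_t next_offset;
} SeqState;

/*
 * Record one request and report whether it looked sequential.  Only
 * the offset matters; how the caller got there (pread() or
 * lseek()+read()) is invisible at this level.
 */
static bool
seq_note(SeqState *st, off_t offset, size_t len)
{
    bool sequential = (offset == st->next_offset);

    st->next_offset = offset + (off_t) len;
    return sequential;
}

/*
 * Replay a scan that reads block n and then writes block n - lag.
 * Returns how many requests were judged sequential.
 */
static int
simulate_scan(int nblocks, int lag, bool separate_opens)
{
    SeqState rd = {0};
    SeqState wr = {0};
    SeqState *wstate = separate_opens ? &wr : &rd;
    int hits = 0;

    assert(nblocks >= 0 && lag >= 0);
    for (int n = 0; n < nblocks; n++)
    {
        hits += seq_note(&rd, (off_t) n * DEMO_BLCKSZ, DEMO_BLCKSZ);
        if (n >= lag)
            hits += seq_note(wstate, (off_t) (n - lag) * DEMO_BLCKSZ,
                             DEMO_BLCKSZ);
    }
    return hits;
}
```

In this model, simulate_scan(10, 2, false) counts 3 sequential requests
out of 18, because every write resets the shared offset, while
simulate_scan(10, 2, true) counts all 18.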

 Thomas> +1 for adopting pread()/pwrite() in PG12.

ditto

-- 
Andrew (irc:RhodiumToad)



Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

2018-04-15 Thread Thomas Munro
On Fri, Jun 23, 2017 at 4:50 AM, Andres Freund wrote:
> On 2017-06-22 12:43:16 -0400, Robert Haas wrote:
>> On Wed, Jan 25, 2017 at 2:52 PM, Andres Freund wrote:
>> > You'll, depending on your workload, still have a lot of lseeks even if
>> > we were to use pread/pwrite because we do lseek(SEEK_END) to get file
>> > sizes.
>>
>> I'm pretty convinced that the lseek overhead that we're incurring
>> right now is excessive.
>
> No argument there.

My 2c:

* every comparable open source system I looked at uses pread() if it's available
* speedups have been claimed
* it's also been claimed that readahead heuristics are not defeated on
Linux or FreeBSD, which isn't too surprising because you'd expect it
to be about blocks being faulted in, not syscalls
* just in case there exists an operating system that has pread() but
doesn't do readahead in that case, we could provide a compile-time
option to select the fallback mode (until such time as you can get
that bug fixed in your OS?)
* syscalls aren't getting cheaper, and this is a 2-for-1 deal, what's
not to like?

+1 for adopting pread()/pwrite() in PG12.

I understand that the use of lseek() to find file sizes is a different
problem and unrelated.
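
To make the distinction concrete, here is a minimal sketch assuming a
POSIX system; the function names are hypothetical.  Finding the size
with lseek(SEEK_END) is a separate syscall that pread() does not remove,
but since pread() ignores the file position, the seek at least does not
interfere with later reads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return the file's size by seeking to its end.  Side effect: the
 * file position of fd is left at end of file.
 */
static off_t
file_size_by_seek(int fd)
{
    assert(fd >= 0);
    return lseek(fd, 0, SEEK_END);
}

/* Hypothetical demo on a scratch file; returns 0 on success. */
static int
demo_size_then_read(void)
{
    char path[] = "/tmp/size_demoXXXXXX";
    char buf[5];
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */

    ok = write(fd, "hello world", 11) == 11
        && file_size_by_seek(fd) == 11
        /* The position is now at EOF; pread() is unaffected by that. */
        && pread(fd, buf, 5, 0) == 5
        && memcmp(buf, "hello", 5) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```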

-- 
Thomas Munro
http://www.enterprisedb.com