Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..
On Thu, Apr 26, 2018 at 8:33 AM, Andres Freund wrote:
> On 2018-04-25 14:41:44 -0400, Robert Haas wrote:
>> On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth wrote:
>> > The code that detects sequential behavior can not distinguish between
>> > pread() and lseek+read, it looks only at the actual offset of the
>> > current request compared to the previous one for the same fp.
>> >
>> > Thomas> +1 for adopting pread()/pwrite() in PG12.
>> >
>> > ditto
>>
>> Likewise.
>
> +1 as well. Medium term I foresee usage of at least pwritev(), and
> possibly also preadv(). Being able to write out multiple buffers at once
> is pretty crucial if we ever want to do direct IO.

Also if we ever use threads and want to share file descriptors we'd
have to use it.

CC'ing Oskari Saarenmaa who proposed a patch for this a couple of
years back[1]. Oskari, would you like to update your patch and post
it for the September commitfest? At first glance, it probably needs
autoconf-fu to check if pread()/pwrite() are supported and fallback
code, so someone should update the patch to do that or explain why
it's not needed based on standards we require. At least Windows
apparently needs special handling (ReadFile() and WriteFile() with an
OVERLAPPED object).

Practically speaking, are there any Unix-like systems outside museums
that don't have it? According to the man pages I looked at, this
stuff is from System V R4 (1988) and appeared in ancient BSD strains
too. Hmm, I suppose it's possible that pademelon and gaur don't: they
apparently run HP-UX 10.20 (1996) which Wikipedia tells me is derived
from System V R3! I can see that current HP-UX does have them... but
unfortunately their man pages don't have a HISTORY section.

FWIW these functions just showed up in the latest POSIX standard[2]
(issue 7, 2017/2018?), having moved from "XSI option" to "base".
[1] https://www.postgresql.org/message-id/flat/7fdcb664-4f8a-8626-75df-ffde85005829%40ohmu.fi
[2] http://pubs.opengroup.org/onlinepubs/9699919799/functions/pread.html

--
Thomas Munro
http://www.enterprisedb.com
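
For illustration only, the "configure check plus fallback code" idea
mentioned in this message might look roughly like the sketch below on the
write side. HAVE_PWRITE and write_at() are hypothetical names made up for
this example; they are not taken from Oskari's patch or from the
PostgreSQL tree, and the Windows case (WriteFile() with an OVERLAPPED
object) is not covered.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * HAVE_PWRITE is a hypothetical macro standing in for whatever a
 * configure-time check would define. Build with -DHAVE_PWRITE=0 to
 * exercise the fallback path.
 */
#ifndef HAVE_PWRITE
#define HAVE_PWRITE 1
#endif

/*
 * Write len bytes from buf at absolute offset off.
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t
write_at(int fd, const void *buf, size_t len, off_t off)
{
#if HAVE_PWRITE
    /* One syscall; the descriptor's file position is left alone. */
    return pwrite(fd, buf, len, off);
#else
    /* Fallback: two syscalls, and the file position is moved. */
    if (lseek(fd, off, SEEK_SET) == (off_t) -1)
        return -1;
    return write(fd, buf, len);
#endif
}
```

Callers only ever see write_at(), so the choice between the two paths
stays a build-time detail. Note that the fallback path is not safe if
threads share the descriptor, which is the point made above about
threads.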
Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..
On 2018-04-25 14:41:44 -0400, Robert Haas wrote:
> On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth wrote:
> > The code that detects sequential behavior can not distinguish between
> > pread() and lseek+read, it looks only at the actual offset of the
> > current request compared to the previous one for the same fp.
> >
> > Thomas> +1 for adopting pread()/pwrite() in PG12.
> >
> > ditto
>
> Likewise.

+1 as well. Medium term I foresee usage of at least pwritev(), and
possibly also preadv(). Being able to write out multiple buffers at once
is pretty crucial if we ever want to do direct IO.

Greetings,

Andres Freund
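
To make "write out multiple buffers at once" concrete, here is a minimal
sketch of a vectored positional write. It assumes a platform that
provides pwritev() (Linux and the modern BSDs do; it is not part of
POSIX), and write_two_buffers_at() is a hypothetical helper name used
only for this example, not anything proposed in the thread.

```c
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Write two separate in-memory buffers to consecutive file positions
 * starting at off, using a single syscall. Returns the total number of
 * bytes written (which may be short), or -1 on error.
 */
static ssize_t
write_two_buffers_at(int fd, void *a, size_t alen,
                     void *b, size_t blen, off_t off)
{
    struct iovec iov[2];

    iov[0].iov_base = a;
    iov[0].iov_len = alen;
    iov[1].iov_base = b;
    iov[1].iov_len = blen;

    /* The descriptor's file position is not changed by pwritev(). */
    return pwritev(fd, iov, 2, off);
}
```

The two buffers do not need to be adjacent in memory, only adjacent in
the file. A real caller would also have to handle short writes by
retrying with the remaining bytes.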
Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..
On Mon, Apr 16, 2018 at 2:13 AM, Andrew Gierth wrote:
> The code that detects sequential behavior can not distinguish between
> pread() and lseek+read, it looks only at the actual offset of the
> current request compared to the previous one for the same fp.
>
> Thomas> +1 for adopting pread()/pwrite() in PG12.
>
> ditto

Likewise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..
> "Thomas" == Thomas Munro writes:

 Thomas> * it's also been claimed that readahead heuristics are not
 Thomas> defeated on Linux or FreeBSD, which isn't too surprising
 Thomas> because you'd expect it to be about blocks being faulted in,
 Thomas> not syscalls

I don't know about linux, but on FreeBSD, readahead/writebehind is
tracked at the level of open files but implemented at the level of
read/write clustering. I have patched kernels in the past to improve
the performance in mixed read/write cases; pg would benefit on
unpatched kernels from using separate file opens for backend reads and
writes. (The typical bad scenario is doing a create index, or other
seqscan that updates hint bits, on a freshly-restored table; the
alternation of reading block N and writing block N-x destroys the
readahead/writebehind since they use a common offset.)

The code that detects sequential behavior can not distinguish between
pread() and lseek+read, it looks only at the actual offset of the
current request compared to the previous one for the same fp.

 Thomas> +1 for adopting pread()/pwrite() in PG12.

ditto

--
Andrew (irc:RhodiumToad)
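
As a toy model only (this is not FreeBSD kernel code, and SeqTracker and
note_request() are names invented for this sketch), the behavior
described in this message can be pictured as one small piece of state
per open file that remembers where the previous request ended. Because
only the request offset is compared, pread() and lseek+read look
identical to it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Toy per-open-file state: where the previous request ended. */
typedef struct SeqTracker
{
    off_t next_expected;    /* offset just past the previous request */
    int   seq_count;        /* consecutive sequential requests seen */
} SeqTracker;

/*
 * Record one request (a read or a write; the model does not care which,
 * nor whether the offset came from lseek() or from pread()/pwrite()).
 * Returns true if it started exactly where the previous one ended.
 */
static bool
note_request(SeqTracker *t, off_t off, size_t len)
{
    bool sequential = (off == t->next_expected);

    t->seq_count = sequential ? t->seq_count + 1 : 0;
    t->next_expected = off + (off_t) len;
    return sequential;
}
```

Feeding one shared tracker the alternating pattern "read block N, write
block N-x" never lets seq_count rise above zero, while giving reads and
writes a tracker each (the analogue of separate file opens) lets both
streams be recognized as sequential.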
Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..
On Fri, Jun 23, 2017 at 4:50 AM, Andres Freund wrote:
> On 2017-06-22 12:43:16 -0400, Robert Haas wrote:
>> On Wed, Jan 25, 2017 at 2:52 PM, Andres Freund wrote:
>> > You'll, depending on your workload, still have a lot of lseeks even if
>> > we were to use pread/pwrite because we do lseek(SEEK_END) to get file
>> > sizes.
>>
>> I'm pretty convinced that the lseek overhead that we're incurring
>> right now is excessive.
>
> No argument there.

My 2c:

* every comparable open source system I looked at uses pread() if it's
available
* speedups have been claimed
* it's also been claimed that readahead heuristics are not defeated on
Linux or FreeBSD, which isn't too surprising because you'd expect it
to be about blocks being faulted in, not syscalls
* just in case there exists an operating system that has pread() but
doesn't do readahead in that case, we could provide a compile-time
option to select the fallback mode (until such time as you can get
that bug fixed in your OS?)
* syscalls aren't getting cheaper, and this is a 2-for-1 deal, what's
not to like?

+1 for adopting pread()/pwrite() in PG12.

I understand that the use of lseek() to find file sizes is a different
problem and unrelated.

--
Thomas Munro
http://www.enterprisedb.com
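
The "2-for-1 deal" can be shown side by side with a short sketch.
read_block_seek() and read_block_pread() are hypothetical helper names
for this example and do not correspond to functions in the PostgreSQL
source; the block size is simply passed in by the caller.

```c
#include <sys/types.h>
#include <unistd.h>

/* Two syscalls per block: position, then read. Moves the file offset. */
static ssize_t
read_block_seek(int fd, void *buf, size_t blksz, off_t blkno)
{
    if (lseek(fd, blkno * (off_t) blksz, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, blksz);
}

/* One syscall per block; the file offset is left where it was. */
static ssize_t
read_block_pread(int fd, void *buf, size_t blksz, off_t blkno)
{
    return pread(fd, buf, blksz, blkno * (off_t) blksz);
}
```

Both helpers return the same bytes for the same block number. The only
differences visible to the caller are the number of syscalls and whether
the descriptor's file position changes afterwards.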