Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 7:53 AM, Mel Gorman mgor...@suse.de wrote: The second is to have pages that are strictly kept dirty until the application syncs them. An unbounded number of these pages would blow up but maybe bounds could be placed on it. There are no solid conclusions on that part yet.

I think the interface would be subtler than that. The current architecture is that if an individual process decides to evict one of these pages, it knows how much of the log needs to be flushed and fsynced before it can do so, and proceeds to do it itself. This is a situation to be avoided as much as possible, but there are workloads where it's inevitable (the typical example is mass data loads).

There would need to be some kind of similar interface where there would be some way for the kernel to force log pages to be written to allow it to advance the epoch: either some way to wake Postgres up and inform it of the urgency, or better yet, Postgres would just always be writing out pages without fsyncing them and instead be issuing some other syscall to mark the points in the log file that correspond to the write barriers that would unpin these buffers.

Ted Ts'o was concerned this would all be a massive layering violation, and I have to admit that's a huge risk. It would take some clever API engineering to come up with a clean set of primitives to express the kind of ordering guarantees we need without being too tied to Postgres's specific implementation. The reason I think it's more interesting, though, is that Postgres's journalling and checkpointing architecture is pretty bog-standard CS stuff, and there are hundreds or thousands of pieces of software out there that do pretty much the same work, and trying to do it efficiently with fsync or O_DIRECT is like working with both hands tied to your feet.

-- greg

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
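Greg's flush-before-evict rule (a dirty buffer may leave the pool only after the log has been flushed past the record that dirtied it) can be sketched in a few lines. This is a toy model under invented names (Wal, BufferPool), not PostgreSQL's actual buffer manager:

```python
# Toy model of write-ahead logging: each dirty page records the log
# position (LSN) of the last WAL record that modified it, and eviction
# must flush the WAL at least that far first. Purely illustrative.

class Wal:
    def __init__(self):
        self.next_lsn = 0       # next log position to assign
        self.flushed_lsn = -1   # everything <= this is durable

    def append(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush_up_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

class BufferPool:
    def __init__(self, wal):
        self.wal = wal
        self.pages = {}  # page id -> LSN of last record touching it

    def modify(self, page, record):
        self.pages[page] = self.wal.append(record)

    def evict(self, page):
        """The evicting process flushes the WAL itself if needed."""
        lsn = self.pages.pop(page)
        if lsn > self.wal.flushed_lsn:
            self.wal.flush_up_to(lsn)
        return lsn

wal = Wal()
pool = BufferPool(wal)
pool.modify("A", "update row 1")
pool.modify("B", "update row 2")
evicted_lsn = pool.evict("B")
assert wal.flushed_lsn >= evicted_lsn   # WAL durable before the page left
```

The interface question in the thread is exactly who runs the `flush_up_to` step: today the evicting backend does, and the proposals discuss letting the kernel trigger it instead.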
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Thursday, January 16, 2014, Dave Chinner da...@fromorbit.com wrote: On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote: On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote: On 1/15/14, 12:00 AM, Claudio Freire wrote:

My completely unproven theory is that swapping is overwhelmed by near-misses. Ie: a process touches a page, and before it's actually swapped in, another process touches it too, blocking on the other process' read. But the second process doesn't account for that page when evaluating predictive models (ie: read-ahead), so the next I/O by process 2 is unexpected to the kernel. Then the same with 1. Etc... In essence, swap, by a fluke of its implementation, fails utterly to predict the I/O pattern, and results in far sub-optimal reads. Explicit I/O is free from that effect; all read calls are accountable, and that makes a difference. Maybe, if the kernel could be fixed in that respect, you could consider mmap'd files as a suitable form of temporary storage. But that would depend on the success and availability of such a fix/patch.

Another option is to consider some of the more radical ideas in this thread, but only for temporary data. Our write sequencing and other needs are far less stringent for this stuff. -- Jim C.

I suspect that a lot of the temporary data issues can be solved by using tmpfs for temporary files.

Temp files can collectively reach hundreds of gigs. So unless you have terabytes of RAM you're going to have to write them back to disk.

If they turn out to be hundreds of gigs, then yes they have to hit disk (at least on my hardware). But if they are 10 gig, then maybe not (depending on whether other people decide to do similar things at the same time I'm going to be doing it--something which is often hard to predict).
But now for every action I take, I have to decide: is this going to take 10 gig, or 14 gig, and how absolutely certain am I? And is someone else going to try something similar at the same time? What a hassle. It would be so much nicer to say "This is accessed sequentially, and will never be fsynced. Maybe it will fit entirely in memory, maybe it won't; either way, you know what to do." If I start out writing to tmpfs, I can't very easily change my mind 94% of the way through and decide to go somewhere else. But the kernel, effectively, can.

But there's something here that I'm not getting - you're talking about a data set that you want to keep cache resident that is at least an order of magnitude larger than the cyclic 5-15 minute WAL dataset that ongoing operations need to manage to avoid IO storms.

Those are mostly orthogonal issues. The permanent files need to be fsynced on a regular basis, and might have gigabytes of data dirtied at random from within terabytes of underlying storage. We'd better start writing that pretty quickly, or when we do issue the fsyncs, the world will fall apart. The temporary files will never need to be fsynced, and can be written out sequentially if they do ever need to be written out. Better to delay this as much as feasible.

Where do these temporary files fit into this picture, how fast do they grow, and why do they need to be so large in comparison to the ongoing modifications being made to the database?

The permanent files tend to be things like "Jane Doe just bought a pair of green shoes from Hendrick Green Shoes Limited--record that, charge her credit card, and schedule delivery." The temp files are more like "It is the end of the year; how many shoes have been purchased in each color from each manufacturer for each quarter over the last 6 years?" So the temp files quickly manipulate data that has slowly been accumulating over very long times, while the permanent files represent the processes of those accumulations.
If you are Amazon, of course, you have thousands of people who can keep two sets of records, one organized for fast update and one slightly delayed copy reorganized for fast analysis, and also do partial analysis on an ongoing basis and roll them up in ways that can be incrementally updated. If you are not Amazon, it would be nice if one system did a better job of doing both things, with the trade-off between the two being dynamic and automatic. Cheers, Jeff
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
(This thread is now massive and I have not read it all yet. If anything I say has already been discussed then whoops.)

On Tue, Jan 14, 2014 at 12:09:46PM +1100, Dave Chinner wrote: On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote: On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote:

For one, postgres doesn't use mmap for files (and can't without major new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has horrible consequences for performance/scalability - very quickly you contend on locks in the kernel.

I may as well dump this in this thread. We've discussed this in person a few times, including at least once with Ted Ts'o when he visited Dublin last year. The fundamental conflict is that the kernel understands better the hardware and other software using the same resources; Postgres understands better its own access patterns. We need to either add interfaces so Postgres can teach the kernel what it needs about its access patterns, or add interfaces so Postgres can find out what it needs to know about the hardware context.

In my experience applications don't need to know anything about the underlying storage hardware - all they need is for someone to tell them the optimal IO size and alignment to use.

That potentially misses details on efficient IO patterns. They might, for example, submit many small requests, each of which is of the optimal IO size and alignment but which are sub-optimal overall. While these still go through the underlying block layers, there is no guarantee that the requests will arrive in time for efficient merging to occur.

The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. This is absolutely necessary to maintain consistency. Postgres would need to be able to mark pages as unflushable until some point in time in the future when the journal is flushed.
We discussed various ways that interface could work but it would be tricky to keep it low enough overhead to be workable.

IMO, the concept of allowing userspace to pin dirty page cache pages in memory is just asking for trouble. Apart from the obvious memory reclaim and OOM issues, some filesystems won't be able to move their journals forward until the data is flushed. i.e. ordered mode data writeback on ext3 will have all sorts of deadlock issues that result from pinning pages and then issuing fsync() on another file which will block waiting for the pinned pages to be flushed.

That applies if the dirty pages are forced to be kept dirty. You call this "pinned", but pinned has special meaning, so I would suggest calling it something like dirty-sticky pages. It could be the case that such hinting will have the pages excluded from dirty background writing but can still be cleaned if dirty limits are hit or if fsync is called. It's a hint, not a forced guarantee.

It's still a hand grenade if this is tracked on a per-page basis, because of what happens if the process crashes: those pages stay dirty potentially forever. An alternative would be to track this on a per-inode instead of per-page basis. The hint would only exist while there is an open fd for that inode. Treat it as a privileged call with a sysctl controlling how many dirty-sticky pages can exist in the system, with the information presented during OOM kills, and maybe it starts becoming a bit more manageable. Dirty-sticky pages are not guaranteed to stay dirty until userspace action; the kernel just stays away until there are no other sensible options.

Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);? If fsync() blocks because there are pinned pages, and there's no other thread to unpin them, then that code just deadlocked.

Indeed. Forcing pages with this hint to stay dirty until user space decides to clean them is eventually going to blow up.

[SNIP]

Hmmm.
What happens if the process crashes after pinning the dirty pages? How do we even know what process pinned the dirty pages so we can clean up after it? What happens if the same page is pinned by multiple processes? What happens on truncate/hole punch if the partial pages in the range that need to be zeroed and written are pinned? What happens if we do direct IO to a range with pinned, unflushable pages in the page cache?

Proposal: A process with an open fd can hint that pages managed by this inode will have dirty-sticky pages. Pages will be ignored by dirty background writing unless there is an fsync call or dirty page limits are hit. The hint is cleared when no process has the file open.

If the process crashes, the hint is cleared and the pages get cleaned as normal. Multiple processes do not matter as such, as all of them will have the file open. There is a problem if the processes
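For concreteness, the per-inode proposal above can be modelled as a tiny writeback loop. Everything here (Inode, DIRTY_LIMIT, background_writeback) is invented for illustration; it mimics only the stated policy, not actual kernel code:

```python
# Toy model of the per-inode "dirty-sticky" hint: background writeback
# skips sticky inodes, but the hint yields to the global dirty limit and
# is dropped when the last holder of the file closes (e.g. after a
# crash). Names and the limit value are made up for this sketch.

class Inode:
    def __init__(self):
        self.dirty = set()      # dirty page numbers
        self.sticky = False     # hint: avoid background writeback
        self.open_fds = 0

DIRTY_LIMIT = 8  # total dirty pages this toy "system" tolerates

def background_writeback(inodes):
    total_dirty = sum(len(i.dirty) for i in inodes)
    for inode in inodes:
        # Honour the hint only while someone holds the file open and
        # the system-wide dirty limit has not been exceeded.
        if inode.sticky and inode.open_fds > 0 and total_dirty <= DIRTY_LIMIT:
            continue
        inode.dirty.clear()     # "clean" the pages

temp = Inode(); temp.sticky = True; temp.open_fds = 1
temp.dirty.update(range(4))
background_writeback([temp])
assert temp.dirty            # hint honoured: pages stayed dirty

temp.open_fds = 0            # process crashed / closed the file
background_writeback([temp])
assert not temp.dirty        # hint cleared, pages cleaned as normal
```

Robert Haas's objection later in the thread is visible in the last two lines: after a crash the toy kernel writes the pages out, whereas Postgres would need them thrown away.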
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 09:44:21AM +, Mel Gorman wrote:

[SNIP]

What happens if the process crashes after pinning the dirty pages? How do we even know what process pinned the dirty pages so we can clean up after it? What happens if the same page is pinned by multiple processes? What happens on truncate/hole punch if the partial pages in the range that need to be zeroed and written are pinned? What happens if we do direct IO to a range with pinned, unflushable pages in the page cache?

Proposal: A process with an open fd can hint that pages managed by this inode will have dirty-sticky pages. Pages will be ignored by dirty background writing unless there is an fsync call or dirty page limits are hit. The hint is cleared when no process has the file open.

I'm still processing the rest of the thread and putting it into my head, but it's at least clear that this proposal would only cover the case where large temporary files are created that do not necessarily need to be persisted. They still have cases where the ordering of writes matters and the kernel cleaning pages behind their back would lead to corruption.

-- Mel Gorman SUSE Labs
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman mgor...@suse.de wrote:

That applies if the dirty pages are forced to be kept dirty. You call this "pinned", but pinned has special meaning, so I would suggest calling it something like dirty-sticky pages. It could be the case that such hinting will have the pages excluded from dirty background writing but can still be cleaned if dirty limits are hit or if fsync is called. It's a hint, not a forced guarantee. It's still a hand grenade if this is tracked on a per-page basis, because of what happens if the process crashes: those pages stay dirty potentially forever. An alternative would be to track this on a per-inode instead of per-page basis. The hint would only exist while there is an open fd for that inode. Treat it as a privileged call with a sysctl controlling how many dirty-sticky pages can exist in the system, with the information presented during OOM kills, and maybe it starts becoming a bit more manageable. Dirty-sticky pages are not guaranteed to stay dirty until userspace action; the kernel just stays away until there are no other sensible options.

I think this discussion is vividly illustrating why this whole line of inquiry is a pile of fail. If all the processes that have the file open crash, the changes have to be *thrown away*, not written to disk whenever the kernel likes.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 10:16:27AM -0500, Robert Haas wrote: On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman mgor...@suse.de wrote:

That applies if the dirty pages are forced to be kept dirty. You call this "pinned", but pinned has special meaning, so I would suggest calling it something like dirty-sticky pages. It could be the case that such hinting will have the pages excluded from dirty background writing but can still be cleaned if dirty limits are hit or if fsync is called. It's a hint, not a forced guarantee. It's still a hand grenade if this is tracked on a per-page basis, because of what happens if the process crashes: those pages stay dirty potentially forever. An alternative would be to track this on a per-inode instead of per-page basis. The hint would only exist while there is an open fd for that inode. Treat it as a privileged call with a sysctl controlling how many dirty-sticky pages can exist in the system, with the information presented during OOM kills, and maybe it starts becoming a bit more manageable. Dirty-sticky pages are not guaranteed to stay dirty until userspace action; the kernel just stays away until there are no other sensible options.

I think this discussion is vividly illustrating why this whole line of inquiry is a pile of fail. If all the processes that have the file open crash, the changes have to be *thrown away*, not written to disk whenever the kernel likes.

I realise that now and sorry for the noise. I later read the parts of the thread that covered the strict ordering requirements, and in a summary mail I split the requirements in two. In one, there are dirty-sticky pages that the kernel should not write back unless it has no other option or fsync is called. This may be suitable for large temporary files that Postgres does not necessarily want to hit the platter but also does not have strict ordering requirements for. The second is to have pages that are strictly kept dirty until the application syncs them.
An unbounded number of these pages would blow up but maybe bounds could be placed on it. There are no solid conclusions on that part yet. -- Mel Gorman SUSE Labs
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 10:53 AM, Mel Gorman mgor...@suse.de wrote:

I realise that now and sorry for the noise. I later read the parts of the thread that covered the strict ordering requirements, and in a summary mail I split the requirements in two. In one, there are dirty-sticky pages that the kernel should not write back unless it has no other option or fsync is called. This may be suitable for large temporary files that Postgres does not necessarily want to hit the platter but also does not have strict ordering requirements for. The second is to have pages that are strictly kept dirty until the application syncs them. An unbounded number of these pages would blow up but maybe bounds could be placed on it. There are no solid conclusions on that part yet.

I think that the bottom line is that we're not likely to make massive changes to the way that we do block caching now. Even if some other scheme could work much better on Linux (and so far I'm unconvinced that any of the proposals made here would in fact work much better), we aim to be portable to Windows as well as other UNIX-like systems (BSD, Solaris, etc.). So using completely Linux-specific technology in an overhaul of our block cache seems to me to have no future. On the other hand, giving the kernel hints about what we're doing that would enable it to be smarter seems to me to have a lot of potential. Ideas so far mentioned include:

- Hint that we're going to do an fsync on file X at time Y, so that the kernel can schedule the write-out to complete right around that time.
- Hint that a block is a good candidate for reclaim without actually purging it if there's no memory pressure.
- Hint that a page we modify in our cache should be dropped from the kernel cache.
- Hint that a page we write back to the operating system should be dropped from the kernel cache after the I/O completes.
It's hard to say which of these ideas will work well without testing them, and the overhead of the extra system calls might be significant in some of those cases, but it seems a promising line of inquiry. And the idea of being able to do an 8kB atomic write with OS support so that we don't have to save full page images in our write-ahead log to cover the torn page scenario seems very intriguing indeed. If that worked well, it would be a *big* deal for us. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
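Some of the hints in Robert's list already have rough counterparts in existing interfaces: POSIX_FADV_DONTNEED approximates "drop this from the kernel cache after write-out", and POSIX_FADV_SEQUENTIAL is a read-ahead hint. A minimal sketch using Python's os module, which exposes posix_fadvise on Linux (the hasattr guard covers platforms where it is absent); the file contents are arbitrary:

```python
# Demonstrate cache hints around a write-then-read cycle. The hints are
# advisory only: they never change the data that is read back.
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 8192)
    f.flush()
    os.fsync(f.fileno())
    if hasattr(os, "posix_fadvise"):
        # "Drop this from the kernel cache now that it is written back":
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
path = f.name

with open(path, "rb") as f:
    if hasattr(os, "posix_fadvise"):
        # "We will read this sequentially" (read-ahead hint):
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
    data = f.read()
os.unlink(path)
assert data == b"x" * 8192
```

There is no portable analogue for the first hint in the list (scheduling write-out to finish right at a future fsync), which is part of why the thread treats it as a new interface.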
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
Robert Haas robertmh...@gmail.com writes:

I think that the bottom line is that we're not likely to make massive changes to the way that we do block caching now. Even if some other scheme could work much better on Linux (and so far I'm unconvinced that any of the proposals made here would in fact work much better), we aim to be portable to Windows as well as other UNIX-like systems (BSD, Solaris, etc.). So using completely Linux-specific technology in an overhaul of our block cache seems to me to have no future.

Unfortunately, I have to agree with this. Even if there were a way to merge our internal buffers with the kernel's, it would surely be far too invasive to coexist with buffer management that'd still work on more traditional platforms. But we could add hint calls, or modify the I/O calls we use, and that ought to be a reasonably localized change.

And the idea of being able to do an 8kB atomic write with OS support so that we don't have to save full page images in our write-ahead log to cover the torn page scenario seems very intriguing indeed. If that worked well, it would be a *big* deal for us.

+1. That would be a significant win, and trivial to implement, since we already have a way to switch off full-page images for people who trust their filesystems to do atomic writes. It's just that safe use of that switch isn't widely possible ...

regards, tom lane
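To make the torn-page motivation concrete: disks typically promise atomicity only per sector, so a crash mid-write can leave an 8kB page half old and half new, and the full-page image kept in the WAL is what repairs it. A toy byte-level illustration (the sector size and page contents here are made up):

```python
# Why full-page images (FPIs) exist without 8kB atomic writes: a torn
# write leaves a page that matches neither the old nor the new version,
# and recovery replays the complete copy logged in the WAL instead.
PAGE = 8192
SECTOR = 512   # assumed atomic write unit of the disk

old = bytes([1]) * PAGE        # page contents at the last checkpoint
new = bytes([2]) * PAGE        # contents after an update

wal_fpi = new                  # full-page image logged before the write

# Power fails after only 3 of the 16 sectors reach the platter:
torn = new[:3 * SECTOR] + old[3 * SECTOR:]
assert torn != old and torn != new   # neither version: a corrupt page

# Recovery ignores the torn on-disk page and restores the full image:
recovered = wal_fpi
assert recovered == new
```

With an OS-guaranteed 8kB atomic write, `torn` could never exist and the `wal_fpi` copy (and its WAL volume) would be unnecessary, which is the win Robert and Tom are pointing at; this also matches the existing switch for filesystems trusted to write pages atomically.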
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 2:52 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: I think that the bottom line is that we're not likely to make massive changes to the way that we do block caching now. Even if some other scheme could work much better on Linux (and so far I'm unconvinced that any of the proposals made here would in fact work much better), we aim to be portable to Windows as well as other UNIX-like systems (BSD, Solaris, etc.). So using completely Linux-specific technology in an overhaul of our block cache seems to me to have no future. Unfortunately, I have to agree with this. Even if there were a way to merge our internal buffers with the kernel's, it would surely be far too invasive to coexist with buffer management that'd still work on more traditional platforms. But we could add hint calls, or modify the I/O calls we use, and that ought to be a reasonably localized change. That's what's pretty nice with the zero-copy read idea. It's almost transparent. You read to a page-aligned address, and it works. The only code change would be enabling zero-copy reads, which I'm not sure will be low-overhead enough to leave enabled by default.
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On 01/13/2014 11:22 PM, James Bottomley wrote:

The less exciting, more conservative option would be to add kernel interfaces to teach Postgres about things like raid geometries. Then Postgres could use directio and decide to do prefetching based on the raid geometry, how much available i/o bandwidth and iops is available, etc. Reimplementing i/o schedulers and all the rest of the work that the kernel provides inside Postgres just seems like something outside our competency and that none of us is really excited about doing.

This would also be a well trodden path ... I believe that some large database company introduced Direct IO for roughly this purpose. File systems at that time were much worse than they are now, so said large company had no choice but to write its own. As Linux file handling has been much better for most of PostgreSQL's active development, we have been able to avoid it and still have reasonable performance. What has been pointed out above are some (allegedly desktop/mobile influenced) decisions which broke good performance.

Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote: On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote: For one, postgres doesn't use mmap for files (and can't without major new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has horrible consequences for performance/scalability - very quickly you contend on locks in the kernel. I may as well dump this in this thread. We've discussed this in person a few times, including at least once with Ted T'so when he visited Dublin last year. The fundamental conflict is that the kernel understands better the hardware and other software using the same resources, Postgres understands better its own access patterns. We need to either add interfaces so Postgres can teach the kernel what it needs about its access patterns or add interfaces so Postgres can find out what it needs to know about the hardware context. In my experience applications don't need to know anything about the underlying storage hardware - all they need is for someone to tell them the optimal IO size and alignment to use. The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. This is absolutely necessary to maintain consistency. Postgres would need to be able to mark pages as unflushable until some point in time in the future when the journal is flushed. We discussed various ways that interface could work but it would be tricky to keep it low enough overhead to be workable. IMO, the concept of allowing userspace to pin dirty page cache pages in memory is just asking for trouble. Apart from the obvious memory reclaim and OOM issues, some filesystems won't be able to move their journals forward until the data is flushed. i.e. 
ordered mode data writeback on ext3 will have all sorts of deadlock issues that result from pinning pages and then issuing fsync() on another file which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);? If fsync() blocks because there are pinned pages, and there's no other thread to unpin them, then that code just deadlocked. If fsync() doesn't block and skips the pinned pages, then we haven't done an fsync() at all, and so violated the expectation that users have that after fsync() returns their data is safe on disk. And if we return an error to fsync(), then what the hell does the user do if it is some other application we don't know about that has pinned the pages? And if the kernel unpins them after some time, then we just violated the application's consistency guarantees... Hmmm.

What happens if the process crashes after pinning the dirty pages? How do we even know what process pinned the dirty pages so we can clean up after it? What happens if the same page is pinned by multiple processes? What happens on truncate/hole punch if the partial pages in the range that need to be zeroed and written are pinned? What happens if we do direct IO to a range with pinned, unflushable pages in the page cache?

These are all complex corner cases that are introduced by allowing applications to pin dirty pages in memory. I've only spent a few minutes coming up with these, and I'm sure there's more of them. As such, I just don't see allowing userspace to pin dirty page cache pages in memory being a workable solution.

The less exciting, more conservative option would be to add kernel interfaces to teach Postgres about things like raid geometries. Then

/sys/block/<dev>/queue/* contains all the information that is exposed to filesystems to optimise layout for storage geometry. Some filesystems can already expose the relevant parts of this information to userspace, others don't.
What I think we really need to provide is a generic interface similar to the old XFS_IOC_DIOINFO ioctl that can be used to expose IO characteristics to applications in a simple, easy to gather manner. Something like:

struct io_info {
	u64	minimum_io_size;	/* sector size */
	u64	maximum_io_size;	/* currently 2GB */
	u64	optimal_io_size;	/* stripe unit/width */
	u64	optimal_io_alignment;	/* stripe unit/width */
	u64	mem_alignment;		/* PAGE_SIZE */
	u32	queue_depth;		/* max IO concurrency */
};

Postgres could use directio and decide to do prefetching based on the raid geometry,

Underlying storage array raid geometry and optimal IO sizes for the filesystem may be different. Hence you want what the filesystem considers optimal, not what the underlying storage is configured with. Indeed, a filesystem might be able to supply per-file IO characteristics depending on where it is located in the filesystem (think tiered storage).

how much available i/o bandwidth and iops is available, etc.

The kernel
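As a sketch of how an application might consume such a struct, here is illustrative arithmetic that rounds an I/O range out to the advertised optimal alignment. The field names follow Dave's example struct; the numeric values and helper names are invented:

```python
# Given DIOINFO-style geometry, expand a byte range so that both its
# start and length fall on optimal boundaries. Purely illustrative.
from dataclasses import dataclass

@dataclass
class IOInfo:
    minimum_io_size: int
    optimal_io_size: int
    optimal_io_alignment: int
    mem_alignment: int

def align_io(offset, length, info):
    """Expand [offset, offset+length) to optimally aligned boundaries."""
    a = info.optimal_io_alignment
    start = (offset // a) * a                      # round start down
    end = ((offset + length + a - 1) // a) * a     # round end up
    return start, end - start

# Pretend the filesystem reports a 256kB stripe unit:
info = IOInfo(minimum_io_size=512, optimal_io_size=262144,
              optimal_io_alignment=262144, mem_alignment=4096)
start, length = align_io(300_000, 10_000, info)
assert start % info.optimal_io_alignment == 0
assert length % info.optimal_io_size == 0
assert start <= 300_000 and start + length >= 310_000
```

The point of asking the filesystem rather than the raw device, as Dave notes, is that these numbers could differ per file on tiered storage, so the query belongs at the filesystem layer.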
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On 14/01/14 14:09, Dave Chinner wrote: On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote: On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote: [...] The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. This is absolutely necessary to maintain consistency. Postgres would need to be able to mark pages as unflushable until some point in time in the future when the journal is flushed. We discussed various ways that interface could work but it would be tricky to keep it low enough overhead to be workable. IMO, the concept of allowing userspace to pin dirty page cache pages in memory is just asking for trouble. Apart from the obvious memory reclaim and OOM issues, some filesystems won't be able to move their journals forward until the data is flushed. i.e. ordered mode data writeback on ext3 will have all sorts of deadlock issues that result from pinning pages and then issuing fsync() on another file which will block waiting for the pinned pages to be flushed. Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);? If fsync() blocks because there are pinned pages, and there's no other thread to unpin them, then that code just deadlocked. If fsync() doesn't block and skips the pinned pages, then we haven't done an fsync() at all, and so violated the expectation that users have that after fsync() returns their data is safe on disk. And if we return an error to fsync(), then what the hell does the user do if it is some other application we don't know about that has pinned the pages? And if the kernel unpins them after some time, then we just violated the application's consistency guarantees [...] What if Postgres could tell the kernel how strongly that it wanted to hold on to the pages? 
Say a byte (this is arbitrary; it could be a single hint bit which meant "please, Please, PLEASE don't flush, if that is okay with you Mr Kernel..."), so strength would be S = (unsigned byte value)/256, so 0 <= S < 1.

S = 0: flush now.
0 < S < 1: flush if the 'need' is greater than S.
S = 1: never flush (note a value of 1 cannot occur, as max S = 255/256).

Postgres could use low non-zero S values if it thinks that pages /might/ still be useful later, and very high values when it is /more certain/. I am sure Postgres must sometimes know when some pages are more important to hold onto than others, hence my feeling that S should be more than one bit.

The kernel might simply flush pages starting at ones with low values of S, working upwards until it has freed enough memory to resolve its memory pressure. So an explicit numerical value of 'need' (as implied above) is not required. Also, any practical implementation would not use 'S' as a float/double, but use integer values for 'S' and 'need' - assuming that 'need' did have to be an actual value, which I suspect would not be required.

This way the kernel is free to flush all such pages when sufficient need arises - yet usually, when there is sufficient memory, the pages will be held unflushed.

Cheers, Gavin
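On the kernel side, Gavin's scheme reduces to "under pressure, flush pages in increasing order of S". A toy model of that ordering (the function, page names, and S values are invented for illustration):

```python
# Toy eviction policy for the single-byte "stickiness" hint: the kernel
# flushes the least-sticky pages first until it has freed enough memory,
# so no explicit numeric 'need' ever has to be computed.

def pages_to_flush(pages, needed):
    """pages: dict of name -> S in [0, 255]; flush lowest-S first."""
    order = sorted(pages, key=pages.get)
    return order[:needed]

pages = {"scratch": 0, "sort_run": 40, "wal_tail": 250}
assert pages_to_flush(pages, 1) == ["scratch"]
assert pages_to_flush(pages, 2) == ["scratch", "sort_run"]
```

Robert's reply below questions the premise: Postgres may only ever know enough to distinguish "don't need this, but keep it if memory is free" from "need this", i.e. one bit rather than a full byte.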
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 2:03 PM, Gavin Flower gavinflo...@archidevsys.co.nz wrote: [...] The kernel might simply flush pages starting at ones with low values of S working upwards until it has freed enough memory to resolve its memory pressure. [...]

Well, this just begs the question of what value PG ought to pass as the parameter. I think the alternate don't-need semantics (we don't think we need this but please don't throw it away arbitrarily if there's no memory pressure) would be a big win. I don't think we know enough in user space to be more precise than that.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 08:03:28AM +1300, Gavin Flower wrote: [...] What if Postgres could tell the kernel how strongly that it wanted to hold on to the pages?
That doesn't get rid of the problems, it just makes it harder to diagnose them when they occur. :/ Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, 15 Jan 2014 09:23:52 +1100 Dave Chinner da...@fromorbit.com wrote:

It appears to me that we are seeing large memory machines much more commonly in data centers - a couple of years ago 256GB RAM was only seen in supercomputers. Hence machines of this size are moving from the "tweaking settings for supercomputers is OK" class to the "tweaking settings for enterprise servers is not OK" class. Perhaps what we need to do is deprecate dirty_ratio and dirty_background_ratio as the defaults, move to the byte-based values as the defaults, and cap them appropriately, e.g. 10/20% of RAM for small machines down to a couple of GB for large machines.

I had thought that was already in the works... it hits people on far smaller systems than those described here. http://lwn.net/Articles/572911/ I wonder if anybody ever finished this work out for 3.14?

jon
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 05:38:10PM -0700, Jonathan Corbet wrote: [...] I had thought that was already in the works... it hits people on far smaller systems than those described here. http://lwn.net/Articles/572911/ I wonder if anybody ever finished this work out for 3.14?

Not that I know of. This patch was suggested as the solution to the slow/fast drive issue that started the whole thread: http://thread.gmane.org/gmane.linux.kernel/1584789/focus=1587059 but I don't see it in a current kernel. It might be in Andrew's tree for 3.14, but I haven't checked.

However, most of the discussion in that thread about dirty limits was a side show that rehashed old territory. Rate limiting and throttling in a generic, scalable manner is a complex problem. We've got some of the infrastructure we need to solve the problem, but there was no conclusion as to the correct way to connect all the dots. Perhaps it's another topic for the LSF/MM conf?

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote: For one, postgres doesn't use mmap for files (and can't without major new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has horrible consequences for performance/scalability - very quickly you contend on locks in the kernel.

I may as well dump this in this thread. We've discussed this in person a few times, including at least once with Ted Ts'o when he visited Dublin last year. The fundamental conflict is that the kernel better understands the hardware and the other software using the same resources, while Postgres better understands its own access patterns. We need to either add interfaces so Postgres can teach the kernel what it needs to know about its access patterns, or add interfaces so Postgres can find out what it needs to know about the hardware context.

The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. This is absolutely necessary to maintain consistency. Postgres would need to be able to mark pages as unflushable until some point in time in the future when the journal is flushed. We discussed various ways that interface could work, but it would be tricky to keep it low enough overhead to be workable.

The less exciting, more conservative option would be to add kernel interfaces to teach Postgres about things like raid geometries. Then Postgres could use directio and decide to do prefetching based on the raid geometry, how much i/o bandwidth and how many iops are available, etc. Reimplementing i/o schedulers and all the rest of the work that the kernel provides inside Postgres just seems like something outside our competency, and that none of us is really excited about doing.

-- greg
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
Everyone, I am looking for one or more hackers to go to Collab with me to discuss this. If you think that might be you, please let me know and I'll look for funding for your travel. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 21:29 +, Greg Stark wrote: [...] The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. [...]

So in this case, the question would be what additional information do we need to exchange that's not covered by the existing interfaces. Between madvise and splice, we seem to have most of what you want; what's missing?

The less exciting, more conservative option would be to add kernel interfaces to teach Postgres about things like raid geometries. Then Postgres could use directio and decide to do prefetching based on the raid geometry, how much available i/o bandwidth and iops is available, etc.
Reimplementing i/o schedulers and all the rest of the work that the kernel provides inside Postgres just seems like something outside our competency and that none of us is really excited about doing.

This would also be a well-trodden path ... I believe that some large database company introduced Direct IO for roughly this purpose.

James