Re: [HACKERS] Could we replace SysV semaphores with latches?
On 07.06.2012 07:09, Tom Lane wrote:

There has been regular griping on this list about our dependence on SysV shared memory, but not so much about SysV semaphores, even though the latter cause their fair share of issues; as seen for example in buildfarm member spoonbill's recent string of failures:

    creating template1 database in /home/pgbuild/pgbuildfarm/HEAD/pgsql.25563/src/test/regress/./tmp_check/data/base/1 ...
    FATAL: could not create semaphores: No space left on device
    DETAIL: Failed system call was semget(1, 17, 03600).
    HINT: This error does *not* mean that you have run out of disk space. It occurs when either the system limit
    for the maximum number of semaphore sets (SEMMNI), or the system wide maximum number of semaphores (SEMMNS),
    would be exceeded. You need to raise the respective kernel parameter. Alternatively, reduce PostgreSQL's
    consumption of semaphores by reducing its max_connections parameter. The PostgreSQL documentation contains
    more information about configuring your system for PostgreSQL.
    child process exited with exit code 1

It strikes me that we have recently put together an independent but just about equivalent waiting mechanism in the form of latches. And not only that, but there's already a latch for each process. Could we replace our usages of SysV semaphores with WaitLatch on the procLatch? Unlike the situation with shared memory where we need some secondary features (mumble shm_nattch mumble), I think we aren't really using anything interesting about SysV semaphores except for the raw ability to wait for somebody to signal us.

We would have to performance-test that carefully. We use semaphores in lwlocks, so this is performance-critical. A latch might well be slower, especially on platforms where a signal does not interrupt sleep and we rely on the signal handler and self-pipe to wake up. Although perhaps we could improve the latch implementation; pselect() might perform better than the self-pipe trick, on platforms where it works.
--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Interrupting long external library calls
On Sat, May 26, 2012 at 01:24:26PM +0200, Florian Pflug wrote:

On May 26, 2012, at 12:40, Simon Riggs wrote:

On 25 May 2012 17:34, Tom Lane t...@sss.pgh.pa.us wrote:

I assume that the geos::util::Interrupt::request() call sets a flag somewhere that's going to be periodically checked in long-running loops. Would it be possible for the periodic checks to include a provision for a callback into Postgres-specific glue code, wherein you could test the same flags CHECK_FOR_INTERRUPTS does? A similar approach might then be usable in other contexts, and it seems safer to me than messing with a host environment's signal handling.

Can we do that as a macro, e.g. POSTGRES_LIBRARY_INTERRUPT_CHECK()? Otherwise the fix to this problem may be worse than the problem - faulty interrupt handlers are worse than none at all.

As it stands, ProcessInterrupts() sometimes returns instead of ereport()ing even if InterruptPending is set, interrupts aren't held off, and the code isn't in a critical section. That happens if QueryCancelPending (or however that's called) gets reset after a SIGINT arrived but before CHECK_FOR_INTERRUPTS() is called. Or at least that is how I interpret the comment at the bottom of that function...

We could thus easily provide POSTGRES_LIBRARY_INTERRUPT_CHECK() if we're content with the (slim, probably) chance of false positives. Or we'd need to refactor things in a way that allows the hypothetical POSTGRES_LIBRARY_INTERRUPT_CHECK() to re-use the tests in ProcessInterrupts(), but without modifying any of the flags.

So back to this, shortly after discovering (sorry for my ignorance, but I just don't care for programming in a black-box environment) that Windows doesn't support POSIX signals. If I understood correctly, the CHECK_FOR_INTERRUPTS macro also takes care of event dispatching on Windows, so that if nothing calls that macro there's no way a pqsignal handler would be invoked. Is that right?
In that case I can understand Tom's advice about providing a callback, and then I would only need to perform the event-flushing part from within the callback, and only for Windows. Does that sound right?

--strk;
Re: [HACKERS] Ability to listen on two unix sockets
On Wed, Jun 6, 2012 at 1:55 PM, Robert Haas robertmh...@gmail.com wrote:

On Wed, Jun 6, 2012 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:

Well, that's what I wanted to discuss before Honza starts coding. It's not obvious that there are any use-cases for more than two. It's also not clear whether there is any value in supporting run-time rather than build-time configuration of the socket locations. The Fedora use-case has no need of that, but if people can point to other cases where it would be sensible, we can write the patch that way.

You might think we should design this exactly like the TCP-socket multiple-listen-addresses case, i.e. just have a config variable containing a list of directory names. The sticking point there is that the directories aren't really interchangeable. In particular, there is still going to be one directory that is the one hard-wired into libpq. So whereas multiple TCP sockets really are pretty much interchangeable, I think in the Unix-socket case we are going to have to think of it as being a primary socket and one or more alternate sockets. Is there a reason to have more than one alternate, and if so what is the use-case? (BTW, we would probably just adopt the Debian solution if we were sure there were no non-libpq clients out there; but we aren't.)

I recently had an urge to make it possible for the postmaster to listen on multiple ports, and even went so far as to code up a patch to allow that. It still applies, with offsets, so I'll attach it here. So I guess I'm +1 on the idea of allowing N UNIX sockets rather than limiting it to N=2, and really I'd like to do one better and allow listening on multiple TCP ports as well. Since the PID file contains the port number, multiple TCP sockets stop being interchangeable as soon as you allow multiple ports, but that's not very difficult to handle. Now, you might ask whether this has any real-world value, and obviously I'm going to say yes, or I wouldn't be proposing it.
The reason for wanting multiple UNIX sockets is that those sockets might be in different places that are not all equally accessible to everyone, because of things like chroot. But of course the same thing is possible in the network space using iptables and similar tools. For example, you might want to have users connect to application A using port 5432, and to application B using port 15432. Now you can use network monitoring tools to see how much data each application is sending and receiving, without needing deep packet inspection. You can firewall those ports differently to provide access to different groups of users. And you can even decide, if the database gets overloaded, to cut off access to one of those ports, so that the application causing the problem becomes inaccessible but the rest of the database ceases being overloaded and you can still operate. Of course, you could also do that by changing pg_hba.conf, but for some people it might be more convenient (or feel more bullet-proof) to do it using network management tools. There are probably other use cases as well.

+1 for multiple TCP port numbers. A few days ago I started working on enabling Postgres to communicate using the WebSockets protocol (acting as a wrapper around FE/BE), and I found it difficult (not impossible) to use the same port for communicating the FE/BE protocol and for https+WebSockets too. It would have been a lot simpler if I could say that WebSockets is enabled on 5431 and FE/BE on 5432.

Regards,
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Re: [HACKERS] Avoiding adjacent checkpoint records
On 6 June 2012 20:08, Tom Lane t...@sss.pgh.pa.us wrote:

In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the rule for when to skip checkpoints on the grounds that not enough activity has happened since the last one. However, that commit left the comment block about it in a nonsensical state:

     * If this isn't a shutdown or forced checkpoint, and we have not switched
     * to the next WAL file since the start of the last checkpoint, skip the
     * checkpoint. The idea here is to avoid inserting duplicate checkpoints
     * when the system is idle. That wastes log space, and more importantly it
     * exposes us to possible loss of both current and previous checkpoint
     * records if the machine crashes just as we're writing the update.
     * (Perhaps it'd make even more sense to checkpoint only when the previous
     * checkpoint record is in a different xlog page?)

The new code entirely fails to prevent writing adjacent checkpoint records, because what it checks is the distance from the previous checkpoint's REDO pointer, not the previous checkpoint record itself.

You were completely involved in the original discussion of this, as were others. I made the change to use the redo pointer at your request. So when you say "Simon did this", what you mean is "Simon acted according to group consensus on an agreed feature".

How come this is a valid discussion? Why does making changes here make sense when other changes are said to destabilise the code and delay release? Should I revisit all the things others have done that I disagree with as well? No, I should not. Nor should we be trawling through changes made by me either.

I'll look some more at this, but you should consider why this thread even exists.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Time for pgindent run?
On Thursday, June 7, 2012, Robert Haas wrote:

On Tue, Jun 5, 2012 at 10:25 AM, Bruce Momjian br...@momjian.us wrote:

On Tue, Jun 05, 2012 at 10:21:14AM -0400, Tom Lane wrote:

Bruce Momjian br...@momjian.us writes:

Is everyone ready for me to run pgindent? We are nearing the first commit-fest (June 15) and will have to branch the git tree soon.

Also, we should do the pgindent run well before the commitfest, so that authors of pending patches have time to rebase their patches in case pgindent changes the code they are patching ...

Ah, good point. That will affect commit-fest patches. We could run it only on the 9.3 branch, but that makes double-patching very hard. Is everyone good for a pgindent run this week? The sooner the better.

+1. Anything people have pending for 9.2 has to be pretty small by now (I certainly hope), and it would give contributors a good chance to rebase before the first CF for 9.3.

//Magnus

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
[HACKERS] SP-GIST docs and examples
I know Oleg has been discussing quad trees for years now, so SP-GIST sounds like a great feature.

The docs on SP-GIST simply suggest people read the code, which is way below our normal standard of documentation. CREATE INDEX contains no examples involving SP-GIST indexes.

It seems likely that the first things a user will ask are: "When can I use it? How do I create one? When should I use it in comparison with existing methods?"

ISTM that if we mention this as a major feature in 9.2, and I think we should, that there should be something more about these in the shared docs.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] \conninfo and SSL
On Wednesday, June 6, 2012, Alvaro Herrera wrote:

Excerpts from Robert Haas's message of mié jun 06 14:45:46 -0400 2012:

On Sun, Jun 3, 2012 at 5:30 AM, Alastair Turner b...@ctrlf5.co.za wrote:

A one-line change adds the SSL info on its own line, like:

--
You are connected to database "scratch" as user "scratch" on host "127.0.0.1" at port "5432".
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
--

Does this need a more integrated presentation, and therefore a broader change to make it translatable?

+1 for doing it that way.

Yeah, printSSLInfo already outputs translated stuff, so this should be OK. Merging both messages into a single translatable unit would be pretty cumbersome, for no practical gain.

Seems like a very low-impact change. Are people OK with sneaking this into 9.2?

//Magnus

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] pg_receivexlog and feedback message
On Thursday, June 7, 2012, Fujii Masao wrote:

On Thu, Jun 7, 2012 at 5:05 AM, Magnus Hagander mag...@hagander.net wrote:

On Wed, Jun 6, 2012 at 8:26 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Tue, Jun 5, 2012 at 11:44 PM, Magnus Hagander mag...@hagander.net wrote:

On Tue, Jun 5, 2012 at 4:42 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Tue, Jun 5, 2012 at 9:53 PM, Magnus Hagander mag...@hagander.net wrote:

Right now, pg_receivexlog sets:

    replymsg->write = InvalidXLogRecPtr;
    replymsg->flush = InvalidXLogRecPtr;
    replymsg->apply = InvalidXLogRecPtr;

when it sends its status updates. I'm thinking it should set replymsg->write = blockpos instead.

Why?

That way you can see in pg_stat_replication what has actually been received by pg_receivexlog - not just what we last sent. This can be useful in combination with an archive_command that can block WAL recycling until it has been saved to the standby. And it would be useful as a general monitoring thing as well. I think the original reason was that it shouldn't interfere with synchronous replication - but it does take away a fairly useful usecase...

I think that not only replymsg->write but also ->flush should be set to blockpos in pg_receivexlog. That would allow pg_receivexlog to behave as a synchronous standby, so we could write WAL to both local and remote synchronously. I believe there are some use cases for synchronous pg_receivexlog.

pg_receivexlog doesn't currently fsync() after every write. It only fsync():s complete files. So we'd need to set ->flush only at the end of a segment, right?

Yes. Currently the status update is sent for each status interval. In sync replication, a transaction has to wait for a while even after pg_receivexlog has written or flushed the WAL data. So should we add a new option which specifies whether pg_receivexlog sends the status packet back as soon as it writes or flushes the WAL data, like walreceiver does?
That might be useful, but I think that's 9.3 material at this point.

Fair enough. That's a new feature rather than a bugfix. But I think we can get setting the write location in as a bugfix. Also set the flush location? Sending the flush location back seems helpful when using pg_receivexlog for WAL archiving purposes. By seeing the flush location we can ensure that a WAL file has been archived durably (IOW, that the WAL file has been flushed in the remote archive area).

You can do that with the write location as well, as long as you round it off to complete segments, can't you? In fact, that's exactly the usecase that got me to realize we were missing this :-)

//Magnus

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] SP-GIST docs and examples
On Thursday, June 7, 2012, Simon Riggs wrote:

I know Oleg has been discussing quad trees for years now, so SP-GIST sounds like a great feature. The docs on SP-GIST simply suggest people read the code, which is way below our normal standard of documentation.

It's the same as what we have for normal GiST indexes, no? It's under "internals"; I think it's OK in there.

CREATE INDEX contains no examples involving SP-GIST indexes. It seems likely that the first things a user will ask are: "When can I use it? How do I create one? When should I use it in comparison with existing methods?"

This is the problem though - it only has internals documentation, and no end-user documentation.

Not sure GiST has that much end-user documentation either. And it is included in things like the FTI docs.

ISTM that if we mention this as a major feature in 9.2, and I think we should, that there should be something more about these in the shared docs.

+1, it should definitely get some more docs if possible.

//Magnus

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] Interrupting long external library calls
On Jun 7, 2012, at 10:20, Sandro Santilli wrote:

On Sat, May 26, 2012 at 01:24:26PM +0200, Florian Pflug wrote:

On May 26, 2012, at 12:40, Simon Riggs wrote:

On 25 May 2012 17:34, Tom Lane t...@sss.pgh.pa.us wrote:

I assume that the geos::util::Interrupt::request() call sets a flag somewhere that's going to be periodically checked in long-running loops. Would it be possible for the periodic checks to include a provision for a callback into Postgres-specific glue code, wherein you could test the same flags CHECK_FOR_INTERRUPTS does? A similar approach might then be usable in other contexts, and it seems safer to me than messing with a host environment's signal handling.

Can we do that as a macro, e.g. POSTGRES_LIBRARY_INTERRUPT_CHECK()? Otherwise the fix to this problem may be worse than the problem - faulty interrupt handlers are worse than none at all.

As it stands, ProcessInterrupts() sometimes returns instead of ereport()ing even if InterruptPending is set, interrupts aren't held off, and the code isn't in a critical section. That happens if QueryCancelPending (or however that's called) gets reset after a SIGINT arrived but before CHECK_FOR_INTERRUPTS() is called. Or at least that is how I interpret the comment at the bottom of that function...

We could thus easily provide POSTGRES_LIBRARY_INTERRUPT_CHECK() if we're content with the (slim, probably) chance of false positives. Or we'd need to refactor things in a way that allows the hypothetical POSTGRES_LIBRARY_INTERRUPT_CHECK() to re-use the tests in ProcessInterrupts(), but without modifying any of the flags.

So back to this, shortly after discovering (sorry for my ignorance, but I just don't care for programming in a black-box environment) that Windows doesn't support POSIX signals. If I understood correctly, the CHECK_FOR_INTERRUPTS macro also takes care of event dispatching on Windows, so that if nothing calls that macro there's no way a pqsignal handler would be invoked. Is that right?

Yes.
In that case I can understand Tom's advice about providing a callback, and then I would only need to perform the event-flushing part from within the callback, and only for Windows.

Why would you need a signal handler in the library at all, then? Just test the same flags that CHECK_FOR_INTERRUPTS does in the callback, and call your interrupt request method if they indicate abort. (Or, slightly cleaner maybe, allow the callback to abort processing by returning false.)

best regards,
Florian Pflug
Re: [HACKERS] Interrupting long external library calls
On Thu, Jun 07, 2012 at 12:00:09PM +0200, Florian Pflug wrote:

On Jun 7, 2012, at 10:20, Sandro Santilli wrote:

In that case I can understand Tom's advice about providing a callback, and then I would only need to perform the event-flushing part from within the callback, and only for Windows.

Why would you need a signal handler in the library at all, then? Just test the same flags that CHECK_FOR_INTERRUPTS does in the callback, and call your interrupt request method if they indicate abort. (Or, slightly cleaner maybe, allow the callback to abort processing by returning false.)

I'm just afraid that invoking a callback (from a library into user code) could be significantly slower than simply looking up a variable, especially if the interruption check is performed very frequently. But maybe I'm being overparanoid.

--strk;

  ,--o-.
  |   __/  |   Delivering high quality PostGIS 2.0 !
  |  / 2.0 |   http://strk.keybit.net - http://vizzuality.com
  `-o--'
Re: [HACKERS] Interrupting long external library calls
On Jun 7, 2012, at 12:08, Sandro Santilli wrote:

On Thu, Jun 07, 2012 at 12:00:09PM +0200, Florian Pflug wrote:

On Jun 7, 2012, at 10:20, Sandro Santilli wrote:

In that case I can understand Tom's advice about providing a callback, and then I would only need to perform the event-flushing part from within the callback, and only for Windows.

Why would you need a signal handler in the library at all, then? Just test the same flags that CHECK_FOR_INTERRUPTS does in the callback, and call your interrupt request method if they indicate abort. (Or, slightly cleaner maybe, allow the callback to abort processing by returning false.)

I'm just afraid that invoking a callback (from a library into user code) could be significantly slower than simply looking up a variable, especially if the interruption check is performed very frequently. But maybe I'm being overparanoid.

It's definitely slower, probably at least an order of magnitude. But you don't have to call that callback very often - once every few hundred milliseconds is already more than plenty. Isn't there a place in the library where you could put it where it'd be called with roughly that frequency?

IMHO the advantage of not messing with signal handling in the library is more than worth the slight overhead of a callback, but YMMV.

best regards,
Florian Pflug
Re: [HACKERS] pg_receivexlog and feedback message
On Thu, Jun 7, 2012 at 6:25 PM, Magnus Hagander mag...@hagander.net wrote:

On Thursday, June 7, 2012, Fujii Masao wrote:

On Thu, Jun 7, 2012 at 5:05 AM, Magnus Hagander mag...@hagander.net wrote:

On Wed, Jun 6, 2012 at 8:26 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Tue, Jun 5, 2012 at 11:44 PM, Magnus Hagander mag...@hagander.net wrote:

On Tue, Jun 5, 2012 at 4:42 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Tue, Jun 5, 2012 at 9:53 PM, Magnus Hagander mag...@hagander.net wrote:

Right now, pg_receivexlog sets:

    replymsg->write = InvalidXLogRecPtr;
    replymsg->flush = InvalidXLogRecPtr;
    replymsg->apply = InvalidXLogRecPtr;

when it sends its status updates. I'm thinking it should set replymsg->write = blockpos instead.

Why?

That way you can see in pg_stat_replication what has actually been received by pg_receivexlog - not just what we last sent. This can be useful in combination with an archive_command that can block WAL recycling until it has been saved to the standby. And it would be useful as a general monitoring thing as well. I think the original reason was that it shouldn't interfere with synchronous replication - but it does take away a fairly useful usecase...

I think that not only replymsg->write but also ->flush should be set to blockpos in pg_receivexlog. That would allow pg_receivexlog to behave as a synchronous standby, so we could write WAL to both local and remote synchronously. I believe there are some use cases for synchronous pg_receivexlog.

pg_receivexlog doesn't currently fsync() after every write. It only fsync():s complete files. So we'd need to set ->flush only at the end of a segment, right?

Yes. Currently the status update is sent for each status interval. In sync replication, a transaction has to wait for a while even after pg_receivexlog has written or flushed the WAL data. So should we add a new option which specifies whether pg_receivexlog sends the status packet back as soon as it writes or flushes the WAL data, like walreceiver does?
That might be useful, but I think that's 9.3 material at this point.

Fair enough. That's a new feature rather than a bugfix. But I think we can get setting the write location in as a bugfix.

Also set the flush location? Sending the flush location back seems helpful when using pg_receivexlog for WAL archiving purposes. By seeing the flush location we can ensure that a WAL file has been archived durably (IOW, that the WAL file has been flushed in the remote archive area).

You can do that with the write location as well, as long as you round it off to complete segments, can't you?

You mean to prevent pg_receivexlog from sending back the end of a WAL file as the write location *before* it completes the WAL file? If so, yes. But why do you want to keep the flush location invalid?

Regards,
--
Fujii Masao
Re: [HACKERS] pg_receivexlog and feedback message
On Thursday, June 7, 2012, Fujii Masao wrote:

On Thu, Jun 7, 2012 at 6:25 PM, Magnus Hagander mag...@hagander.net wrote:

On Thursday, June 7, 2012, Fujii Masao wrote:

On Thu, Jun 7, 2012 at 5:05 AM, Magnus Hagander mag...@hagander.net wrote:

On Wed, Jun 6, 2012 at 8:26 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Tue, Jun 5, 2012 at 11:44 PM, Magnus Hagander mag...@hagander.net wrote:

On Tue, Jun 5, 2012 at 4:42 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Tue, Jun 5, 2012 at 9:53 PM, Magnus Hagander mag...@hagander.net wrote:

Right now, pg_receivexlog sets:

    replymsg->write = InvalidXLogRecPtr;
    replymsg->flush = InvalidXLogRecPtr;
    replymsg->apply = InvalidXLogRecPtr;

when it sends its status updates. I'm thinking it should set replymsg->write = blockpos instead.

Why?

That way you can see in pg_stat_replication what has actually been received by pg_receivexlog - not just what we last sent. This can be useful in combination with an archive_command that can block WAL recycling until it has been saved to the standby. And it would be useful as a general monitoring thing as well. I think the original reason was that it shouldn't interfere with synchronous replication - but it does take away a fairly useful usecase...

I think that not only replymsg->write but also ->flush should be set to blockpos in pg_receivexlog. That would allow pg_receivexlog to behave as a synchronous standby, so we could write WAL to both local and remote synchronously. I believe there are some use cases for synchronous pg_receivexlog.

pg_receivexlog doesn't currently fsync() after every write. It only fsync():s complete files. So we'd need to set ->flush only at the end of a segment, right?

Yes. Currently the status update is sent for each status interval. In sync replication, a transaction has to wait for a while even after pg_receivexlog has written or flushed the WAL data.
So should we add a new option which specifies whether pg_receivexlog sends the status packet back as soon as it writes or flushes the WAL data, like walreceiver does?

That might be useful, but I think that's 9.3 material at this point.

Fair enough. That's a new feature rather than a bugfix. But I think we can get setting the write location in as a bugfix.

Also set the flush location? Sending the flush location back seems helpful when using pg_receivexlog for WAL archiving purposes. By seeing the flush location we can ensure that a WAL file has been archived durably (IOW, that the WAL file has been flushed in the remote archive area).

You can do that with the write location as well, as long as you round it off to complete segments, can't you?

You mean to prevent pg_receivexlog from sending back the end of a WAL file as the write location *before* it completes the WAL file? If so, yes. But why do you want to keep the flush location invalid?

No. pg_receivexlog sends back the correct write location. Whoever does the check (through pg_stat_replication) rounds down, so it only counts it once pg_receivexlog has acknowledged receiving the whole file.

I'm not against doing the flush location as well, I'm just worried about feature creep :-) But let's see how big a change that would turn out to be...

//Magnus

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] SP-GIST docs and examples
On Thu, Jun 7, 2012 at 5:19 AM, Simon Riggs si...@2ndquadrant.com wrote:

I know Oleg has been discussing quad trees for years now, so SP-GIST sounds like a great feature. The docs on SP-GIST simply suggest people read the code, which is way below our normal standard of documentation.

I completely agree. The GIN and GiST documentation are pretty deficient as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] SP-GIST docs and examples
On 7 June 2012 12:52, Robert Haas robertmh...@gmail.com wrote:

On Thu, Jun 7, 2012 at 5:19 AM, Simon Riggs si...@2ndquadrant.com wrote:

I know Oleg has been discussing quad trees for years now, so SP-GIST sounds like a great feature. The docs on SP-GIST simply suggest people read the code, which is way below our normal standard of documentation.

I completely agree. The GIN and GiST documentation are pretty deficient as well.

Didn't Andrew Gierth volunteer to help rewrite those some time back? Maybe nudge him about it again.

--
Thom
Re: [HACKERS] creating objects in pg_catalog
On Wed, Jun 6, 2012 at 5:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Robert Haas robertmh...@gmail.com writes:

    rhaas=# create table pg_catalog.tom (a int);
    ERROR:  permission denied to create "pg_catalog.tom"

The offending error check is in heap_create(), and based on what you're saying here it seems like we should just rip it out.

Hmm. Yeah, it seems like the regular permissions tests on the schemas in question should be enough to keep Joe User from making tables there, and I do not see a reason why the backend would care if there are non-catalog tables lying about in pg_catalog. Checking the commit history, it seems this was originally a test to prevent people from creating tables named pg_xxx: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f5a10e722c052006886b678995695001958a#patch3 which may or may not have been critical once upon a time, but surely is not any more. So no objection to removing that particular test.

Patch attached. Barring objections, I'll apply this to 9.3 once we branch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Attachment: remove-heap-create-check.patch (binary data)]
Re: [HACKERS] SP-GIST docs and examples
On Thu, 7 Jun 2012, Thom Brown wrote:

On 7 June 2012 12:52, Robert Haas robertmh...@gmail.com wrote:

On Thu, Jun 7, 2012 at 5:19 AM, Simon Riggs si...@2ndquadrant.com wrote:

I know Oleg has been discussing quad trees for years now, so SP-GIST sounds like a great feature. The docs on SP-GIST simply suggest people read the code, which is way below our normal standard of documentation.

I completely agree. The GIN and GiST documentation are pretty deficient as well.

Didn't Andrew Gierth volunteer to help rewrite those some time back? Maybe nudge him about it again.

Oh, please!

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
On Tue, 5 Jun 2012, Simon Riggs wrote: Sounds less good and we'd need reasonable proof it actually did anything useful without being dangerous. Doing an initial unlocked test speeds things up another 2.69 fold (on top of 3.55 for your patch) for me, with 1GB of shared buffers. That seems like it should be worthwhile. How do we go about getting reasonable proof that it is safe? That's enough for me. So is it planned to apply that patch for 9.2 ? Thanks, S * Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
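[Editor's note] The "initial unlocked test" being discussed is the classic double-checked pattern: read the buffer tag without the spinlock, and only take the lock (and re-check) when the cheap read suggests a match. A minimal sketch of that shape, with invented names (MiniBuffer, drop_if_matches) rather than PostgreSQL's actual BufferDesc machinery:

```c
#include <stdbool.h>
#include <pthread.h>

/* Toy stand-in for a buffer header; the names here are invented for
 * illustration and are not PostgreSQL's actual BufferDesc fields. */
typedef struct
{
    pthread_mutex_t lock;       /* stands in for the buffer-header spinlock */
    int             rnode;      /* stands in for the relation in the buffer tag */
    bool            invalidated;
} MiniBuffer;

/*
 * The unlocked read is only a hint: if it is stale we either take the lock
 * unnecessarily (harmless) or skip a buffer a concurrent writer is just now
 * re-tagging -- which is exactly the safety question raised in the thread.
 * The authoritative test is always repeated under the lock.
 */
void
drop_if_matches(MiniBuffer *buf, int target)
{
    if (buf->rnode != target)   /* cheap unlocked pre-test */
        return;

    pthread_mutex_lock(&buf->lock);
    if (buf->rnode == target)   /* re-check under the lock */
        buf->invalidated = true;
    pthread_mutex_unlock(&buf->lock);
}
```

The speedup comes from skipping the lock for the vast majority of buffers that do not belong to the dropped relation; whether the stale-read window is harmless in the real code is the proof Simon is asking for.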
Re: [HACKERS] Early hint bit setting
On Wed, Jun 6, 2012 at 5:41 PM, Jim Nasby j...@nasby.net wrote: On 5/30/12 4:42 PM, Ants Aasma wrote: I was thinking about what is the earliest time where we could set hint bits. This would be just after the commit has been made visible. Except that's only true when there are no other transactions running. That's been one of the big sticking points about trying to proactively set hint bits; in a real system you're not going to gain very much unless you wait a while before setting them. Are you sure? The relevant code to set hint bits during a tuple scan looks like this:

    else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple)))
    {
        if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
            return false;       /* inserted after scan started */

        if (tuple->t_infomask & HEAP_XMAX_INVALID)      /* xid invalid */
            return true;

        if (tuple->t_infomask & HEAP_IS_LOCKED)         /* not deleter */
            return true;

        Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));

        if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tuple)))
        {
            /* deleting subtransaction must have aborted */
            SetHintBits(tuple, buffer, HEAP_XMAX_INVALID, InvalidTransactionId);
            return true;
        }

        if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
            return true;        /* deleted after scan started */
        else
            return false;       /* deleted before scan started */
    }
    else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple)))
        return false;
    else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
        SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, HeapTupleHeaderGetXmin(tuple));
    else
    {
        /* it must have aborted or crashed */
        SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, InvalidTransactionId);
        return false;
    }

The backend that commits the transaction knows that the transaction is committed and that it's not in progress (at least from itself). Why do you have to wait for other transactions in progress to finish? Setting the xmin committed bit doesn't keep you from checking the xmax based rules.
merlin
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Wed, Jun 6, 2012 at 3:07 PM, Andres Freund and...@2ndquadrant.com wrote: Hm. For a short while I thought there would be an issue with heap_delete and IOS because the deleting transaction can commit without any barriers happening on the IOS side. But that only seems to be possible with non MVCC snapshots which are currently not allowed with index only scans. Well, one, commits are irrelevant; the page ceases to be all-visible as soon as the delete happens. And two, you can't do a delete or a commit without a memory barrier - every LWLockAcquire() or LWLockRelease() is a full barrier, so any operation that requires a buffer content lock is both preceded and followed by a full barrier. Proposed patch attached. This adds some more comments in various places, and implements your suggestion of retesting the visibility-map bit when we detect a possible mismatch with the page-level bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company vm-test-cleanup.patch
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Thursday, June 07, 2012 03:20:50 PM Robert Haas wrote: On Wed, Jun 6, 2012 at 3:07 PM, Andres Freund and...@2ndquadrant.com wrote: Hm. For a short while I thought there would be an issue with heap_delete and IOS because the deleting transaction can commit without any barriers happening on the IOS side. But that only seems to be possible with non-MVCC snapshots which are currently not allowed with index only scans. Well, one, commits are irrelevant; the page ceases to be all-visible as soon as the delete happens. It's not irrelevant for the code as it stands if non-MVCC snapshots were allowed. Unless I miss something, even disregarding memory ordering issues, there is no interlocking providing protection against the index doing the visibilitymap_test() and some concurrent backend doing a heap_delete + commit directly after that. If a SnapshotNow were used this could result in different results between an IOS and a normal indexscan, because the normal indexscan would lock the heap page before doing the visibility testing and thus would see the new visibility information. But then a SnapshotNow wouldn't be used if that were a problem... For normal snapshots it's not a problem because, with regard to the snapshot, everything on the page is still visible; I think that can only happen for deletions and HOT updates, because the index page would need to get locked otherwise. Deletions aren't visible yet and HOT is irrelevant for the IOS by definition. And two, you can't do a delete or a commit without a memory barrier - every LWLockAcquire() or LWLockRelease() is a full barrier, so any operation that requires a buffer content lock is both preceded and followed by a full barrier. The memory barrier on the deleting side is irrelevant if there is no memory barrier on the reading side. Proposed patch attached.
This adds some more comments in various places, and implements your suggestion of retesting the visibility-map bit when we detect a possible mismatch with the page-level bit. Thanks, will look at it in a bit. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] XLog changes for 9.3
When I worked on the XLogInsert scaling patch, it became apparent that some changes to the WAL format would make it a lot easier. So for 9.3, I'd like to do some refactoring: 1. Use a 64-bit integer instead of the two-variable log/seg representation, for identifying a WAL segment. This has no user-visible effect, but makes the code a bit simpler. 2. Don't waste the last WAL segment in each logical 4GB file. Currently, we skip the WAL segment ending with FF. The comments claim that wasting the last segment ensures that we don't have problems representing last-byte-position-plus-1, but in my experience, it just makes things more complicated. You have two ways to represent the segment boundary, and some functions are picky on which one is used. For example, XLogWrite() assumes that when you want to flush to the end of a logical log file, you use the 5/FF00 representation, not 6/. Other functions, like XLogPageRead(), expect the latter. This is a backwards-incompatible change for external utilities that know how the WAL segment numbering works. Hopefully there aren't too many of those around. 3. Move the only field, xl_rem_len, from the continuation record header straight to the xlog page header, eliminating XLogContRecord altogether. This makes it easier to calculate in advance how much space a WAL record requires, as it no longer depends on how many pages it has to be split across. This wastes 4-8 bytes on every xlog page, but that's not much. 4. Allow WAL record header to be split across page boundaries. Currently, if there are less than SizeOfXLogRecord bytes left on the current WAL page, it is wasted, and the next record is inserted at the beginning of the next page. The problem with that is again that it makes it impossible to know in advance exactly how much space a WAL record requires, because it depends on how many bytes need to be wasted at the end of current page. These changes will help the XLogInsert scaling patch, by making the space calculations simpler. 
In essence, to reserve space for a WAL record of size X, you just need to do bytepos += X. There's a lot more details with that, like mapping from the contiguous byte position to an XLogRecPtr that takes page headers into account, and noticing RedoRecPtr changes safely, but it's a start. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
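[Editor's note] Points 1 and 2 together reduce segment identification to plain arithmetic: with the FF "hole" gone, segment numbers are continuous, and the old (log, seg) pair folds into one 64-bit integer. A rough sketch, assuming the default 16 MB segment size; the names seg_number and seg_split are made up for illustration:

```c
#include <stdint.h>

/* Default WAL segment size; 0x100000000 is the 4 GB logical-file span. */
#define XLOG_SEG_SIZE  (16 * 1024 * 1024)
#define XLogSegmentsPerXLogId  (UINT64_C(0x100000000) / XLOG_SEG_SIZE)

/* Fold the old (log, seg) pair into one 64-bit segment number, assuming
 * the skipped-FF "hole" from point 2 is gone, so numbering is continuous. */
static inline uint64_t
seg_number(uint32_t log, uint32_t seg)
{
    return (uint64_t) log * XLogSegmentsPerXLogId + seg;
}

/* ...and split it back, e.g. when building an on-disk segment file name. */
static inline void
seg_split(uint64_t segno, uint32_t *log, uint32_t *seg)
{
    *log = (uint32_t) (segno / XLogSegmentsPerXLogId);
    *seg = (uint32_t) (segno % XLogSegmentsPerXLogId);
}
```

With the hole preserved, the +1 in "last segment plus one" would have to skip a value; with continuous numbering it is ordinary integer increment.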
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
Sergey Koposov kopo...@ast.cam.ac.uk writes: On Tue, 5 Jun 2012, Simon Riggs wrote: Sounds less good and we'd need reasonable proof it actually did anything useful without being dangerous. Doing an initial unlocked test speeds things up another 2.69 fold (on top of 3.55 for your patch) for me, with 1GB of shared buffers. That seems like it should be worthwhile. How do we go about getting reasonable proof that it is safe? That's enough for me. Say what? That's a performance result and proves not a damn thing about safety. regards, tom lane
Re: [HACKERS] Avoiding adjacent checkpoint records
Tom Lane t...@sss.pgh.pa.us wrote: there is no guarantee that we'll manage to reach a database state that is consistent with data already flushed out to disk during the last checkpoint. Robert Haas robertmh...@gmail.com wrote: I know of real customers who would have suffered real data loss had this code been present in the server version they were using. Checkpoints are the *only* mechanism by which SLRU pages get flushed to disk on a mostly-idle system. That means if something happens to your pg_xlog directory, and you haven't had a checkpoint, you're screwed. Simon Riggs si...@2ndquadrant.com wrote: How come this is a valid discussion? Why does making changes here make sense when other changes are said to destabilise the code and delay release? I think the standard has been pretty clear -- at this point in a release we fix bugs and regressions from previous releases. The bar for introducing anything which doesn't qualify under either of those is very high in terms of being obviously useful and *very* low risk. Should I revisit all the things others have done that I disagree with as well? If you can spot any serious bugs, I would sure appreciate it if you point them out while we're still in beta. I think we all would. I'll look some more at this, but you should consider why this thread even exists. Because the beta release contains a new data loss bug which needs to be fixed. -Kevin
Re: [HACKERS] Avoiding adjacent checkpoint records
On Wed, Jun 6, 2012 at 6:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: Actually, it looks like there is an extremely simple way to handle this, which is to move the call of LogStandbySnapshot (which generates the WAL record in question) to before the checkpoint's REDO pointer is set, but after we have decided that we need a checkpoint. On further contemplation, there is a downside to that idea, which probably explains why the code was written as it was: if we place the XLOG_RUNNING_XACTS WAL record emitted during a checkpoint before rather than after the checkpoint's REDO point, then a hot standby slave starting up from that checkpoint won't process the XLOG_RUNNING_XACTS record. That means its KnownAssignedXids machinery won't be fully operational until the master starts another checkpoint, which might be awhile. So this could result in undesirable delay in hot standby mode becoming active. I am not sure how significant this really is though. Comments? I suspect that's pretty significant. If we don't like that, I can think of a couple of other ways to get there, but they have their own downsides: * Instead of trying to detect after-the-fact whether any concurrent WAL activity happened during the last checkpoint, we could detect it during the checkpoint and then keep the info in a static variable in the checkpointer process until next time. However, I don't see any bulletproof way to do this without adding at least one or two lines of code within XLogInsert, which I'm sure Robert will complain about. My main concern here is to avoid doing anything that will make things harder for Heikki's WAL insert scaling patch, which I'm hoping will get done for 9.3. What do you have in mind, exactly? I feel like ProcLastRecPtr might be enough information. After logging running xacts, we can check whether ProcLastRecPtr is equal to the redo pointer. If so, then nothing got written to WAL between the time we began the checkpoint and the time we wrote that record. 
If, through further, similar gyrations, we can then work out whether the checkpoint record immediately follows the running-xacts record, we're there. That's pretty ugly, I guess, but it seems possible. Alternatively, we could keep a flag in XLogCtl->Insert indicating whether anything that requires a new checkpoint has happened since the last checkpoint. This could be set inside the existing block that tests whether RedoRecPtr is out of date, so any given backend would only do it once per checkpoint cycle. We'd have a hard-coded special case that would skip setting the flag for a running-xacts record. I kind of hate to shove even that much extra code in there, but as long as it doesn't mess up what Heikki has in mind maybe it'd be OK... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Could we replace SysV semaphores with latches?
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 07.06.2012 07:09, Tom Lane wrote: It strikes me that we have recently put together an independent but just about equivalent waiting mechanism in the form of latches. And not only that, but there's already a latch for each process. Could we replace our usages of SysV semaphores with WaitLatch on the procLatch? Would have to performance test that carefully. We use semaphores in lwlocks, so it's performance critical. A latch might well be slower, especially on platforms where a signal does not interrupt sleep, and we rely on the signal handler and self-pipe to wake up. By the time you've reached the point where you conclude you have to block, all hope of high performance has gone out the window anyway, so I can't get terribly excited about that. But in general, yeah, our current implementation of latches could very possibly use some more work to improve performance. I think we've latch-ified enough code to make that worth doing already. Although perhaps we could improve the latch implementation. pselect() might perform better than the self-pipe trick, on platforms where it works. AFAIK pselect does not fix the basic race condition: what if somebody else does SetLatch just before you reach the blocking kernel call? You still end up needing a self-pipe. I would be more inclined to look into OS-specific primitives such as futexes on Linux. (No idea whether those actually would be suitable, just pointing out that they exist.) Our semaphore-based API was always both overcomplicated and underspecified, but I think we have got latch semantics nailed down well enough that implementations built on OS-specific primitives could be a reasonable thing. regards, tom lane
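[Editor's note] For readers unfamiliar with the self-pipe trick mentioned above, here is a stripped-down single-process sketch of the idea, with illustrative names rather than the actual latch code: the setter writes a byte to a pipe that the waiter also select()s on, which closes the race Tom describes, since a set that lands between the flag check and the blocking call has already made the pipe readable.

```c
#include <stdbool.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/select.h>

static int selfpipe[2];
static volatile bool latch_is_set = false;

void
latch_init(void)
{
    if (pipe(selfpipe) != 0)
        return;                 /* real code would report failure here */
    fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
    fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
}

/* Callable from a signal handler: write() is async-signal-safe,
 * which is the whole point of the trick. */
void
set_latch(void)
{
    latch_is_set = true;
    (void) write(selfpipe[1], "x", 1);  /* wake any select() sleeper */
}

void
wait_latch(void)
{
    while (!latch_is_set)
    {
        fd_set  rfds;

        FD_ZERO(&rfds);
        FD_SET(selfpipe[0], &rfds);
        /* If set_latch() ran between the flag test above and this call,
         * the pipe is already readable, so select() returns at once
         * instead of sleeping forever: that is the race being closed. */
        (void) select(selfpipe[0] + 1, &rfds, NULL, NULL, NULL);
    }
    latch_is_set = false;
    for (;;)
    {
        char    buf[16];

        if (read(selfpipe[0], buf, sizeof(buf)) <= 0)
            break;              /* pipe drained */
    }
}
```

pselect() atomically swaps the signal mask around the sleep, so it addresses the signal-vs-sleep race, but as Tom notes it does nothing for a SetLatch from another process, which is why the pipe (or an OS primitive like a futex) remains necessary.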
Re: [HACKERS] XLog changes for 9.3
On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote: When I worked on the XLogInsert scaling patch, it became apparent that some changes to the WAL format would make it a lot easier. So for 9.3, I'd like to do some refactoring: 1. Use a 64-bit integer instead of the two-variable log/seg representation, for identifying a WAL segment. This has no user-visible effect, but makes the code a bit simpler. +1 We can define a sensible InvalidXLogRecPtr instead of doing that locally in loads of places! Yipee. 2. Don't waste the last WAL segment in each logical 4GB file. Currently, we skip the WAL segment ending with FF. The comments claim that wasting the last segment ensures that we don't have problems representing last-byte-position-plus-1, but in my experience, it just makes things more complicated. You have two ways to represent the segment boundary, and some functions are picky on which one is used. For example, XLogWrite() assumes that when you want to flush to the end of a logical log file, you use the 5/FF00 representation, not 6/. Other functions, like XLogPageRead(), expect the latter. This is a backwards-incompatible change for external utilities that know how the WAL segment numbering works. Hopefully there aren't too many of those around. +1 3. Move the only field, xl_rem_len, from the continuation record header straight to the xlog page header, eliminating XLogContRecord altogether. This makes it easier to calculate in advance how much space a WAL record requires, as it no longer depends on how many pages it has to be split across. This wastes 4-8 bytes on every xlog page, but that's not much. +1. I don't think this will waste a measurable amount in real-world scenarios. A very big percentage of pages have continuation records. 4. Allow WAL record header to be split across page boundaries.
Currently, if there are less than SizeOfXLogRecord bytes left on the current WAL page, it is wasted, and the next record is inserted at the beginning of the next page. The problem with that is again that it makes it impossible to know in advance exactly how much space a WAL record requires, because it depends on how many bytes need to be wasted at the end of current page. +0.5. It's somewhat convenient to be able to look at a record before you have reassembled it over multiple pages. But it's probably not worth the implementation complexity. If we do that we can remove all the alignment padding as well. Which would be a problem for you anyway, wouldn't it? These changes will help the XLogInsert scaling patch, by making the space calculations simpler. In essence, to reserve space for a WAL record of size X, you just need to do bytepos += X. There's a lot more details with that, like mapping from the contiguous byte position to an XLogRecPtr that takes page headers into account, and noticing RedoRecPtr changes safely, but it's a start. Hm. Wouldn't you need to remove short/long page headers for that as well? Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Thu, Jun 7, 2012 at 9:41 AM, Andres Freund and...@2ndquadrant.com wrote: Well, one, commits are irrelevant; the page ceases to be all-visible as soon as the delete happens. Its not irrelevant for the code as it stands if non-mvcc snapshots were allowed. Unless I miss something, even disregarding memory ordering issues, there is no interlocking providing protection against the index doing the visibilitymap_test() and some concurrent backend doing a heap_delete + commit directly after that. Hmm. Well, the backend that did the heap_delete would clear the visibility map bit when it did the delete, before releasing the lock on the heap buffer. So normally we'll see that buffer as not-all-visible as soon as the delete is performed, even if no commit has happened yet. But I guess there could be a memory-ordering issue there for a SnapshotNow scan, because as you say there's no barrier on the read side in that case. The delete + commit would have to happen after we finish fetching the index TID but before we test the visibility map bit. So I suppose that if we wanted to support index-only scans under SnapshotNow maybe we'd want to lock and unlock the visibility map page when testing each bit. But that almost seems like overkill anyway: for it to matter, someone would have to be relying on starting a scan, deleting a tuple in another transaction while the scan was in progress, and having the scan see the delete when it reached the deleted tuple. But of course if the scan had take a microsecond longer to reach the deleted tuple it wouldn't have seen it as deleted anyway, so it's a bit hard to see how any such code could be race-free anyhow. Proposed patch attached. This adds some more comments in various places, and implements your suggestion of retesting the visibility-map bit when we detect a possible mismatch with the page-level bit. Thanks, will look at it in a bit. Thanks. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] XLog changes for 9.3
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: When I worked on the XLogInsert scaling patch, it became apparent that some changes to the WAL format would make it a lot easier. So for 9.3, I'd like to do some refactoring: 1. Use a 64-bit integer instead of the two-variable log/seg representation, for identifying a WAL segment. This has no user-visible effect, but makes the code a bit simpler. 2. Don't waste the last WAL segment in each logical 4GB file. Currently, we skip the WAL segment ending with FF. The comments claim that wasting the last segment ensures that we don't have problems representing last-byte-position-plus-1, but in my experience, it just makes things more complicated. I think that's actually an indivisible part of point #1. The issue in the 32+32 representation is that you'd overflow the low-order half when trying to represent last-byte-of-file-plus-1, and have to do something with propagating that to the high half. In a 64-bit continuous addressing scheme the problem goes away, and it would just get more complicated not less to preserve the hole. regards, tom lane
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Thursday, June 07, 2012 04:27:32 PM Robert Haas wrote: On Thu, Jun 7, 2012 at 9:41 AM, Andres Freund and...@2ndquadrant.com wrote: Proposed patch attached. This adds some more comments in various places, and implements your suggestion of retesting the visibility-map bit when we detect a possible mismatch with the page-level bit. Thanks, will look at it in a bit. I wonder if

    /* mark page all-visible, if appropriate */
    if (all_visible && !PageIsAllVisible(page))
    {
        PageSetAllVisible(page);
        MarkBufferDirty(buf);
        visibilitymap_set(onerel, blkno, InvalidXLogRecPtr, vmbuffer,
                          visibility_cutoff_xid);
    }

shouldn't test

    if (all_visible && (!PageIsAllVisible(page) || !all_visible_according_to_vm))

instead. Commentwise I am not totally content with the emphasis on memory ordering, because some of the stuff is more locking than memory ordering. That said, I think it's a pretty clear improvement. I can reformulate the places where I find that relevant, but I have the feeling that wouldn't help the legibility. It's the big comment in vacuumlazy.c, the comment in nodeIndexonly.c and the one in the header of visibilitymap_test. Should be s/memory-ordering/concurrency/ except in nodeIndexonlyscan.c. The visibilitymap_clear/PageClearAllVisible in heap_multi_insert should be moved into the critical section, shouldn't it? Regards, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Could we replace SysV semaphores with latches?
On Thu, Jun 7, 2012 at 10:13 AM, Tom Lane t...@sss.pgh.pa.us wrote: I would be more inclined to look into OS-specific primitives such as futexes on Linux. (No idea whether those actually would be suitable, just pointing out that they exist.) Our semaphore-based API was always both overcomplicated and underspecified, but I think we have got latch semantics nailed down well enough that implementations built on OS-specific primitives could be a reasonable thing. I've been thinking about trying to futex-ify our spinlock implementation, so that when we detect that the spinlock is contended (or contended sufficiently badly?) we go into a long kernel sleep (e.g. 10 s) and wait to be woken up. This might perform better than our current implementation in cases where the spinlock is badly contended, since it would avoid yanking the cache line around between all the CPUs on the box. But I haven't yet, because (1) it'll only work on Linux and (2) it's better to fix the problem that is causing the contention rather than make the contention less expensive. Still, it might be worth looking into. I'm not sure whether there's a sensible way to use this for LWLocks directly. It would be nice not to be doing the lock-within-a-lock thing, but I don't know that I really want to maintain a completely separate LWLock implementation just for Linux, and it's not obvious how you're supposed to use a futex to implement a lock with more than one lock mode. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
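[Editor's note] For the record, a bare-bones futex-backed lock looks something like the sketch below (Linux-only, C11 atomics; the names FutexLock/futex_lock are invented). It deliberately omits the waiter-count refinement a production lock needs so the uncontended release can skip the FUTEX_WAKE syscall, and it says nothing about the multiple-lock-modes problem Robert raises:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

typedef struct
{
    atomic_int  state;          /* 0 = free, 1 = taken */
} FutexLock;

static long
futex_call(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void
futex_lock(FutexLock *l)
{
    int expected = 0;

    while (!atomic_compare_exchange_weak(&l->state, &expected, 1))
    {
        /* Contended: sleep in the kernel until woken, rather than
         * bouncing the cache line between CPUs in a spin loop.  The
         * kernel rechecks that state is still 1 before sleeping. */
        futex_call(&l->state, FUTEX_WAIT, 1);
        expected = 0;
    }
}

void
futex_unlock(FutexLock *l)
{
    atomic_store(&l->state, 0);
    /* Wake at most one sleeper.  A serious lock tracks whether any
     * waiters exist so the uncontended path can skip this syscall. */
    futex_call(&l->state, FUTEX_WAKE, 1);
}
```

The uncontended acquire is a single compare-and-swap with no kernel entry, which is what makes the approach attractive relative to semaphores.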
Re: [HACKERS] XLog changes for 9.3
On 07.06.2012 17:18, Andres Freund wrote: On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote: 3. Move the only field, xl_rem_len, from the continuation record header straight to the xlog page header, eliminating XLogContRecord altogether. This makes it easier to calculate in advance how much space a WAL record requires, as it no longer depends on how many pages it has to be split across. This wastes 4-8 bytes on every xlog page, but that's not much. +1. I don't think this will waste a measurable amount in real-world scenarios. A very big percentage of pages have continuation records. Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on 64-bit architectures) even when there is a continuation record, because of alignment:

    typedef struct XLogPageHeaderData
    {
        uint16      xlp_magic;      /* magic value for correctness checks */
        uint16      xlp_info;       /* flag bits, see below */
        TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
        XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
    +   uint32      xlp_rem_len;    /* bytes remaining of continued record */
    } XLogPageHeaderData;

The page header is currently 16 bytes in length, so adding a 4-byte field to it bumps the aligned size to 24 bytes. Nevertheless, I think we can well live with that. 4. Allow WAL record header to be split across page boundaries. Currently, if there are less than SizeOfXLogRecord bytes left on the current WAL page, it is wasted, and the next record is inserted at the beginning of the next page. The problem with that is again that it makes it impossible to know in advance exactly how much space a WAL record requires, because it depends on how many bytes need to be wasted at the end of current page. +0.5. It's somewhat convenient to be able to look at a record before you have reassembled it over multiple pages. But it's probably not worth the implementation complexity.
Looking at the code, I think it'll be about the same complexity for XLogInsert in its current form (it will help the patch I'm working on), and makes ReadRecord() a bit more complicated. But not much. If we do that we can remove all the alignment padding as well. Which would be a problem for you anyway, wouldn't it? It's not a problem. You just MAXALIGN the size of the record when you calculate how much space it needs, and then all records become naturally MAXALIGNed. We could quite easily remove the alignment on-disk if we wanted to, ReadRecord() already always copies the record to an aligned buffer, but I wasn't planning to do that. These changes will help the XLogInsert scaling patch, by making the space calculations simpler. In essence, to reserve space for a WAL record of size X, you just need to do bytepos += X. There's a lot more details with that, like mapping from the contiguous byte position to an XLogRecPtr that takes page headers into account, and noticing RedoRecPtr changes safely, but it's a start. Hm. Wouldn't you need to remove short/long page headers for that as well? No, those are ok because they're predictable. Although it would make the mapping simpler. To convert from a contiguous xlog byte position that excludes all headers, to XLogRecPtr, you need to do something like this (I just made this up, probably has bugs, but it's about this complex):

    #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
    #define UsableBytesInSegment \
        ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
         (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

    uint64 xlogrecptr;
    uint64 full_segments = bytepos / UsableBytesInSegment;
    int offset_in_segment = bytepos % UsableBytesInSegment;

    xlogrecptr = full_segments * XLOG_SEG_SIZE;

    /* is it on the first page? */
    if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
        xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
    else
    {
        /* first page is fully used */
        xlogrecptr += XLOG_BLCKSZ;
        /* add other full pages */
        offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
        xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
        /* and finally offset within the last page */
        xlogrecptr += offset_in_segment % UsableBytesInPage;
    }

    /* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
    XLogRecPtr.xlogid = xlogrecptr >> 32;
    XLogRecPtr.xrecoff = xlogrecptr & 0xFFFFFFFF;

Encapsulated in a function, that's not too bad. But if we want to make that simpler, one idea would be to allocate the whole 1st page in each WAL segment for metadata. That way all the actual xlog pages would hold the same amount of xlog data. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Ability to listen on two unix sockets
On 06/06/2012 04:50 PM, Andres Freund wrote: On Wednesday, June 06, 2012 04:38:42 PM Tom Lane wrote: You might think we should design this exactly like the TCP-socket multiple-listen-addresses case, ie just have a config variable containing a list of directory names. The sticking point there is that the directories aren't really interchangeable. In particular, there is still going to be one directory that is the one hard-wired into libpq. I wonder if the whole issue doesn't require libpq to also try multiple hardcoded socket locations. I guess so. Let's say we add additional socket support and some server uses one e.g. at /var/run/postgresql. Then a client can either (1) specify the same path explicitly and then we don't need to specify any additional sockets on the client side or (2) stick to the default path, which is hard-coded, currently to /tmp. Going back to the original problem (inaccessible /tmp directory), it is the case (2) -- a client uses the default path. So any additional client-side socket option won't probably help here, but we would probably need a second hard-coded path e.g. at /var/run/postgresql. Regards, Honza
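[Editor's note] On the client side, case (1) above is already possible: libpq treats a host value beginning with a slash as a Unix-socket directory. The directory shown below matches the example in the message and is not a compiled-in default:

```shell
# Point libpq at a specific socket directory instead of the hard-coded /tmp
psql "host=/var/run/postgresql dbname=postgres"

# Equivalent, via the environment
PGHOST=/var/run/postgresql psql postgres
```

Which is exactly why only case (2), the client relying on the built-in default, actually needs a second hard-coded path.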
Re: [HACKERS] Avoiding adjacent checkpoint records
Robert Haas robertmh...@gmail.com writes: On Wed, Jun 6, 2012 at 6:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: If we don't like that, I can think of a couple of other ways to get there, but they have their own downsides: * Instead of trying to detect after-the-fact whether any concurrent WAL activity happened during the last checkpoint, we could detect it during the checkpoint and then keep the info in a static variable in the checkpointer process until next time. However, I don't see any bulletproof way to do this without adding at least one or two lines of code within XLogInsert, which I'm sure Robert will complain about. My main concern here is to avoid doing anything that will make things harder for Heikki's WAL insert scaling patch, which I'm hoping will get done for 9.3. Yeah, I'm not very happy either with adding new requirements to XLogInsert, even if it is just tracking one more bit of information. What do you have in mind, exactly? I feel like ProcLastRecPtr might be enough information. After logging running xacts, we can check whether ProcLastRecPtr is equal to the redo pointer. If so, then nothing got written to WAL between the time we began the checkpoint and the time we wrote that record. If, through further, similar gyrations, we can then work out whether the checkpoint record immediately follows the running-xacts record, we're there. That's pretty ugly, I guess, but it seems possible. It's fairly messy because of the issues around holes being left at page ends etc --- so the fact that ProcLastRecPtr is different from the previous insert pointer doesn't immediately prove whether something else got inserted first, or we just had to leave some dead space. Heikki mentioned this morning that he'd like to remove some of those rules in 9.3, but that doesn't help us today. 
Note also that a closer look at LogStandbySnapshot shows it emits more than one WAL record, meaning we'd have to do this dance more than once, and the changes to do that would be pretty deadly to the modularity of the functions LogStandbySnapshot calls. The conclusion I'd come to yesterday was that we'd want XLogInsert to do something like

    if (Insert->PrevRecord is different from ProcLastRecPtr)
        SomebodyElseWroteWAL = true;

where SomebodyElseWroteWAL is a process-local boolean that we reset at the start of a checkpoint sequence, and then check after we've written out the LogStandbySnapshot records and the checkpoint record. (We'd also have to hack ProcLastRecPtr by initializing it to Insert->PrevRecord at the time we reset SomebodyElseWroteWAL, which is sort of ugly in that it messes up the relatively clean definition of that variable.) So that's not exactly a lot of new code in the critical section, but it's still new code. In the end I think I like the last idea I mentioned (keeping two different REDO values during a checkpoint) the best. It's a bit complicated but the grottiness is localized in CreateCheckpoint. Or at least I think it will be, without having written a patch yet. regards, tom lane
Re: [HACKERS] XLog changes for 9.3
On 7 June 2012 14:50, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: These changes will help the XLogInsert scaling patch ...and as I'm sure you're aware will junk much of the replication code and almost certainly set back the other work that we have brewing for 9.3. So this is a very large curve ball you're throwing there. Personally, I don't think we should do this until we have a better regression test suite around replication and recovery because the impact will be huge but I welcome the suggested changes themselves. If you are going to do this in 9.3, then it has to be early in the first Commit Fest and you'll need to be around to quickly follow through on all of the other subsequent breakages it will cause, otherwise every other piece of work in this area will be halted or delayed. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Ability to listen on two unix sockets
Honza Horak hho...@redhat.com writes: On 06/06/2012 04:50 PM, Andres Freund wrote: I wonder if the whole issue doesn't require libpq to also try multiple hardcoded socket locations. I guess so. I don't really want to go there. Some use cases have been shown in this thread for having a server listen in multiple places, but that does not translate to saying that clients need to support automatically looking in multiple places. I think that mainly introduces questions we could do without, like which server did you actually end up contacting. regards, tom lane
Re: [HACKERS] XLog changes for 9.3
Hi, On Thursday, June 07, 2012 05:51:07 PM Simon Riggs wrote: On 7 June 2012 14:50, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: These changes will help the XLogInsert scaling patch ...and as I'm sure you're aware will junk much of the replication code and almost certainly set back the other work that we have brewing for 9.3. So this is a very large curve ball you're throwing there. It's not that bad. Most of that code is pretty abstracted, the changes to adapt to that should be less than 20 lines. And it would remove some of the complexity. Personally, I don't think we should do this until we have a better regression test suite around replication and recovery because the impact will be huge but I welcome the suggested changes themselves. Hm. One could regard the logical rep stuff as a testsuite ;) If you are going to do this in 9.3, then it has to be early in the first Commit Fest and you'll need to be around to quickly follow through on all of the other subsequent breakages it will cause, otherwise every other piece of work in this area will be halted or delayed. Yea, I would definitely welcome an early patch. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Ability to listen on two unix sockets
On Thursday, June 07, 2012 05:55:11 PM Tom Lane wrote: Honza Horak hho...@redhat.com writes: On 06/06/2012 04:50 PM, Andres Freund wrote: I wonder if the whole issue doesn't require libpq to also try multiple hardcoded socket locations. I guess so. I don't really want to go there. Some use cases have been shown in this thread for having a server listen in multiple places, but that does not translate to saying that clients need to support automatically looking in multiple places. I think that mainly introduces questions we could do without, like which server did you actually end up contacting. It would be really nice to have a development psql connect to a distro-installed psql and vice versa without having to specify -h /var/run/psql and -h /tmp all the time... Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Avoiding adjacent checkpoint records
On 7 June 2012 14:59, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Tom Lane t...@sss.pgh.pa.us wrote: there is no guarantee that we'll manage to reach a database state that is consistent with data already flushed out to disk during the last checkpoint. Robert Haas robertmh...@gmail.com wrote: I know of real customers who would have suffered real data loss had this code been present in the server version they were using. Checkpoints are the *only* mechanism by which SLRU pages get flushed to disk on a mostly-idle system. That means if something happens to your pg_xlog directory, and you haven't had a checkpoint, you're screwed. If that is the concern, then it's a one-line fix to add the missing clog flush. The other suggestions I've skim-read seem fairly invasive at this stage of the release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] XLog changes for 9.3
On Thu, Jun 7, 2012 at 5:56 PM, Andres Freund and...@2ndquadrant.com wrote: Hi, On Thursday, June 07, 2012 05:51:07 PM Simon Riggs wrote: On 7 June 2012 14:50, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: These changes will help the XLogInsert scaling patch ...and as I'm sure you're aware will junk much of the replication code and almost certainly set back the other work that we have brewing for 9.3. So this is a very large curve ball you're throwing there. It's not that bad. Most of that code is pretty abstracted, the changes to adapt to that should be less than 20 lines. And it would remove some of the complexity. Personally, I don't think we should do this until we have a better regression test suite around replication and recovery because the impact will be huge but I welcome the suggested changes themselves. Hm. One could regard the logical rep stuff as a testsuite ;) If you are going to do this in 9.3, then it has to be early in the first Commit Fest and you'll need to be around to quickly follow through on all of the other subsequent breakages it will cause, otherwise every other piece of work in this area will be halted or delayed. Yea, I would definitely welcome an early patch. Just as I'm sure everybody else would welcome *your* patches landing in the first commitfest and that you all guarantee to be around quickly follow through on all potential breakages *that* can cause. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On Thursday, June 07, 2012 06:02:12 PM Magnus Hagander wrote: On Thu, Jun 7, 2012 at 5:56 PM, Andres Freund and...@2ndquadrant.com wrote: Hi, On Thursday, June 07, 2012 05:51:07 PM Simon Riggs wrote: On 7 June 2012 14:50, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: These changes will help the XLogInsert scaling patch ...and as I'm sure you're aware will junk much of the replication code and almost certainly set back the other work that we have brewing for 9.3. So this is a very large curve ball you're throwing there. It's not that bad. Most of that code is pretty abstracted, the changes to adapt to that should be less than 20 lines. And it would remove some of the complexity. Personally, I don't think we should do this until we have a better regression test suite around replication and recovery because the impact will be huge but I welcome the suggested changes themselves. Hm. One could regard the logical rep stuff as a testsuite ;) If you are going to do this in 9.3, then it has to be early in the first Commit Fest and you'll need to be around to quickly follow through on all of the other subsequent breakages it will cause, otherwise every other piece of work in this area will be halted or delayed. Yea, I would definitely welcome an early patch. Just as I'm sure everybody else would welcome *your* patches landing in the first commitfest and that you all guarantee to be around quickly follow through on all potential breakages *that* can cause. Agreed. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Thu, Jun 7, 2012 at 11:04 AM, Andres Freund and...@2ndquadrant.com wrote: On Thursday, June 07, 2012 04:27:32 PM Robert Haas wrote: On Thu, Jun 7, 2012 at 9:41 AM, Andres Freund and...@2ndquadrant.com wrote: Proposed patch attached. This adds some more comments in various places, and implements your suggestion of retesting the visibility-map bit when we detect a possible mismatch with the page-level bit. Thanks, will look at it in a bit. I wonder if

    /* mark page all-visible, if appropriate */
    if (all_visible && !PageIsAllVisible(page))
    {
        PageSetAllVisible(page);
        MarkBufferDirty(buf);
        visibilitymap_set(onerel, blkno, InvalidXLogRecPtr, vmbuffer,
                          visibility_cutoff_xid);
    }

shouldn't test

    if (all_visible && (!PageIsAllVisible(page) || !all_visible_according_to_vm))

instead. Hmm, I think you're right. Commentwise I am not totally content with the emphasis on memory ordering because some of the stuff is more locking than memory ordering. Except that I think it's a pretty clear improvement. I can reformulate the places where I find that relevant but I have the feeling that wouldn't help the legibility. It's the big comment in vacuumlazy.c, the comment in nodeIndexonly.c and the one in the header of visibilitymap_test. Should be s/memory-ordering/concurrency/ except in nodeIndexonlyscan.c Hmm, I see your point. The visibilitymap_clear/PageClearAllVisible in heap_multi_insert should be moved into the critical section, shouldn't it? Yes, it should. I was thinking maybe we could go the other way and have heap_insert do it before starting the critical section, but that's no good: clearing the visibility map bit is part of the critical data change, and we can't do it and then forget to WAL-log it. Updated patch attached.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company vm-test-cleanup-v2.patch Description: Binary data
Re: [HACKERS] XLog changes for 9.3
On 07.06.2012 18:51, Simon Riggs wrote: On 7 June 2012 14:50, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: These changes will help the XLogInsert scaling patch ...and as I'm sure you're aware will junk much of the replication code and almost certainly set back the other work that we have brewing for 9.3. So this is a very large curve ball you're throwing there. I don't think this has much impact on what you're doing (although it's a bit hard to tell without more details). The way WAL records work is the same, it's just the code that lays them out on a page, and reads back from a page, that's changed. And that's fairly isolated in xlog.c. If you are going to do this in 9.3, then it has to be early in the first Commit Fest and you'll need to be around to quickly follow through on all of the other subsequent breakages it will cause, otherwise every other piece of work in this area will be halted or delayed. Yeah, the plan is to get this in early, in the first commit fest. Not only because of possible breakage, but also because my ultimate goal is the XLogInsert refactoring, and I want do that early in the release cycle, too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Ability to listen on two unix sockets
On Thu, Jun 7, 2012 at 11:57 AM, Andres Freund and...@2ndquadrant.com wrote: On Thursday, June 07, 2012 05:55:11 PM Tom Lane wrote: Honza Horak hho...@redhat.com writes: On 06/06/2012 04:50 PM, Andres Freund wrote: I wonder if the whole issue doesn't require libpq to also try multiple hardcoded socket locations. I guess so. I don't really want to go there. Some use cases have been shown in this thread for having a server listen in multiple places, but that does not translate to saying that clients need to support automatically looking in multiple places. I think that mainly introduces questions we could do without, like which server did you actually end up contacting. It would be really nice to have a development psql connect to a distro installed psql and vice versa without having to specify -h /var/run/psql and - h /tmp all the time... This is true, but you have this problem already. It might be worth fixing, but it seems like a separate issue from the topic of this thread, which is where the server listens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Thursday, June 07, 2012 06:08:34 PM Robert Haas wrote: On Thu, Jun 7, 2012 at 11:04 AM, Andres Freund and...@2ndquadrant.com wrote: On Thursday, June 07, 2012 04:27:32 PM Robert Haas wrote: On Thu, Jun 7, 2012 at 9:41 AM, Andres Freund and...@2ndquadrant.com wrote: Proposed patch attached. This adds some more comments in various places, and implements your suggestion of retesting the visibility-map bit when we detect a possible mismatch with the page-level bit. Thanks, will look at it in a bit. The visibilitymap_clear/PageClearAllVisible in heap_multi_insert should be moved into the critical section, shouldn't it? Yes, it should. I was thinking maybe we could go the other way and have heap_insert do it before starting the critical section, but that's no good: clearing the visibility map bit is part of the critical data change, and we can't do it and then forget to WAL-log it. You could do a visibilitymap_pin outside, but I don't really see the point as the page is already locked. There might be some slight benefit in doing so in multi_insert but that would be more complicated. And of doubtful benefit. Updated patch attached. Looks good. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Ability to listen on two unix sockets
Andres Freund and...@2ndquadrant.com writes: On Thursday, June 07, 2012 05:55:11 PM Tom Lane wrote: Honza Horak hho...@redhat.com writes: On 06/06/2012 04:50 PM, Andres Freund wrote: I wonder if the whole issue doesn't require libpq to also try multiple hardcoded socket locations. I don't really want to go there. It would be really nice to have a development psql connect to a distro installed psql and vice versa without having to specify -h /var/run/psql and - h /tmp all the time... I don't find that nice at all. Which server did you actually connect to? How do you control it? You're going to end up needing the -h switch anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On Thu, Jun 7, 2012 at 11:51 AM, Simon Riggs si...@2ndquadrant.com wrote: So this is a very large curve ball you're throwing there. This is not exactly unexpected. At least the first two of these items were previously discussed in the context of the XLOG scaling patch, many months ago. It shouldn't come as a surprise to anyone that Heikki is planning to continue to work on that patch even though it didn't make 9.2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Ability to listen on two unix sockets
On Thursday, June 07, 2012 06:20:32 PM Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On Thursday, June 07, 2012 05:55:11 PM Tom Lane wrote: Honza Horak hho...@redhat.com writes: On 06/06/2012 04:50 PM, Andres Freund wrote: I wonder if the whole issue doesn't require libpq to also try multiple hardcoded socket locations. I don't really want to go there. It would be really nice to have a development psql connect to a distro-installed psql and vice versa without having to specify -h /var/run/psql and -h /tmp all the time... I don't find that nice at all. Which server did you actually connect to? How do you control it? You're going to end up needing the -h switch anyway. They can't run on the same port anyway unless you disable listening on localhost. Changing a single port number is far less effort than typing -h /var/run/postgresql ;) Anyway, I am not wed to this, and I don't plan to put work into it so I better shut up ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Thu, Jun 7, 2012 at 12:19 PM, Andres Freund and...@2ndquadrant.com wrote: You could do a visibilitymap_pin outside, but I don't really see the point as the page is already locked. There might be some slight benefit in doing so in multi_insert but that would be more complicated. And of doubtful benefit. Doesn't RelationGetBufferForTuple() already do exactly that? Updated patch attached. Looks good. Thanks for the review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
On 7 June 2012 14:56, Tom Lane t...@sss.pgh.pa.us wrote: Sergey Koposov kopo...@ast.cam.ac.uk writes: On Tue, 5 Jun 2012, Simon Riggs wrote: Sounds less good and we'd need reasonable proof it actually did anything useful without being dangerous. Doing an initial unlocked test speeds things up another 2.69 fold (on top of 3.55 for your patch) for me, with 1GB of shared buffers. That seems like it should be worthwhile. How do we go about getting reasonable proof that it is safe? That's enough for me. Say what? That's a performance result and proves not a damn thing about safety. Of course not. Based on the rationale explained in the code comments in the patch, it seems like a reasonable thing to me now. The argument was that since we hold AccessExclusiveLock on the relation, no other agent can be reading in new parts of the table into new buffers, so the only change to a buffer would be away from the dropping relation, in which case we wouldn't care. Which seems correct to me. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoiding adjacent checkpoint records
Simon Riggs si...@2ndquadrant.com writes: Robert Haas robertmh...@gmail.com wrote: I know of real customers who would have suffered real data loss had this code been present in the server version they were using. If that is the concern, then its a one line fix to add the missing clog flush. To where, and what performance impact will that have? The other suggestions I've skim read seem fairly invasive at this stage of the release. The issue here is that we committed a not-very-well-thought-out fix to the original problem. I think a reasonable argument could be made for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724 and postponing any of these other ideas to 9.3. The useless-checkpoints problem has been there since 9.0 and can surely wait another release cycle to get fixed. But I concur with Robert that changing the system behavior so that checkpointing of committed changes might be delayed indefinitely is a high-risk choice. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Thursday, June 07, 2012 06:23:58 PM Robert Haas wrote: On Thu, Jun 7, 2012 at 12:19 PM, Andres Freund and...@2ndquadrant.com wrote: You could do a visibilitymap_pin outside, but I don't really see the point as the page is already locked. There might be some slight benefit in doing so in multi_insert but that would be more complicated. And of doubtful benefit. Doesn't RelationGetBufferForTuple() already do exactly that? Forget it, sorry. I mis(read|remembered) how multi_insert works and concluded from that on how insert works... Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
Simon Riggs si...@2ndquadrant.com writes: On 7 June 2012 14:56, Tom Lane t...@sss.pgh.pa.us wrote: Say what? That's a performance result and proves not a damn thing about safety. Of course not. Based on the rationale explained in the code comments in the patch, it seems like a reasonable thing to me now. The argument was that since we hold AccessExclusiveLock on the relation, no other agent can be reading in new parts of the table into new buffers, so the only change to a buffer would be away from the dropping relation, in which case we wouldn't care. Which seems correct to me. Oh, I must be confused about which patch we are talking about --- I thought this was in reference to some of the WIP ideas that were being thrown about with respect to using lock-free access primitives. Which patch are you proposing for commit now, exactly? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On Thursday, June 07, 2012 05:35:11 PM Heikki Linnakangas wrote: On 07.06.2012 17:18, Andres Freund wrote: On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote: 3. Move the only field, xl_rem_len, from the continuation record header straight to the xlog page header, eliminating XLogContRecord altogether. This makes it easier to calculate in advance how much space a WAL record requires, as it no longer depends on how many pages it has to be split across. This wastes 4-8 bytes on every xlog page, but that's not much. +1. I don't think this will waste a measurable amount in real-world scenarios. A very big percentage of pages have continuation records. Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on 64-bit architectures) even when there is a continuation record, because of alignment:

typedef struct XLogPageHeaderData
{
    uint16      xlp_magic;      /* magic value for correctness checks */
    uint16      xlp_info;       /* flag bits, see below */
    TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
    XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
+   uint32      xlp_rem_len;    /* bytes remaining of continued record */
} XLogPageHeaderData;

The page header is currently 16 bytes in length, so adding a 4-byte field to it bumps the aligned size to 24 bytes. Nevertheless, I think we can well live with that. At that point we can just do the #define SizeofXLogPageHeaderData (offsetof(XLogPageHeaderData, xlp_pageaddr) + sizeof(uint32)) dance. If the record can be smeared over two pages there is no point in storing it aligned. Then we don't waste any additional space in comparison to the current state. If we do that we can remove all the alignment padding as well. Which would be a problem for you anyway, wouldn't it? It's not a problem. You just MAXALIGN the size of the record when you calculate how much space it needs, and then all records become naturally MAXALIGNed.
We could quite easily remove the alignment on-disk if we wanted to, ReadRecord() already always copies the record to an aligned buffer, but I wasn't planning to do that. What's the reasoning for having alignment on disk if the records aren't stored contiguously? These changes will help the XLogInsert scaling patch, by making the space calculations simpler. In essence, to reserve space for a WAL record of size X, you just need to do bytepos += X. There are a lot more details with that, like mapping from the contiguous byte position to an XLogRecPtr that takes page headers into account, and noticing RedoRecPtr changes safely, but it's a start. Hm. Wouldn't you need to remove short/long page headers for that as well? No, those are ok because they're predictable. I haven't read your scalability patch, so I am not really sure what you need... The bytepos += X from above isn't as easy that way. But yes, it's not that complicated. Although it would make the mapping simpler. To convert from a contiguous xlog byte position that excludes all headers, to XLogRecPtr, you need to do something like this (I just made this up, probably has bugs, but it's about this complex):

#define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment \
    ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
     (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

uint64 xlogrecptr;
uint64 full_segments = bytepos / UsableBytesInSegment;
int offset_in_segment = bytepos % UsableBytesInSegment;

xlogrecptr = full_segments * XLOG_SEG_SIZE;

/* is it on the first page?
*/
if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
else
{
    /* first page is fully used */
    xlogrecptr += XLOG_BLCKSZ;
    /* add other full pages */
    offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
    xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
    /* and finally offset within the last page */
    xlogrecptr += offset_in_segment % UsableBytesInPage;
}
/* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
XLogRecPtr.xlogid = xlogrecptr >> 32;
XLogRecPtr.xrecoff = xlogrecptr & 0xFFFFFFFF;

It's a bit more complicated than that, records can span a good bit more than just two pages (even more than two segments) and you need to decide for each of those whether it has a long or a short header. Encapsulated in a function, that's not too bad. But if we want to make that simpler, one idea would be to allocate the whole 1st page in each WAL segment for metadata. That way all the actual xlog pages would hold the same amount of xlog data. It's a bit easier then, but you probably still need to loop over the size and subtract until you reach the final point. It's no problem to produce a 100MB WAL record. But then that's probably nothing to design for. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL
Re: [HACKERS] page is not marked all-visible warning in regression tests
Robert Haas robertmh...@gmail.com writes: Updated patch attached. The comments need a pass of copy-editing, eg here and here:

+ * so somebody else could be change the bit just after we look at it. In fact,
                            ^^^
+ * got cleared after we checked it and before we got took the buffer

Seems reasonable beyond that. regards, tom lane
Re: [HACKERS] page is not marked all-visible warning in regression tests
On Thu, Jun 7, 2012 at 12:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Updated patch attached. The comments need a pass of copy-editing, eg here and here: + * so somebody else could be change the bit just after we look at it. In fact, ^^^ + * got cleared after we checked it and before we got took the buffer Seems reasonable beyond that. Thanks, committed with those fixes and a correction for one more typo I spotted elsewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoiding adjacent checkpoint records
On 7 June 2012 17:27, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: Robert Haas robertmh...@gmail.com wrote: I know of real customers who would have suffered real data loss had this code been present in the server version they were using. If that is the concern, then it's a one-line fix to add the missing clog flush. To where, and what performance impact will that have? To the point where we decide to skip the other parts of the checkpoint. How would that cause a performance impact exactly? It's less work than the original behaviour would be. The other suggestions I've skim-read seem fairly invasive at this stage of the release. The issue here is that we committed a not-very-well-thought-out fix to the original problem. I think a reasonable argument could be made for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724 and postponing any of these other ideas to 9.3. The useless-checkpoints problem has been there since 9.0 and can surely wait another release cycle to get fixed. But I concur with Robert that changing the system behavior so that checkpointing of committed changes might be delayed indefinitely is a high-risk choice. Clearly, delaying checkpoint indefinitely would be a high risk choice. But they won't be delayed indefinitely, since changes cause WAL records to be written and that would soon cause another checkpoint. Robert has shown a bug exists, so it should be fixed directly, especially if we believe in least invasive changes. If not-fixing-the-actual-bug happened before, it's happening again here as well, with some poor-sounding logic to justify it. Please act as you see fit. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
Andres Freund and...@2ndquadrant.com writes: dance. If the record can be smeared over two pages there is no point in storing it aligned. I think this is not true. The value of requiring alignment is that you can read the record-length field without first having to copy it somewhere. In particular, it will get really ugly if the record length field itself could cross a page boundary. I think we want to be able to determine the record length before we do any data copying, so that we can malloc the record buffer and then just do one copy step. The real reason for the current behavior of not letting the record header get split across multiple pages is so that the length field is guaranteed to be in the first page. We can still guarantee that if we (1) put the length field first and (2) require at least int32 alignment. I think losing that property will be pretty bad though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
On 7 June 2012 17:34, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 7 June 2012 14:56, Tom Lane t...@sss.pgh.pa.us wrote: Say what? That's a performance result and proves not a damn thing about safety. Of course not. Based on the rationale explained in the code comments in the patch, it seems like a reasonable thing to me now. The argument was that since we hold AccessExclusiveLock on the relation, no other agent can be reading in new parts of the table into new buffers, so the only change to a buffer would be away from the dropping relation, in which case we wouldn't care. Which seems correct to me. Oh, I must be confused about which patch we are talking about --- I thought this was in reference to some of the WIP ideas that were being thrown about with respect to using lock-free access primitives. Which patch are you proposing for commit now, exactly? Both of these, as attached up thread. Simon's patch - dropallforks.v1.patch Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch (needs a little tidyup) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On Thursday, June 07, 2012 06:53:58 PM Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: dance. If the record can be smeared over two pages there is no point in storing it aligned. I think this is not true. The value of requiring alignment is that you can read the record-length field without first having to copy it somewhere. In particular, it will get really ugly if the record length field itself could cross a page boundary. I think we want to be able to determine the record length before we do any data copying, so that we can malloc the record buffer and then just do one copy step. Hm, I had assumed the record would get copied into a temp/static buffer first and only get reassembled together with the data afterwards. But if that's not the way to go, sure, storing it aligned so that the length can always be read aligned within a page is sensible. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoiding adjacent checkpoint records
On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote: Clearly, delaying checkpoint indefinitely would be a high risk choice. But they won't be delayed indefinitely, since changes cause WAL records to be written and that would soon cause another checkpoint. But that's exactly the problem - it might not be soon at all. We have customers who process about one write transaction per day. The current state of play in 9.2 is that we'll checkpoint when we get to the next WAL segment. But at one write transaction per day, that could take weeks or months. Someone invented a setting called 'checkpoint_timeout' for a very good reason - people don't want huge amounts of time to pass between checkpoints. That setting is currently capped at one hour; the proposition that someone who sets it to 5 minutes on a low-write-volume system is OK with the effective value being 5 months does not seem likely to me. Your suggestion of just checkpointing CLOG seems like it would mitigate the worst of the damage, but I think I agree with Tom that just reverting the whole patch would be a better solution, if we can't figure out anything cleaner. It's better to have a few unnecessary checkpoints than to risk losing somebody's data, especially since the unnecessary checkpoints only happen with wal_level=hot_standby, but the data loss risk exists for everyone. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On 7 June 2012 17:12, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 07.06.2012 18:51, Simon Riggs wrote: On 7 June 2012 14:50, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: These changes will help the XLogInsert scaling patch ...and as I'm sure you're aware will junk much of the replication code and almost certainly set back the other work that we have brewing for 9.3. So this is a very large curve ball you're throwing there. I don't think this has much impact on what you're doing (although it's a bit hard to tell without more details). The way WAL records work is the same, it's just the code that lays them out on a page, and reads back from a page, that's changed. And that's fairly isolated in xlog.c. I wasn't worried about the code overlap, but the subsidiary breakage looks pretty enormous to me. Anything changing filenames will break every HA config anybody has anywhere. So you can pretty much kiss goodbye to the idea of pg_upgrade. For me, this one thing alone is sufficient to force next release to be 10.0. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoiding adjacent checkpoint records
Simon Riggs si...@2ndquadrant.com writes: On 7 June 2012 17:27, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: If that is the concern, then its a one line fix to add the missing clog flush. To where, and what performance impact will that have? To the point where we decide to skip the other parts of the checkpoint. How would that cause a performance impact exactly? It's less work than the original behaviour would be. It's not clear to me exactly which parts of the checkpoint process would need to be duplicated here to be safe. What you're proposing is basically a partial checkpoint, which would need quite a lot of thought to be sure it did everything that should be done and nothing that shouldn't be. And it would be a continuing risk spot for future bugs of omission. The issue here is that we committed a not-very-well-thought-out fix to the original problem. I think a reasonable argument could be made for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724 and postponing any of these other ideas to 9.3. The useless-checkpoints problem has been there since 9.0 and can surely wait another release cycle to get fixed. But I concur with Robert that changing the system behavior so that checkpointing of committed changes might be delayed indefinitely is a high-risk choice. Clearly, delaying checkpoint indefinitely would be a high risk choice. But they won't be delayed indefinitely, since changes cause WAL records to be written and that would soon cause another checkpoint. No, the situation of most concern is where we make some change and then nothing happens for a very long time. If there's a continuing flow of updates, then yes a checkpoint will happen ... eventually. Even with the assumption of continuing updates, the time until a commit is checkpointed might be vastly longer than the configured checkpoint interval, and that's an interval in which loss of the WAL log is more dangerous than it was in any previous release. 
So basically, this fix is giving up some hard-to-quantify amount of data security. Per this thread, there are other ways in which we can fix the useless-checkpoint problem without giving up any such security. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On Thursday, June 07, 2012 07:03:32 PM Simon Riggs wrote: On 7 June 2012 17:12, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 07.06.2012 18:51, Simon Riggs wrote: On 7 June 2012 14:50, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: These changes will help the XLogInsert scaling patch ...and as I'm sure you're aware will junk much of the replication code and almost certainly set back the other work that we have brewing for 9.3. So this is a very large curve ball you're throwing there. I don't think this has much impact on what you're doing (although it's a bit hard to tell without more details). The way WAL records work is the same, it's just the code that lays them out on a page, and reads back from a page, that's changed. And that's fairly isolated in xlog.c. I wasn't worried about the code overlap, but the subsidiary breakage looks pretty enormous to me. The xlog arithmetic will still be encapsulated, so not much difference there. Removing reading of XLogContRecord isn't complicated and would result in less code. Shouldn't be much more than that. Anything changing filenames will break every HA config anybody has anywhere. So you can pretty much kiss goodbye to the idea of pg_upgrade. For me, this one thing alone is sufficient to force next release to be 10.0. Hm? WAL isn't relevant for pg_upgrade. And the HA setups should rely on archive_command and such and not do computation of the next/last name. I would guess removing that corner-case actually fixes more tools than it breaks. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
Simon Riggs si...@2ndquadrant.com wrote: Anything changing filenames will break every HA config anybody has anywhere. It will impact our scripts related to backup and archiving, but I think we're talking about two or three staff days to cover it in our shop. We should definitely make sure that this change is conspicuously noted. The scariest part is that there will now be files that matter with names that previously didn't exist, so lack of action will cause failure to capture a usable backup. I don't know that it merits a bump to 10.0, though. We test every backup for usability, as I believe any shop should; failure to cover this should cause pretty obvious errors pretty quickly if you are testing your backups. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Avoiding adjacent checkpoint records
Robert Haas robertmh...@gmail.com writes: ... It's better to have a few unnecessary checkpoints than to risk losing somebody's data, especially since the unnecessary checkpoints only happen with wal_level=hot_standby, but the data loss risk exists for everyone. Yeah, that's another point here: the benefit of the patch accrues to a different set of people than the ones paying the penalty. If you've got hot standby enabled, presumably you are replicating to at least one slave and so the prospect of data loss via WAL loss is mitigated for you. I also note that the other work done in 9.2 to reduce idle-system load did not address replication configurations at all; I think we still have time-driven wakeups in walsender and walreceiver for instance. So I'd rather revert the patch now, and consider that a better fix will be part of a future round of work to reduce the idle-system load in replication setups. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Could we replace SysV semaphores with latches?
On 06/07/2012 06:09 AM, Tom Lane wrote: There has been regular griping in this list about our dependence on SysV shared memory, but not so much about SysV semaphores, even though the latter cause their fair share of issues; as seen for example in buildfarm member spoonbill's recent string of failures: creating template1 database in /home/pgbuild/pgbuildfarm/HEAD/pgsql.25563/src/test/regress/./tmp_check/data/base/1 ... FATAL: could not create semaphores: No space left on device DETAIL: Failed system call was semget(1, 17, 03600). HINT: This error does *not* mean that you have run out of disk space. It occurs when either the system limit for the maximum number of semaphore sets (SEMMNI), or the system wide maximum number of semaphores (SEMMNS), would be exceeded. You need to raise the respective kernel parameter. Alternatively, reduce PostgreSQL's consumption of semaphores by reducing its max_connections parameter. The PostgreSQL documentation contains more information about configuring your system for PostgreSQL. child process exited with exit code 1 Hmm, now that you mention it, I had missed the issue completely - the problem here is that spoonbill only has resources for one running postmaster, and the failure on 24.5.2012 left a stray postmaster instance behind - should be fixed now. Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
Robert Haas robertmh...@gmail.com wrote: But if you're just using regexp matching against pathnames, your tool will be just fine. Do your tools actually rely on the occasional absence of a file in what would otherwise be the usual sequence of files? To save snapshot backups for the long term, we generate a list of the specific WAL files needed to reach a consistent recovery point from a given base backup. We keep monthly snapshot backups for a year. We currently determine the first and last file needed, and then create a list of all the WAL files to save. We error out if any are missing, so we do skip the FF file. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Early hint bit setting
On Wed, Jun 6, 2012 at 6:41 PM, Jim Nasby j...@nasby.net wrote: Except that's only true when there are no other transactions running. That's been one of the big sticking points about trying to proactively set hint bits; in a real system you're not going to gain very much unless you wait a while before setting them. No, the committed hint bit just means that the transaction is committed. You don't have to wait for it to be all-visible. I think my biggest concern about this is that it inevitably relies on some assumption about how much latency there will be before the client sends the next request. That strikes me as impossible to tune. On system A, connected to the Internet via an overloaded 56k modem link, you can get away with doing a huge amount of fiddling around while waiting for the next request. But on system B, which uses 10GE or Infiniband or local sockets, the acceptable latency will be much less. Even given identical hardware, scheduler behavior may matter quite a lot - rumor has it that FreeBSD's scheduling latency may be significantly less than on Linux, although I have not verified it and rumor may lie. But the point is that whether or not this works out to a win on any given system seems like it will depend on an awful lot of stuff that we can't know or control. I would be more inclined to look at trying to make this happen in a background process, although that's not without its own challenges. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On Thu, Jun 7, 2012 at 1:40 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: But if you're just using regexp matching against pathnames, your tool will be just fine. Do your tools actually rely on the occasional absence of a file in what would otherwise be the usual sequence of files? To save snapshot backups for the long term, we generate a list of the specific WAL files needed to reach a consistent recovery point from a given base backup. We keep monthly snapshot backups for a year. We currently determine the first and last file needed, and then create a list of all the WAL files to save. We error out if any are missing, so we do skip the FF file. OK, I see. Still, I think there are a lot of people who don't do anything that complex, and won't be affected. But I agree we had better clearly release-note it as an incompatibility. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On Thu, Jun 7, 2012 at 1:15 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Simon Riggs si...@2ndquadrant.com wrote: Anything changing filenames will break every HA config anybody has anywhere. It will impact our scripts related to backup and archiving, but I think we're talking about two or three staff days to cover it in our shop. We should definitely make sure that this change is conspicuously noted. The scariest part is that there will now be files that matter with names that previously didn't exist, so lack of action will cause failure to capture a usable backup. But if you're just using regexp matching against pathnames, your tool will be just fine. Do your tools actually rely on the occasional absence of a file in what would otherwise be the usual sequence of files? ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WalSndWakeup() and synchronous_commit=off
On 6 June 2012 20:11, Andres Freund and...@2ndquadrant.com wrote: On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote: Hi, On Monday, May 28, 2012 07:11:53 PM Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: Does anybody have a better idea than to either call WalSndWakeup() at essentially the wrong places or calling it inside a critical section? Tom, what danger do you see from calling it in a critical section? My concern was basically that it might throw an error. Looking at the current implementation of SetLatch, it seems that's not possible, but I wonder whether we want to lock ourselves into that assumption. The assumption is already made at several other places I think. XLogSetAsyncXactLSN does a SetLatch and is called from critical sections; several signal handlers call it without any attention to the context. Requiring it to be called outside would make its usage considerably less convenient and I don't really see what could change that would require to throw non-panic errors. Still, if the alternatives are worse, maybe that's the best answer. If we do that, though, let's add comments to WalSndWakeup and SetLatch mentioning that they mustn't throw error. Patch attached. I would like to invite some more review (+commit...) here ;). Imo this is an annoying bug which should be fixed before next point release or beta/rc comes out... Moved the wakeup to a logical place outside a critical section. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
Simon Riggs si...@2ndquadrant.com writes: Anything changing filenames will break every HA config anybody has anywhere. This seems like nonsense to me. How many external scripts are likely to know that we skip the FF page? There might be some, but not many. So you can pretty much kiss goodbye to the idea of pg_upgrade. And that is certainly nonsense. I don't think pg_upgrade even knows about this, and if it does we can surely fix it. For me, this one thing alone is sufficient to force next release to be 10.0. Huh? We make incompatible changes in major versions all the time. This one does not appear to me to be worse than many others. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
On 7 June 2012 19:52, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: Anything changing filenames will break every HA config anybody has anywhere. This seems like nonsense to me. How many external scripts are likely to know that we skip the FF page? There might be some, but not many. If that is the only change in filenames, then all is forgiven. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLog changes for 9.3
Simon Riggs si...@2ndquadrant.com writes: On 7 June 2012 19:52, Tom Lane t...@sss.pgh.pa.us wrote: This seems like nonsense to me. How many external scripts are likely to know that we skip the FF page? There might be some, but not many. If that is the only change in filenames, then all is forgiven. Oh, now I see what you're on about. Yes, I agree that we should maintain the same formatting of WAL segment file names, even though it will be rather artificial in the 64-bit-arithmetic world. The only externally visible change should be the creation of FF-numbered files where formerly those were skipped. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
Simon Riggs si...@2ndquadrant.com writes: On 7 June 2012 17:34, Tom Lane t...@sss.pgh.pa.us wrote: Oh, I must be confused about which patch we are talking about --- I thought this was in reference to some of the WIP ideas that were being thrown about with respect to using lock-free access primitives. Which patch are you proposing for commit now, exactly? Both of these, as attached up thread. Simon's patch - dropallforks.v1.patch Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch (needs a little tidyup) OK, will take a look. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WalSndWakeup() and synchronous_commit=off
On Thursday, June 07, 2012 08:41:23 PM Simon Riggs wrote: On 6 June 2012 20:11, Andres Freund and...@2ndquadrant.com wrote: On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote: Hi, On Monday, May 28, 2012 07:11:53 PM Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: Does anybody have a better idea than to either call WalSndWakeup() at essentially the wrong places or calling it inside a critical section? Tom, what danger do you see from calling it in a critical section? My concern was basically that it might throw an error. Looking at the current implementation of SetLatch, it seems that's not possible, but I wonder whether we want to lock ourselves into that assumption. The assumption is already made at several other places I think. XLogSetAsyncXactLSN does a SetLatch and is called from critical sections; several signal handlers call it without any attention to the context. Requiring it to be called outside would make its usage considerably less convenient and I don't really see what could change that would require to throw non-panic errors. Still, if the alternatives are worse, maybe that's the best answer. If we do that, though, let's add comments to WalSndWakeup and SetLatch mentioning that they mustn't throw error. Patch attached. I would like to invite some more review (+commit...) here ;). Imo this is an annoying bug which should be fixed before next point release or beta/rc comes out... Moved the wakeup to a logical place outside a critical section. Hm. I don't really like the way you implemented that. While it reduces the likelihood quite a bit it will still miss wakeups if an XLogInsert pushes out the data because of missing space or if any place does an XLogFlush(lsn). The knowledge is really only available in XLogWrite... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
Jeff Janes jeff.ja...@gmail.com writes: On 30 May 2012 12:10, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Also, I wonder if DropRelFileNodeBuffers() could scan the pool without grabbing the spinlocks on every buffer? It could do an unlocked test first, and only grab the spinlock on buffers that need to be dropped. Sounds less good and we'd need reasonable proof it actually did anything useful without being dangerous. Doing an initial unlocked test speeds things up another 2.69 fold (on top of 3.55 for your patch) for me, with 1GB of shared buffers. That seems like it should be worthwhile. With shared_buffers set to 1GB, I see about a 2X reduction in the total time to drop a simple table, ie create table zit(f1 text primary key); drop table zit; (This table definition is chosen to ensure there's an index and a toast table involved, so several scans of the buffer pool are needed.) The DROP goes from about 40ms to about 20ms on a fairly recent Xeon desktop. So I'm convinced this is a win. I extended the patch to also cover DropDatabaseBuffers, FlushRelationBuffers, and FlushDatabaseBuffers, which have got the exact same type of full-pool scan loop, and committed it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
I wrote: Simon Riggs si...@2ndquadrant.com writes: Both of these, as attached up thread. Simon's patch - dropallforks.v1.patch Jeff's patch - DropRelFileNodeBuffers_unlock_v1.patch (needs a little tidyup) OK, will take a look. I didn't like dropallforks.v1.patch at all as presented, for several reasons: * Introducing an AllForks notation that only works in some contexts is a foot-gun of large caliber. This concern is not academic: you broke dropping of temp relations entirely, in the patch as presented, because for temp rels DropRelFileNodeBuffers would hand off to DropRelFileNodeLocalBuffers and the latter had not been taught about AllForks. * Since we have found out in this thread that the inner loop of DropRelFileNodeBuffers is performance-critical for the cases we're dealing with, it seems inappropriate to me to make its tests more complex. We want simpler, and we can have simpler given that the relation-drop case cares neither about fork nor block number. * The patch modified behavior of XLogDropRelation, which has not been shown to be performance-critical, and probably can't be because the hashtable it searches should never be all that large. It certainly doesn't matter to the speed of normal execution. I thought it would be a lot safer and probably a little bit quicker if we just split DropRelFileNodeBuffers into two routines, one for the specific-fork case and one for the all-forks case; and then the same for its main caller smgrdounlink. So I modified the patch along those lines and committed it. As committed, the smgrdounlinkfork case is actually dead code; it's never called from anywhere. I left it in place just in case we want it someday. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] New Postgres committer: Kevin Grittner
I am pleased to announce that Kevin Grittner has accepted the core committee's invitation to become our newest committer. As you all know, Kevin's done a good deal of work on the project over the past couple of years. We judged that he has the requisite skills, dedication to the project, and a suitable degree of caution to be a good committer.

Please join me in welcoming him aboard.

regards, tom lane
Re: [HACKERS] New Postgres committer: Kevin Grittner
On Jun 7, 2012, at 18:15, Tom Lane wrote:
> I am pleased to announce that Kevin Grittner has accepted the core committee's invitation to become our newest committer. As you all know, Kevin's done a good deal of work on the project over the past couple of years. We judged that he has the requisite skills, dedication to the project, and a suitable degree of caution to be a good committer. Please join me in welcoming him aboard.

Congratulations, Kevin! Well-deserved!

Michael Glaesemann
grzm seespotcode net
Re: [HACKERS] New Postgres committer: Kevin Grittner
On 7 June 2012 23:15, Tom Lane t...@sss.pgh.pa.us wrote:
> I am pleased to announce that Kevin Grittner has accepted the core committee's invitation to become our newest committer. As you all know, Kevin's done a good deal of work on the project over the past couple of years. We judged that he has the requisite skills, dedication to the project, and a suitable degree of caution to be a good committer. Please join me in welcoming him aboard.

Kevin, with great power comes great responsibility. In other words, your workload has just gone up. ;) You'll be a great addition to the existing team of committers.

--
Thom
Re: [HACKERS] New Postgres committer: Kevin Grittner
On 7 June 2012 23:15, Tom Lane t...@sss.pgh.pa.us wrote:
> Please join me in welcoming him aboard.

Congratulations, Kevin.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] WIP: parameterized function scan
On 05/24/2012 12:46 AM, Tom Lane wrote:
> Well, it's not per spec: what you did accepts queries that are invalid per spec and are very likely to be errors rather than intentional invocations of the LATERAL facility. This might be all right for

I think I saw queries where a function is joined with no explicit LATERAL(). Nevertheless...

> Quite honestly, I think that something that only handles functions as LATERAL righthands is broken by design. It doesn't meet the spec, and it's unlikely to represent much of a step towards a full implementation. Remember Polya's Inventor's Paradox: the more general problem may be easier to solve.

... sounds more serious. I'll keep it in mind in case I come up with a new idea about this problem.

Thanks, Tony H.
Re: [HACKERS] New Postgres committer: Kevin Grittner
On 7 June 2012 23:40, Peter Geoghegan pe...@2ndquadrant.com wrote:
> On 7 June 2012 23:15, Tom Lane t...@sss.pgh.pa.us wrote:
>> Please join me in welcoming him aboard.
>
> Congratulations, Kevin.

Idle thought for the web team: Now might be a good time to take down the blurb on .org in which Kevin describes the mailing list support as amazing. :-)

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] New Postgres committer: Kevin Grittner
On Thu, Jun 7, 2012 at 5:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Please join me in welcoming him aboard.

Congratulations Kevin

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
[HACKERS] log_newpage header comment
It seems that in implementing gistbuildempty(), I falsified the first note in the header comment for log_newpage():

 * Note: all current callers build pages in private memory and write them
 * directly to smgr, rather than using bufmgr. Therefore there is no need
 * to pass a buffer ID to XLogInsert, nor to perform MarkBufferDirty within
 * the critical section.

Mea culpa, mea culpa, mea maxima culpa. So, this leads to a couple of questions:

1. Considering that we're logging the entire page, is it necessary (or even desirable) to include the buffer ID in the rdata structure? If so, why? To put that another way, does my abuse of log_newpage constitute a bug in gistbuildempty()?

2. Should we add a new function that does the same thing as log_newpage for a shared buffer? I'm imagining that the signature would be:

XLogRecPtr log_newpage_buffer(RelFileNode *rnode, Buffer buffer);

Admittedly, it may seem that we don't quite need such a function to cater to just one caller, but I think we might have more in the future. Among other things, it occurs to me to wonder whether I ought to rewrite all the ambuildempty routines in the style of gistbuildempty(), to avoid having to fsync in the foreground. Even if not, I think there might be other applications for this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] New Postgres committer: Kevin Grittner
On Thu, Jun 7, 2012 at 7:09 PM, Jaime Casanova ja...@2ndquadrant.com wrote:
> On Thu, Jun 7, 2012 at 5:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> Please join me in welcoming him aboard.
>
> Congratulations Kevin

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] WalSndWakeup() and synchronous_commit=off
On 7 June 2012 21:08, Andres Freund and...@2ndquadrant.com wrote:
>> Moved the wakeup to a logical place outside a critical section.
>
> Hm. I don't really like the way you implemented that. While it reduces the likelihood quite a bit, it will still miss wakeups if an XLogInsert pushes out the data because of missing space or if any place does an XLogFlush(lsn). The knowledge is really only available in XLogWrite...

Right, but the placement inside the critical section was objected to. This way, any caller of XLogFlush() will be swept up at least once per wal_writer_delay, so missing a few calls doesn't mean we have spikes in replication delay. Doing it more frequently was also an objection from Fujii, to which we must listen.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
On 7 June 2012 22:54, Tom Lane t...@sss.pgh.pa.us wrote:
> I thought it would be a lot safer and probably a little bit quicker if we just split DropRelFileNodeBuffers into two routines, one for the specific-fork case and one for the all-forks case; and then the same for its main caller smgrdounlink. So I modified the patch along those lines and committed it. As committed, the smgrdounlinkfork case is actually dead code; it's never called from anywhere. I left it in place just in case we want it someday.

That's fine. The first version of the patch did it exactly that way. I tried to second-guess objections and so recoded it the way I submitted it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] log_newpage header comment
Robert Haas robertmh...@gmail.com writes:
> It seems that in implementing gistbuildempty(), I falsified the first note in the header comment for log_newpage():
>
>  * Note: all current callers build pages in private memory and write them
>  * directly to smgr, rather than using bufmgr. Therefore there is no need
>  * to pass a buffer ID to XLogInsert, nor to perform MarkBufferDirty within
>  * the critical section.
>
> 1. Considering that we're logging the entire page, is it necessary (or even desirable) to include the buffer ID in the rdata structure? If so, why? To put that another way, does my abuse of log_newpage constitute a bug in gistbuildempty()?

AFAICS, not passing the buffer ID to XLogInsert is not an issue, since we are logging the whole page in any case. However, failing to perform MarkBufferDirty within the critical section definitely is an issue.

> 2. Should we add a new function that does the same thing as log_newpage for a shared buffer? I'm imagining that the signature would be:

Either that, or rethink building this data in shared buffers. What's the point of that, exactly, for a page that we are most certainly not going to use in normal operation?

regards, tom lane
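[Editor's note] Tom's answer is about ordering: marking the buffer dirty, inserting the WAL record, and setting the page LSN must all happen inside one critical section, so there is no window in which the buffer holds a WAL-logged change without being flagged dirty. A toy model of the sequence follows; the names mimic the proposal but this is not the real bufmgr/xlog API.

```python
wal = []  # stand-in for the WAL stream

class Buf:
    def __init__(self, page):
        self.page = page
        self.dirty = False
        self.lsn = 0

def log_newpage_buffer(buf):
    """Model of the proposed helper: WAL-log a full-page image of a page
    that lives in a shared buffer. The three steps below stand in for the
    critical section; in the backend, a failure between them must
    escalate to PANIC rather than leave them half-done."""
    buf.dirty = True                 # MarkBufferDirty, inside the crit section
    wal.append(("FPI", buf.page))    # XLogInsert of the whole page
    buf.lsn = len(wal)               # PageSetLSN to the new record's position
    return buf.lsn

buf = Buf(b"empty metapage")
lsn = log_newpage_buffer(buf)
assert buf.dirty and buf.lsn == lsn == 1
```

The existing log_newpage() can skip the dirty-marking step only because its callers write pages straight to smgr, bypassing shared buffers entirely; once the page lives in a shared buffer, omitting it is the bug Tom identifies.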
Re: [HACKERS] slow dropping of tables, DropRelFileNodeBuffers, tas
On Thu, 7 Jun 2012, Tom Lane wrote:
> I extended the patch to also cover DropDatabaseBuffers, FlushRelationBuffers, and FlushDatabaseBuffers, which have got the exact same type of full-pool scan loop, and committed it.

Thanks everybody for the patches/commits.

Sergey

* Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: [HACKERS] Avoiding adjacent checkpoint records
On 7 June 2012 18:03, Robert Haas robertmh...@gmail.com wrote:
> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> Clearly, delaying checkpoint indefinitely would be a high risk choice. But they won't be delayed indefinitely, since changes cause WAL records to be written and that would soon cause another checkpoint.
>
> But that's exactly the problem - it might not be soon at all. We have customers who process about one write transaction per day. The current state of play in 9.2 is that we'll checkpoint when we get to the next WAL segment. But at one write transaction per day, that could take weeks or months. Someone invented a setting called 'checkpoint_timeout' for a very good reason - people don't want huge amounts of time to pass between checkpoints. That setting is currently capped at one hour; the proposition that someone who sets it to 5 minutes on a low-write-volume system is OK with the effective value being 5 months does not seem likely to me.

We discussed that in the original thread. The purpose of a checkpoint is to reduce recovery time. Nobody actually wants a checkpoint for its own sake, so why would we care how many months it takes? I grant that it takes a little while to come to terms with that thought, but that doesn't make the patch wrong.

The only risk of data loss is in the case where someone deletes their pg_xlog without having taken a backup in all that time, which is hardly recommended behaviour. We're at exactly the same risk of data loss if someone deletes their pg_clog. Too-frequent checkpoints actually make the data loss risk from a deleted pg_clog greater, so the balance of data loss risk doesn't seem to have altered.

> Your suggestion of just checkpointing CLOG seems like it would mitigate the worst of the damage, but I think I agree with Tom that just reverting the whole patch would be a better solution, if we can't figure out anything cleaner. It's better to have a few unnecessary checkpoints than to risk losing somebody's data, especially since the unnecessary checkpoints only happen with wal_level=hot_standby, but the data loss risk exists for everyone.

We're not at risk of losing anyone's data. There is no aspect of the normal running system that is at a higher risk of data loss. I don't think it's fair comment to claim the patch has issues because you've found a guy with very, very, very low write rates, no backup strategy, and a propensity to delete essential parts of his database who will be adversely affected.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
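[Editor's note] One way to see the dispute is that the patch changes the timed-checkpoint decision from "timeout elapsed" to "timeout elapsed AND some WAL has been inserted since the last checkpoint". The sketch below is a deliberately simplified model of that predicate, not the actual xlog.c test; the real code is more involved (it must discount the previous checkpoint record itself, and the 9.2 behavior Robert describes is coarser still, waiting for the insert position to cross into the next WAL segment).

```python
def timed_checkpoint_due(elapsed_s, timeout_s, insert_lsn, prev_ckpt_end_lsn):
    """Simplified model: even when checkpoint_timeout has expired, skip
    the checkpoint if the WAL insert position has not moved past the end
    of the previous checkpoint's record, i.e. the system was idle."""
    if elapsed_s < timeout_s:
        return False
    return insert_lsn > prev_ckpt_end_lsn

# idle system: timeout expires but nothing was written, so no checkpoint
assert timed_checkpoint_due(600, 300, insert_lsn=1000, prev_ckpt_end_lsn=1000) is False
# one small write since the last checkpoint: the next timed checkpoint fires
assert timed_checkpoint_due(600, 300, insert_lsn=1040, prev_ckpt_end_lsn=1000) is True
```

Robert's objection is visible in the first branch: on a system that writes almost nothing, the skip condition can hold across many timeout periods, so the effective interval between checkpoints is unbounded by checkpoint_timeout.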
Re: [HACKERS] Avoiding adjacent checkpoint records
On 7 June 2012 18:03, Robert Haas robertmh...@gmail.com wrote:
> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> Clearly, delaying checkpoint indefinitely would be a high risk choice. But they won't be delayed indefinitely, since changes cause WAL records to be written and that would soon cause another checkpoint.
>
> But that's exactly the problem - it might not be soon at all. We have customers who process about one write transaction per day. The current state of play in 9.2 is that we'll checkpoint when we get to the next WAL segment. But at one write transaction per day, that could take weeks or months. Someone invented a setting called 'checkpoint_timeout' for a very good reason - people don't want huge amounts of time to pass between checkpoints. That setting is currently capped at one hour; the proposition that someone who sets it to 5 minutes on a low-write-volume system is OK with the effective value being 5 months does not seem likely to me.

ISTM that this low-volume system accepts approximately the same risk of data loss as a high-volume system that writes the same volume of data, but in 5 minutes rather than 5 months. You might argue that the scope of things that can go wrong in 5 months is far greater than the scope of 5 minutes. I might counter that the mechanical wear attendant to constant writes is a more important factor, or flash memory having gone through one too many P/E cycles and become unreliable.

The bottom line is that once you've flushed WAL, you're by definition home free - if you're relying on a checkpoint to save you from data loss, you're in big trouble anyway. Checkpoints are theoretically just for putting some cap on recovery time; that's pretty much how they're described in the literature.

Your customer's use-case seems very narrow, and your complaint seems unusual to me, but couldn't you just get the customer to force checkpoints in a cronjob or something? A CHECKPOINT statement (CheckPointStmt) will force one, provided !RecoveryInProgress().

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Avoiding adjacent checkpoint records
Peter Geoghegan pe...@2ndquadrant.com writes:
> On 7 June 2012 18:03, Robert Haas robertmh...@gmail.com wrote:
>> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote:
>>> Clearly, delaying checkpoint indefinitely would be a high risk choice. But they won't be delayed indefinitely, since changes cause WAL records to be written and that would soon cause another checkpoint.
>>
>> But that's exactly the problem - it might not be soon at all.
>
> Your customer's use-case seems very narrow, and your complaint seems unusual to me, but couldn't you just get the customer to force checkpoints in a cronjob or something? CheckPointStmt will force, provided !RecoveryInProgress().

I think both you and Simon have completely missed the point. This is not a use case in the sense of someone doing it deliberately. This is about data redundancy, ie, if you lose your WAL through some unfortunate mishap, are you totally screwed or is there a reasonable chance that the data is on-disk in the main data store? I would guess that the incidents Robert has been talking about were cases where EDB were engaged to do crash recovery, and were successful precisely because PG wasn't wholly dependent on the WAL copy of the data.

This project has always put reliability first. It's not clear to me why we would compromise that across-the-board in order to slightly reduce idle load in replication configurations. Yeah, it's probably not a *large* compromise ... but it is a compromise, and one that doesn't seem to me to be very well-advised. We can fix the idle-load issue without compromising on this basic goal; it will just take more than a ten-line patch to do it.

regards, tom lane