Re: [HACKERS] Removing PD_ALL_VISIBLE
On 17 January 2013 03:02, Jeff Davis pg...@j-davis.com wrote: Rebased patch attached. No significant changes. Jeff, can you summarise/collate why we're doing this, what concerns it raises and how you've dealt with them? That will help decide whether to commit. Thanks -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CF3+4
On Jan 17, 2013 8:15 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: At 2013-01-17 16:05:05 +0900, michael.paqu...@gmail.com wrote: Is it really necessary to create a new commit fest just to move the items? Marking the patches that are considered as being too late for 9.3 should be just returned with feedback. Opening 2013-03 is not so much to move existing patches, but to give people a place to submit *new* post-9.3 patches without interrupting the 2013-01 CF. Yeah, and +1 for doing that. The sooner the better. By whichever of the time frames for the cf that had been discussed, it should certainly not be open for accepting new patches for 9.3 anymore. /Magnus
Re: [HACKERS] Removing PD_ALL_VISIBLE
On Thu, Jan 17, 2013 at 2:11 PM, Simon Riggs si...@2ndquadrant.com wrote: On 17 January 2013 03:02, Jeff Davis pg...@j-davis.com wrote: Rebased patch attached. No significant changes. Jeff, can you summarise/collate why we're doing this, what concerns it raises and how you've dealt with them? That will help decide whether to commit. +1. On another thread Set visibility map bit after HOT prune, Robert mentioned that its not such a good idea to remove the PD_ALL_VISIBLE flag because it helps us to reduce the contention on the VM page, especially when we need to reset the VM bit. Here is an excerpt from Robert's comment that thread: Sure, but you're zipping rather blithely past the disadvantages of such an approach. Jeff Davis recently proposed getting rid of PD_ALL_VISIBLE, and Tom and I both expressed considerable skepticism about that; this proposal has the same problems. One of the major benefits of PD_ALL_VISIBLE is that, when it isn't set, inserts, updates, and deletes to the page can ignore the visibility map. That means that a server under heavy concurrency is much less likely to encounter contention on the visibility map blocks. Now, maybe that's not really a problem, but I sure haven't seen enough evidence to make me believe it. If it's really true that PD_ALL_VISIBLE needn't fill this role, then Heikki wasted an awful lot of time implementing it, and I wasted an awful lot of time keeping it working when I made the visibility map crash-safe for IOS. That could be true, but I tend to think it isn't. May be you've already addressed that concern with the proven performance numbers, but I'm not sure. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CF3+4 (was Re: Parallel query execution)
On Thu, Jan 17, 2013 at 8:19 AM, Pavan Deolasee pavan.deola...@gmail.com wrote: On Wed, Jan 16, 2013 at 1:51 PM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: At 2013-01-16 02:07:29 -0500, t...@sss.pgh.pa.us wrote: In case you hadn't noticed, we've totally lost control of the CF process. What can we do to get it back on track? I know various people (myself included) have been trying to keep CF3 moving, e.g. sending followup mail, adjusting patch status, etc. I want to help, but I don't know what's wrong. What are the committers working on, and what is the status of the Ready for commiter patches? Is the problem that the patches marked Ready aren't, in fact, ready? Or is it lack of feedback from authors? Or something else? ISTM that even committers are often overwhelmed with the work, their own as well as that of reviewing other's patches and committing them. Especially when a patch is large or touches core areas, I can feel the significant work that the committer has to do even after some help and initial review from CF reviewers. On the other hand, if the patches are not committed in time, we loose context, interest dies down and when the patch is actually picked up by a committer who is often not involved in the original discussion, many points need to be revisited and reworked. Would it help to step up a few developers and create a second line of committers ? The commits by the second line committers will still be reviewed by the first line committers before they make into the product, but may be at later stage or when we are in beta. I thought of even suggesting that the master branch will contain only commits by the first line committers. We then maintain a secondary branch which also have commits from the second line committers in addition to all commits from the master branch. The first line committers can then cherry pick from the secondary branch at some later stage. But not sure if this will add more overhead and defeat the problem at hand. While we can certainly do that, it would probably help just to havae a second line of reviewers. Basically a set of more senior reviewers - so a patch would go submission - reviewer - senior reviewer - committer. With the second line of reviewers focusing more on the whole how to do things, etc. As you, I'm not sure if it creates more overhead than it solves, but it might be worth a try. In a way it already exists, because I'm sure committers pay slightly more attention to reviews by people who ahve been doing it a lot and are known to process those things, than to new entries. All that woudl bee needed was for those people to realize it would be helpful for them to do a second-stage review even if somebody else has done the first one. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
At 2013-01-17 08:41:37 +, si...@2ndquadrant.com wrote: Jeff, can you summarise/collate why we're doing this, what concerns it raises and how you've dealt with them? Since I was just looking at the original patch and discussion, and since Pavan has posted an excerpt from one objection to it, here's an excerpt from Jeff's original post titled Do we need so many hint bits? http://www.postgresql.org/message-id/1353026577.14335.91.camel@sussancws0025 Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced in the visibility map patch, apparently as a way to know when to clear the VM bit when doing an update. It was then also used for scans, which showed a significant speedup. But I wonder: why not just use the visibilitymap directly from those places? It can be used for the scan because it is crash safe now (not possible before). And since it's only one lookup per scanned page, then I don't think it would be a measurable performance loss there. Inserts/updates/deletes also do a significant amount of work, so again, I doubt it's a big drop in performance there -- maybe under a lot of concurrency or something. The benefit of removing PD_ALL_VISIBLE would be significantly higher. It's quite common to load a lot of data, and then do some reads for a while (setting hint bits and flushing them to disk), and then do a VACUUM a while later, setting PD_ALL_VISIBLE and writing all of the pages again. Also, if I remember correctly, Robert went to significant effort when making the VM crash-safe to keep the PD_ALL_VISIBLE and VM bits consistent. Maybe this was all discussed before? There was considerable discussion after this (accessible through the archives link above), which I won't attempt to summarise. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: pgbench - aggregation of info written into log
On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii is...@postgresql.org wrote: This might be way more than we want to do, but there is an article that describes some techniques for doing what seems to be missing (AIUI): http://msdn.microsoft.com/en-us/magazine/cc163996.aspx Even this would be doable, I'm afraid it may not fit in 9.3 if we think about the current status of CF. So our choice would be: 1) Postpone the patch to 9.4 2) Commit the patch in 9.3 without Windows support I personally am ok with #2. We traditionally avoid particular paltform specific features on PostgreSQL. However I think the policiy could be losen for contrib staffs. Also pgbench is just a client program. We could always use pgbench on UNIX/Linux if we truely need the feature. What do you think? Fair enough, I was just trying to point out alternatives. We have committed platform-specific features before now. I hope it doesn't just get left like this, though. We have committed platform-specific features before, but generally only when it's not *possible* to do them for all platforms. For example the posix_fadvise stuff isn't available on Windows at all, so there isn't much we can do there. Yeah, I hope someone pick this up and propose as a TODO item. In the mean time, I'm going to commit the patch without Windows support unless there's objection. Perhaps we should actually hold off until someone committs to actually getting it fixed in the next version? If we do have that, then we can commit it as a partial feature, but if we just hope someone picks it up, that's leaving it very loose.. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
On Thu, Jan 17, 2013 at 2:57 PM, Abhijit Menon-Sen a...@2ndquadrant.comwrote: There was considerable discussion after this (accessible through the archives link above), which I won't attempt to summarise. I thought Robert made those comments after considerable discussions on Jeff's approach. So he probably still stands by his objections or at least not satisfied/seen the numbers. Now that I look at the patch, I wonder if there is another fundamental issue with the patch. Since the patch removes WAL logging for the VM set operation, this can happen: 1. Vacuum kicks in and clears all dead tuples in a page and decides that its all-visible 2. Vacuum WAL-logs the cleanup activity and marks the page dirty 3. Vacuum sets the visibility bit and marks the VM page dirty 4. Say the VM page gets written to the disk. The heap page is not yet written neither the WAL log corresponding to the cleanup operation 5. CRASH After recovery, the VM bit will remain set because the VM page got written before the crash. But since heap page's cleanup WAL did not made to the disk, those operations won't be replayed. The heap page will be left with not-all-visible tuples in that case and its not a good state to be in. The original code does not have this problem because the VM set WAL gets written after the heap page cleanup WAL. So its guaranteed that the VM bit will be set during recovery only if the cleanup WAL is replayed too (there is more magic than what meets the eye and I think its not fully documented). Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Re: [HACKERS] review: pgbench - aggregation of info written into log
On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote: On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii is...@postgresql.org wrote: This might be way more than we want to do, but there is an article that describes some techniques for doing what seems to be missing (AIUI): http://msdn.microsoft.com/en-us/magazine/cc163996.aspx Even this would be doable, I'm afraid it may not fit in 9.3 if we think about the current status of CF. So our choice would be: 1) Postpone the patch to 9.4 2) Commit the patch in 9.3 without Windows support I personally am ok with #2. We traditionally avoid particular paltform specific features on PostgreSQL. However I think the policiy could be losen for contrib staffs. Also pgbench is just a client program. We could always use pgbench on UNIX/Linux if we truely need the feature. What do you think? Fair enough, I was just trying to point out alternatives. We have committed platform-specific features before now. I hope it doesn't just get left like this, though. We have committed platform-specific features before, but generally only when it's not *possible* to do them for all platforms. For example the posix_fadvise stuff isn't available on Windows at all, so there isn't much we can do there. Right - having platform specific features for other reasons like lack of time is a slippery slope in my opinion. We should not get into such a habit or Windows will quickly become a second class platform as far as PostgreSQL features are concerned. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: pgbench - aggregation of info written into log
On Thu, Jan 17, 2013 at 11:09 AM, Dave Page dp...@pgadmin.org wrote: On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote: On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii is...@postgresql.org wrote: This might be way more than we want to do, but there is an article that describes some techniques for doing what seems to be missing (AIUI): http://msdn.microsoft.com/en-us/magazine/cc163996.aspx Even this would be doable, I'm afraid it may not fit in 9.3 if we think about the current status of CF. So our choice would be: 1) Postpone the patch to 9.4 2) Commit the patch in 9.3 without Windows support I personally am ok with #2. We traditionally avoid particular paltform specific features on PostgreSQL. However I think the policiy could be losen for contrib staffs. Also pgbench is just a client program. We could always use pgbench on UNIX/Linux if we truely need the feature. What do you think? Fair enough, I was just trying to point out alternatives. We have committed platform-specific features before now. I hope it doesn't just get left like this, though. We have committed platform-specific features before, but generally only when it's not *possible* to do them for all platforms. For example the posix_fadvise stuff isn't available on Windows at all, so there isn't much we can do there. Right - having platform specific features for other reasons like lack of time is a slippery slope in my opinion. We should not get into such a habit or Windows will quickly become a second class platform as far as PostgreSQL features are concerned. Especially since there is no lack of time - the functionality is there, it just looks (significantly) different. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Robert Haas robertmh...@gmail.com writes: - adds ddl_command_trace and ddl_command_end events - causes those events to be called not only for actual SQL commands but also for things that happen, at present, to go through the same code path - adds additional magic variables to PL/pgsql to expose new information not previously exposed - adds CONTEXT as an additional event-trigger-filter variable, along with TAG In my mind, that's four or five separate patches. You don't have to agree, but that's what I think, and I'm not going to apologize for thinking it. Moreover, I think you will struggle to find examples of patches committed in the last three years that made as many different, only loosely-related changes as you're proposing for this one. So, say I do that. Now we have 5 patches to review. They all are against the same set of files, so there's a single possible ordering to apply them, and so to review them. When we reach to an agreement out-of-order (we will), it takes a sensible amount of time to rebuild the patch set in a different order. I don't think we have the time to do that. Now, if that's what it takes, I'll spend time on it. In which exact order do you want to be reviewing and applying that series of patches? As for the patch not being well-thought-out, I'm sticking by that, too. Alvaro's comments about lock escalations should be enough to tickle your alarm bells, but there are plenty of other problems here, too. Here's the comment in the code where the lock problems are discussed: /* * Fill in the EventTriggerTargetOid for a DROP command and a * ddl_command_start Event Trigger: the lookup didn't happen yet and we * still want to provide the user with that information when possible. * * We only do the lookup if the command contains a single element, which is * often forced to be the case as per the grammar (see the drop_type: * production for a list of cases where more than one object per command is * allowed). * * We take no exclusive lock here, the main command will have to do the * name lookup all over again and take appropriate locks down after the * User Defined Code runs anyway, and the commands executed by the User * Defined Code will take their own locks. We only lock the objects here so * that they don't disapear under us in another concurrent transaction, * hence using ShareUpdateExclusiveLock. XXX this seems prone to lock * escalation troubles. */ My thinking is that the user code can do anything, even run a DROP command against the object that we are preparing ourselves to be dropping. I don't think we can bypass doing the lookup twice. The way not to have to do any lookup in our code is not to report the OID of the Object being dropped. Then, the event trigger code will do that lookup as its first thing, certainly. The lookup will happen. RAISE NOTICE 'drop table % (%)', tg_objectname::regclass, tg_objectname::regclass::oid; That is, at least, the angle I considered when crafting this patch. I'm happy to change that angle if needed and adjust to things I missed and you will tell me about. I think you might want to review the use case behing ddl_command_trace, that has been asked by who users wanted to see their use case supported in some easier way than just what you're talking about here. […] Being able to do that in one command instead of two is not a sufficient reason to add another event type. That feature was proposed in past october (with a patch) and received no comment, so I went on with it. The performance costs of this patch is another couple of cache lookups when doing a DDL command, which I saw as acceptable. http://www.postgresql.org/message-id/m28vbgcc30@2ndquadrant.fr I'm not so attached to it that I will put the rest of the patch in danger with it, I'm fine about removing that part if we have a good reason to. And yes, you (politely) saying I don't like it is a good enough reason to. Our process for a non-commiter proposal seems to me that you have to send a patch showing what you're talking about for a thread to form and to get some comments on the idea, either rejecting it or reaching to a consensus about it. So the opening of that thread was in october and the discussion happens now: it's very fine, and I could be happy and thankfull about this fact with some subtle changes in the form of the messages. I'm very much opposed to that proposal, as I am to your proposal to expose internal and generated events to users. Recursing into ProcessUtility is a nasty, messy hook that is responsible for subtle bugs and locking problems in our current DDL implementation. We We are publishing 2 pieces of information tied to the implementation in the current patch: - the parse tree, only available to C code, and without any API stability promises - the fact that somme commands are running nested commands: serial does a create sequence, create schema create
Re: [HACKERS] Multiple --table options for other commands
On Fri, Dec 14, 2012 at 4:14 PM, Karl O. Pinc k...@meme.com wrote: On 12/13/2012 11:02:56 PM, Josh Kupershmidt wrote: On Thu, Dec 13, 2012 at 9:03 PM, Karl O. Pinc k...@meme.com wrote: Sorry to be so persnickety, and unhelpful until now. It seemed like it should be doable, but something was going wrong between keyboard and chair. I guess I should be doing this when better rested. No problem, here is v5 with changed synopses. The clusterdb synopsis had tabs in it, which I understand is frowned upon in the docs. I've fixed this. It looks good to me, passes check and so forth. Attached is a v6 patch, with no tabs in docs and based off the latest head. I'm marking it ready for committer. Thanks. Applied, with only some small whitespace changes. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby conflict resolution handling
On 2013-01-17 01:38:31 -0500, Tom Lane wrote: But having said that ... are we sure this code is not actually broken? ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that. You're absolutely right. ISTM, to fix it we would have to either provide a COMERROR like facility for FATAL errors or just remove the requirement of raising an error exactly there. If I remember what I tried before correctly the latter seems to involve setting openssl into nonblocking mode and check for the saved error in the EINTR loop in be-secure and re-raise the error from there. Do we want to backport either - there hasn't been any report that I could link to it right now, but on the other hand it could possibly cause rather ugly problems (data leakage, segfaults, code execution aren't all that improbable)? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: Hi all, There is a strange bug with the latest master head (commit 7fcbf6a). When the WAL stream with a master is cut on a slave, slave returns a FATAL (well normal...), but then enters in recovery process and automatically promotes. Here are more details about the logs on slave (I simply killed the master manually): FATAL: could not receive data from WAL stream: cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00010004’: No such file or directory LOG: record with zero length at 0/401E1B8 LOG: redo done at 0/401E178 LOG: last completed transaction was at log time 2013-01-17 20:27:53.180971+09 cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0002.history’: No such file or directory LOG: selected new timeline ID: 2 cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0001.history’: No such file or directory LOG: archive recovery complete DEBUG: resetting unlogged relations: cleanup 0 init 1 LOG: database system is ready to accept connections LOG: autovacuum launcher started DEBUG: archived transaction log file 00010004 DEBUG: archived transaction log file 0002.history LOG: statement: create table bn (a int); DEBUG: autovacuum: processing database postgres Slave does not try anymore to reconnect to master with messages of the type: FATAL: could not connect to the primary server I also noticed that there is some delay until modifications on master are visible on slave. For example run a simple CREATE TABLE and the new table is not [some bisecting later...] I think that bug has been introduced by commit 7fcbf6a. Before splitting xlog reading as a separate facility things worked correctly. There are also no delay problems before this commit. Does someone else noticed that? No, at least I haven't, but it got committed only a rather short time ago ;) Looking, I have an inkling where the rpoblem could be. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: pgbench - aggregation of info written into log
Dne 17.01.2013 10:36, Magnus Hagander napsal: On Thu, Jan 17, 2013 at 2:35 AM, Tatsuo Ishii is...@postgresql.org wrote: This might be way more than we want to do, but there is an article that describes some techniques for doing what seems to be missing (AIUI): http://msdn.microsoft.com/en-us/magazine/cc163996.aspx Even this would be doable, I'm afraid it may not fit in 9.3 if we think about the current status of CF. So our choice would be: 1) Postpone the patch to 9.4 2) Commit the patch in 9.3 without Windows support I personally am ok with #2. We traditionally avoid particular paltform specific features on PostgreSQL. However I think the policiy could be losen for contrib staffs. Also pgbench is just a client program. We could always use pgbench on UNIX/Linux if we truely need the feature. What do you think? Fair enough, I was just trying to point out alternatives. We have committed platform-specific features before now. I hope it doesn't just get left like this, though. We have committed platform-specific features before, but generally only when it's not *possible* to do them for all platforms. For example the posix_fadvise stuff isn't available on Windows at all, so there isn't much we can do there. Maybe, although this platform-dependence already exists in pgbench to some extent (the Windows branch is unable to log the timestamps of transactions). It would certainly be better if pgbench was able to handle Windows and Linux equally, but that was not the aim of this patch. Yeah, I hope someone pick this up and propose as a TODO item. In the mean time, I'm going to commit the patch without Windows support unless there's objection. Perhaps we should actually hold off until someone committs to actually getting it fixed in the next version? If we do have that, then we can commit it as a partial feature, but if we just hope someone picks it up, that's leaving it very loose.. Well, given that I'm an author of that patch, that 'someone' would have to be me. The problem is I have access to absolutely no Windows machines, not mentioning the development tools (and that I have no clue about it). I vaguely remember there were people on this list doing Windows development on a virtual machine or something. Any interest in working on this / giving me some help? Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: pgbench - aggregation of info written into log
Dne 17.01.2013 11:16, Magnus Hagander napsal: On Thu, Jan 17, 2013 at 11:09 AM, Dave Page dp...@pgadmin.org wrote: On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote: We have committed platform-specific features before, but generally only when it's not *possible* to do them for all platforms. For example the posix_fadvise stuff isn't available on Windows at all, so there isn't much we can do there. Right - having platform specific features for other reasons like lack of time is a slippery slope in my opinion. We should not get into such a habit or Windows will quickly become a second class platform as far as PostgreSQL features are concerned. Especially since there is no lack of time - the functionality is there, it just looks (significantly) different. Really? Any link to relevant docs or something? When doing some research in this field, the only option I was able to come up with was combining gettimeofday() with the timing functionality, and do something like this: 1) call gettimeofday() at thread start, giving a common unix timestamp 2) measure the time from the thread start using the conters (for each transaction) 3) combine those values This might of course give up to a second difference compared to the actual time (because of the gettimeofday precision), but IMHO that's fine. An even simpler option would omit the (1), so the timestamps would start at 0. Or is there a better way? Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Materialized views WIP patch
On Wed, Jan 16, 2013 at 1:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Where I really need someone to hit me upside the head with a clue-stick is the code I added to the bottom of RelationBuildDesc() in relcache.c. The idea is that on first access to an unlogged MV, to detect that the heap has been replaced by the init fork, set relisvalid to false, and make the heap look normal again. Hmm. I agree that relcache.c has absolutely no business doing that, but not sure what else to do instead. Seems like it might be better done at first touch of the MV in the parser, rewriter, or planner --- but the fact that I can't immediately decide which of those is right makes me feel that it's still too squishy. I think we shouldn't be doing that at all. The whole business of transferring the relation-is-invalid information from the relation to a pg_class flag seems like a Rube Goldberg device to me. I'm still not convinced that we should have a relation-is-invalid flag at all, but can we at least not have two? It seems perfectly adequate to detect that the MV is invalid when we actually try to execute a plan - that is, when we first access the heap or one of its indexes. So the bit can just live in the file-on-disk, and there's no need to have a second copy of it in pg_class. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: pgbench - aggregation of info written into log
On 01/17/2013 06:11 AM, Tomas Vondra wrote: Dne 17.01.2013 11:16, Magnus Hagander napsal: On Thu, Jan 17, 2013 at 11:09 AM, Dave Page dp...@pgadmin.org wrote: On Thu, Jan 17, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote: We have committed platform-specific features before, but generally only when it's not *possible* to do them for all platforms. For example the posix_fadvise stuff isn't available on Windows at all, so there isn't much we can do there. Right - having platform specific features for other reasons like lack of time is a slippery slope in my opinion. We should not get into such a habit or Windows will quickly become a second class platform as far as PostgreSQL features are concerned. Especially since there is no lack of time - the functionality is there, it just looks (significantly) different. Really? Any link to relevant docs or something? When doing some research in this field, the only option I was able to come up with was combining gettimeofday() with the timing functionality, and do something like this: 1) call gettimeofday() at thread start, giving a common unix timestamp 2) measure the time from the thread start using the conters (for each transaction) 3) combine those values This might of course give up to a second difference compared to the actual time (because of the gettimeofday precision), but IMHO that's fine. The link I posted showed a technique which essentially uses edge detection on gettimeofday() to get an accurate correspondence between the time of day and the performance counter, following which you can supposedly calculate the time of day with a high degree of accuracy just from the performance counter. You should be able to do this just once, at program start. If you don't care that much about the initial precision then your procedure should work fine, I think. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: Hi all, There is a strange bug with the latest master head (commit 7fcbf6a). When the WAL stream with a master is cut on a slave, slave returns a FATAL (well normal...), but then enters in recovery process and automatically promotes. Here are more details about the logs on slave (I simply killed the master manually): FATAL: could not receive data from WAL stream: cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00010004’: No such file or directory LOG: record with zero length at 0/401E1B8 LOG: redo done at 0/401E178 LOG: last completed transaction was at log time 2013-01-17 20:27:53.180971+09 cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0002.history’: No such file or directory LOG: selected new timeline ID: 2 cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/0001.history’: No such file or directory LOG: archive recovery complete DEBUG: resetting unlogged relations: cleanup 0 init 1 LOG: database system is ready to accept connections LOG: autovacuum launcher started DEBUG: archived transaction log file 00010004 DEBUG: archived transaction log file 0002.history LOG: statement: create table bn (a int); DEBUG: autovacuum: processing database postgres Slave does not try anymore to reconnect to master with messages of the type: FATAL: could not connect to the primary server I also noticed that there is some delay until modifications on master are visible on slave. For example run a simple CREATE TABLE and the new table is not [some bisecting later...] I think that bug has been introduced by commit 7fcbf6a. Before splitting xlog reading as a separate facility things worked correctly. There are also no delay problems before this commit. Ok, my inkling proved to be correct, its two related issues: a ) The error handling in XLogReadRecord is inconsistent, it doesn't always reset the necessary things. b) ReadRecord: We cannot not break out of the retry loop in readRecord just so, just removing the break seems correct. c) ReadRecord: (minor): We should log an error even if errormsg is not set, otherwise we wont jump out far enough. I think at least a) and b) is the result of merges between development of different people, sorry for that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CF3+4
On 01/17/2013 04:43 PM, Magnus Hagander wrote: On Jan 17, 2013 8:15 AM, Abhijit Menon-Sen a...@2ndquadrant.com mailto:a...@2ndquadrant.com wrote: At 2013-01-17 16:05:05 +0900, michael.paqu...@gmail.com mailto:michael.paqu...@gmail.com wrote: Is it really necessary to create a new commit fest just to move the items? Marking the patches that are considered as being too late for 9.3 should be just returned with feedback. Opening 2013-03 is not so much to move existing patches, but to give people a place to submit *new* post-9.3 patches without interrupting the 2013-01 CF. Yeah, and +1 for doing that. The sooner the better. By whichever of the time frames for the cf that had been discussed, it should certainly not be open for accepting new patches for 9.3 anymore. I've moved all pending patches from 2012-11 to 2013-01. I'll go through and poke them for aliveness and start chasing things up; in the mean time, any chance of closing 2012-11? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] review: pgbench - aggregation of info written into log
On 01/17/2013 06:04 AM, Tomas Vondra wrote: The problem is I have access to absolutely no Windows machines, not mentioning the development tools (and that I have no clue about it). I vaguely remember there were people on this list doing Windows development on a virtual machine or something. Any interest in working on this / giving me some help? One of the items on my very long TODO list (see stuff elsewhere about committers not doing enough reviewing and committing) is to create a package that can easily be run to set Windows Postgres development environments for MSVC, Mingw and Cygwin on Amazon, GoGrid etc. Once I get that done I'll be a lot less sympathetic to I don't have access to Windows pleas. Windows does run in a VM very well, but if you're going to set it up on your own VM environment, (e.h. VirtualBox or KVM/qemu) you need your own legal copy to install there. If you don't already have one, that will set you back about $140.00 (for w7 Pro) in the USA. Note that that's significantly better than the situation with OSX, which you can't run at all except on Apple hardware. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: pgbench - aggregation of info written into log
On Thu, Jan 17, 2013 at 1:29 PM, Andrew Dunstan and...@dunslane.net wrote: On 01/17/2013 06:04 AM, Tomas Vondra wrote: The problem is I have access to absolutely no Windows machines, not mentioning the development tools (and that I have no clue about it). I vaguely remember there were people on this list doing Windows development on a virtual machine or something. Any interest in working on this / giving me some help? One of the items on my very long TODO list (see stuff elsewhere about committers not doing enough reviewing and committing) is to create a package that can easily be run to set Windows Postgres development environments for MSVC, Mingw and Cygwin on Amazon, GoGrid etc. FYI, I have a similar item on my TODO list, though I was looking primarily at VC++ on AWS. It did get close to the top last week, but then I got busy with other things :-/. Anyway, I'd suggest we ping each other if either actually get started, to avoid duplication of effort. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: pgbench - aggregation of info written into log
On Thu, Jan 17, 2013 at 2:29 PM, Andrew Dunstan and...@dunslane.net wrote: On 01/17/2013 06:04 AM, Tomas Vondra wrote: The problem is I have access to absolutely no Windows machines, not mentioning the development tools (and that I have no clue about it). I vaguely remember there were people on this list doing Windows development on a virtual machine or something. Any interest in working on this / giving me some help? One of the items on my very long TODO list (see stuff elsewhere about committers not doing enough reviewing and committing) is to create a package that can easily be run to set Windows Postgres development environments for MSVC, Mingw and Cygwin on Amazon, GoGrid etc. Once I get that done I'll be a lot less sympathetic to I don't have access to Windows pleas. Windows does run in a VM very well, but if you're going to set it up on your own VM environment, (e.h. VirtualBox or KVM/qemu) you need your own legal copy to install there. If you don't already have one, that will set you back about $140.00 (for w7 Pro) in the USA. Note that that's significantly better than the situation with OSX, which you can't run at all except on Apple hardware. Yeah. I used to have an AMI with the VS environment preinstalled on Amazon, but I managed to fat finger things and delete it at some point and haven't really had time to rebuild it. Having a script that would download and install all the pre-requisites on such a box would be *great*. Then you could get up and going pretty quickly, and getting a Windows box up running for a few hours there is almost free, and you don't have to deal with licensing hassles. (Of course, the AMI method doesn't work all the way since you'd be distributing Visual Studio, but if we can have a script that auto-downloads-and-installs it as necessary we can get around that) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] small pg_basebackup display bug
On Sun, Dec 16, 2012 at 7:20 PM, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Sat, Dec 15, 2012 at 2:24 PM, Erik Rijkers e...@xs4all.nl wrote: That would make such a truncation less frequent, and after all a truncated display is not particular useful. Agreed - it's useful during testing, but not in a typical production use. It might actually be more useful if it's truncated in in the other end (keeping the last 30 instead of the first 30 chars) +1 for truncating from the left. I think pg_upgrade already does that in its progress messages. Fixed. I also fixed the output of the size parameter to be a fixed length, so the whole row doesn't shift left and right depending on how far long the process is. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] default SSL compression (was: libpq compression)
On Wed, Jan 2, 2013 at 3:17 PM, Magnus Hagander mag...@hagander.net wrote: On Wed, Jan 2, 2013 at 3:15 PM, Noah Misch n...@leadboat.com wrote: On Wed, Jan 02, 2013 at 02:03:20PM +0100, Magnus Hagander wrote: On Wed, Jan 2, 2013 at 1:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: So +1 for changing it to DEFAULT from me, too. There's no reason to think we know more about this than the OpenSSL authors. The DEFAULT value in OpenSSL 1.0 means ALL:!aNULL:!eNULL. Researching some more, this might cause a problem actually, which would explain some of the things that are in our default. For example, an ADH algorithm doesn't use certificates - but it uses DH parameters, so it likely won't work anyway. EDH uses certs, but also requires DH parameters. Maybe what we nede is DEFAULT:!ADH:@STRENGTH as the default? I understand aNULL to include ADH. Hmm. Seems you're right when I run a test on it, I was reading it wrong. The other difference is that our current string denies 40 and 56 bit encryptions (low and export strenghts). Do we stll want to do that? On the one hand, those seem bad to permit by default in 2013. On the other hand, if so, why hasn't OpenSSL removed them from DEFAULT? Perhaps it has backward-compatibility concerns that wouldn't apply to us by virtue of having disabled them for some time. Sounds reasonable to continue disabling them. So then the default would be DEFAULT:!LOW:!EXP:@STRENGTH (the @STRENGTH part is the sorting key for preference, which the default one seems not to have) The biggest difference being that we start from DEFAULT rather than ALL. I've applied a change that does this, including still rejecting MD5. Meaning that the difference is we start from DEFAULT instead of ALL (and the ADH rule is removed, since !aNULL is already in the default). -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] could not create directory ...: File exists
Greetings, We've come across this rather annoying error happening during our builds: ERROR: could not create directory pg_tblspc/25120/PG_9.3_201212081/231253: File exists It turns out that this is coming from copydir() when called by createdb() during a CREATE DATABASE .. FROM TEMPLATE where the template has tables in tablespaces. It turns out that createdb() currently only takes an AccessShareLock on pg_tablespace when scanning it with SnapshotNow, making it possible for a concurrent process to make some uninteresting modification to a tablespace (such as an ACL change) which will cause the heap scan in createdb() to see a given tablespace multiple times. copydir() will then, rightfully, complain that it's being asked to create a directory which already exists. Given that this is during createdb(), I'm guessing we don't have any option but to switch the scan to using ShareLock, since there isn't a snapshot available to do an MVCC scan with (I'm guessing that there could be other issues trying to do that anyway). Attached is a patch which does this and corrects the problem for us (of course, we're now going to go rework our build system to not modify tablespace ACLs, since with this patch concurrent builds end up blocking on each other, which is annoying). Thanks, Stephen colordiff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c new file mode 100644 index 1f6e02d..60a5099 *** a/src/backend/commands/dbcommands.c --- b/src/backend/commands/dbcommands.c *** createdb(const CreatedbStmt *stmt) *** 549,556 /* * Iterate through all tablespaces of the template database, and copy * each one to the new database. */ ! rel = heap_open(TableSpaceRelationId, AccessShareLock); scan = heap_beginscan(rel, SnapshotNow, 0, NULL); while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) { --- 549,563 /* * Iterate through all tablespaces of the template database, and copy * each one to the new database. + * + * We need to use ShareLock to prevent other processes from updating a + * tablespace (such as changing an ACL, for example) or we will end up + * running into the same tablespace multiple times during our heap scan + * (the prior-to-update tuple and then the new tuple after the update) + * and copydir() will, rightfully, complain that the directory already + * exists. */ ! rel = heap_open(TableSpaceRelationId, ShareLock); scan = heap_beginscan(rel, SnapshotNow, 0, NULL); while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) { *** createdb(const CreatedbStmt *stmt) *** 607,613 } } heap_endscan(scan); ! heap_close(rel, AccessShareLock); /* * We force a checkpoint before committing. This effectively means --- 614,620 } } heap_endscan(scan); ! heap_close(rel, ShareLock); /* * We force a checkpoint before committing. This effectively means signature.asc Description: Digital signature
Re: [HACKERS] Removing PD_ALL_VISIBLE
On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee pavan.deola...@gmail.com wrote: May be you've already addressed that concern with the proven performance numbers, but I'm not sure. It would be nice to hear what Heikki's reasons were for adding PD_ALL_VISIBLE in the first place. Jeff's approach of holding the VM pins for longer certainly mitigates some of the damage, in the sense that it reduces buffer lookups and pin/unpin cycles - and might be worth doing independently of the rest of the patch if we think it's a win. Index-only scans already use a similar optimization, so extending it to inserts, updates, and deletes is surely worth considering. The main question in my mind is whether there are any negative consequences to holding a VM buffer pin for that long without interruption. The usual consideration - namely, blocking vacuum - doesn't apply here because vacuum does not take a cleanup lock on the visibility map page, only the heap page, but I'm not sure if there are others. The other part of the issue is cache pressure. I don't think I can say it better than what Tom already wrote: # I'd be worried about the case of a lot of sessions touching a lot of # unrelated tables. This change implies doubling the number of buffers # that are held pinned by any given query, and the distributed overhead # from that (eg, adding cycles to searches for free buffers) is what you # ought to be afraid of. I agree that we ought to be afraid of that. A pgbench test isn't going to find a problem in this area because there you have a bunch of sessions all accessing the same small group of tables. To find a problem of the type above, you'd need lots of backends accessing lots of different, small tables. That's not the use case we usually benchmark, but I think there are a fair number of people doing things like that - for example, imagine a hosting provider or web application with many databases or schemas on a single instance. AFAICS, Jeff hasn't tried to test this scenario. Now, on the flip side, we should also be thinking about what we would gain from this patch, and what other ways there might be to achieve the same goals. As far as I can see, the main gain is that if you bulk-load a table, don't vacuum it right away, get all the hint bits set by some other mechanism, and then vacuum the table, you'll only read the whole table instead of rewriting the whole table. So ISTM that, for example, if we adopted the idea of making HOT prune set visibility-map bits, most of the benefit of this change evaporates, but whatever costs it may have will remain. There are other possible ways of getting the same benefit as well - for example, I've been thinking for a while now that we should try to find some judicious way of vacuuming insert-only tables, perhaps only in small chunks when there's nothing else going on. That wouldn't as clearly obsolete this patch, but it would address a very similar use case and would also preset hint bits, which would help a lot of people. Some of the ideas we've had about a HEAP_XMIN_FREEZE intersect with this idea, too - if we can do an early freeze without losing forensic information, we can roll setting the hint bit, setting PD_ALL_VISIBLE, and freezing the page into a single write. All of which is to say that I think this patch is premature. If we adopt something like this, we're likely never going to revert back to the way we do it now, and whatever cache-pressure or other costs this approach carries will be hear to stay - so we had better think awfully carefully before we do that. And, FWIW, I don't believe that there is sufficient time in this release cycle to carefully test this patch and the other alternative designs that aim toward the same ends. Even if there were, this is exactly the sort of thing that should be committed at the beginning of a release cycle, not the end, so as to allow adequate time for discovery of unforeseen consequences before the code ends up out in the wild. Of course, there's another issue here too, which is that as Pavan points out, the page throws crash-safety out the window, which breaks index-only scans (if you have a crash). HEAP_XLOG_VISIBLE is intended principally to protect the VM bit, not PD_ALL_VISIBLE, but the patch rips it out even though its purpose is to remove the latter, not the former. Removing PD_ALL_VISIBLE eliminates the need to keep the visibility-map bit consist with PD_ALL_VISIBLE, but it does not eliminate the need to keep PD_ALL_VISIBLE consistent with the page contents. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CF3+4
On Thu, Jan 17, 2013 at 8:17 AM, Craig Ringer cr...@2ndquadrant.com wrote: I've moved all pending patches from 2012-11 to 2013-01. I'll go through and poke them for aliveness and start chasing things up; in the mean time, any chance of closing 2012-11? Done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches
On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I'd prefer to leave the .partial suffix in place, as the segment really isn't complete. It doesn't make a difference when you recover to the latest timeline, but if you have a more complicated scenario with multiple timelines that are still alive, ie. there's a server still actively generating WAL on that timeline, you'll easily get confused. As an example, imagine that you have a master server, and one standby. You maintain a WAL archive for backup purposes with pg_receivexlog, connected to the standby. Now, for some reason, you get a split-brain situation and the standby server is promoted with new timeline 2, while the real master is still running. The DBA notices the problem, and kills the standby and pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in pg_receivexlog's target directory, and re-points pg_recevexlog to the master while he re-builds the standby server from backup. At that point, pg_receivexlog will start streaming from the end of the zero-padded segment, not knowing that it was partial, and you have a hole in the archived WAL stream. Oops. The DBA could avoid that by also removing the last WAL segment on timeline 1, the one that was partial. But it's really not obvious that there's anything wrong with that segment. Keeping the .partial suffix makes it clear. I shudder at the idea that the DBA is manually involved in any of this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches
On 17.01.2013 16:56, Robert Haas wrote: On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I'd prefer to leave the .partial suffix in place, as the segment really isn't complete. It doesn't make a difference when you recover to the latest timeline, but if you have a more complicated scenario with multiple timelines that are still alive, ie. there's a server still actively generating WAL on that timeline, you'll easily get confused. As an example, imagine that you have a master server, and one standby. You maintain a WAL archive for backup purposes with pg_receivexlog, connected to the standby. Now, for some reason, you get a split-brain situation and the standby server is promoted with new timeline 2, while the real master is still running. The DBA notices the problem, and kills the standby and pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in pg_receivexlog's target directory, and re-points pg_recevexlog to the master while he re-builds the standby server from backup. At that point, pg_receivexlog will start streaming from the end of the zero-padded segment, not knowing that it was partial, and you have a hole in the archived WAL stream. Oops. The DBA could avoid that by also removing the last WAL segment on timeline 1, the one that was partial. But it's really not obvious that there's anything wrong with that segment. Keeping the .partial suffix makes it clear. I shudder at the idea that the DBA is manually involved in any of this. The scenario I described is that you screwed up your failover environment, and end up with a split-brain situation by accident. The DBA certainly needs to be involved to recover from that. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
On 17.01.2013 16:53, Robert Haas wrote: On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee pavan.deola...@gmail.com wrote: May be you've already addressed that concern with the proven performance numbers, but I'm not sure. It would be nice to hear what Heikki's reasons were for adding PD_ALL_VISIBLE in the first place. The idea was to avoid clearing the bit in the VM page on every update, when the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is not set. I assumed the traffic and contention on the VM page would be a killer otherwise. I don't remember if I ever actually tested that though. Maybe I was worrying about nothing and hitting the VM page on every update is ok. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches
On Thu, Jan 17, 2013 at 9:59 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The scenario I described is that you screwed up your failover environment, and end up with a split-brain situation by accident. The DBA certainly needs to be involved to recover from that. OK, I agree, but I still think a lot of DBAs would have no idea how to handle that situation. I agree with your proposal, don't get me wrong - I just think there's still an awful lot of room for operator error in these more complex replication scenarios. I don't have a clue how to fix that, and it's certainly not the purpose of this thread to fix that; I'm just venting. Actually, I'm really glad to see all the work you've done to improve the way that some of these scenarios work and eliminate various bugs and other surprising failure modes over the last couple of months. It's great stuff. Alas, I think we still some distance from being able to provide an easy button. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 17.01.2013 15:05, Andres Freund wrote: On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: I think that bug has been introduced by commit 7fcbf6a. Before splitting xlog reading as a separate facility things worked correctly. There are also no delay problems before this commit. Ok, my inkling proved to be correct, its two related issues: a ) The error handling in XLogReadRecord is inconsistent, it doesn't always reset the necessary things. b) ReadRecord: We cannot not break out of the retry loop in readRecord just so, just removing the break seems correct. c) ReadRecord: (minor): We should log an error even if errormsg is not set, otherwise we wont jump out far enough. I think at least a) and b) is the result of merges between development of different people, sorry for that. Got a patch? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby conflict resolution handling
Pavan Deolasee pavan.deola...@gmail.com writes: On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote: ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that. I thought since FATAL will force the backend to exit, we don't care much about corrupted OpenSSL state. I even thought that's why we raise ERROR to FATAL so that the backend can start in a clean state. But clearly I'm missing a point here because you don't think that way. If we were to simply exit(1), leaving the kernel to close the client socket, it'd be safe enough because control would never have returned to OpenSSL. But this code doesn't do that. What we're looking at is that we've interrupted OpenSSL at some arbitrary point, and now we're going to make fresh calls to it to try to pump the FATAL error message out to the client. It seems fairly unlikely that that's safe. I'm not sure I credit Andres' worry of arbitrary code execution, but I do fear that OpenSSL could get confused to the point of freezing up, or even more likely that it would transmit garbage to the client, which rather defeats the purpose. Don't see a nice fix. The COMMERROR approach (ie, don't try to send anything to the client, only the log) is not nice at all since the client would get the impression that the server crashed. On the other hand, anything else requires waiting till we get control back from OpenSSL, which might be a long time, and meanwhile we're still holding locks that prevent WAL recovery from proceeding. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote: On 17.01.2013 15:05, Andres Freund wrote: On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: I think that bug has been introduced by commit 7fcbf6a. Before splitting xlog reading as a separate facility things worked correctly. There are also no delay problems before this commit. Ok, my inkling proved to be correct, its two related issues: a ) The error handling in XLogReadRecord is inconsistent, it doesn't always reset the necessary things. b) ReadRecord: We cannot not break out of the retry loop in readRecord just so, just removing the break seems correct. c) ReadRecord: (minor): We should log an error even if errormsg is not set, otherwise we wont jump out far enough. I think at least a) and b) is the result of merges between development of different people, sorry for that. Got a patch? Yes, I am just testing some scenarios with it, will send it after that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-17 16:23:44 +0100, Andres Freund wrote: On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote: On 17.01.2013 15:05, Andres Freund wrote: On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: I think that bug has been introduced by commit 7fcbf6a. Before splitting xlog reading as a separate facility things worked correctly. There are also no delay problems before this commit. Ok, my inkling proved to be correct, its two related issues: a ) The error handling in XLogReadRecord is inconsistent, it doesn't always reset the necessary things. b) ReadRecord: We cannot not break out of the retry loop in readRecord just so, just removing the break seems correct. c) ReadRecord: (minor): We should log an error even if errormsg is not set, otherwise we wont jump out far enough. I think at least a) and b) is the result of merges between development of different people, sorry for that. Got a patch? Yes, I am just testing some scenarios with it, will send it after that. Ok, the attached patch seems to fix a) and b). c) above is bogus, as explained in a comment in the patch. I also noticed that the TLI check didn't mark the last source as failed. Not a real issue and its independent from this patch but I found that when promoting from streaming rep the code first fails over to archive recovery and only then to recovering from pg_xlog. Is that intended? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 70cfabc..e33bd7a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3209,12 +3209,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, EndRecPtr = xlogreader-EndRecPtr; if (record == NULL) { - /* not all failures fill errormsg; report those that do */ - if (errormsg errormsg[0] != '\0') -ereport(emode_for_corrupt_record(emode, - RecPtr ? RecPtr : EndRecPtr), - (errmsg_internal(%s, errormsg) /* already translated */)); - lastSourceFailed = true; if (readFile = 0) @@ -3222,7 +3216,23 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, close(readFile); readFile = -1; } - break; + + /* + * We only end up here without a message when XLogPageRead() failed + * - in that case we already logged something. + * In StandbyMode that only happens if we have been triggered, so + * we shouldn't loop anymore in that case. + */ + if (errormsg == NULL) +break; + + + ereport(emode_for_corrupt_record(emode, + RecPtr ? RecPtr : EndRecPtr), + (errmsg_internal(%s, errormsg) /* already translated */)); + + /* don't make a timeline check */ + continue; } /* @@ -3234,6 +3244,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, XLogSegNo segno; int32 offset; + lastSourceFailed = true; + XLByteToSeg(xlogreader-latestPagePtr, segno); offset = xlogreader-latestPagePtr % XLogSegSize; XLogFileName(fname, xlogreader-readPageTLI, segno); @@ -3243,7 +3255,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, xlogreader-latestPageTLI, fname, offset))); - return false; + continue; } } while (StandbyMode record == NULL); diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index ff871a3..75164f6 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -222,11 +222,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord); if (readOff 0) - { - if (state-errormsg_buf[0] != '\0') - *errormsg = state-errormsg_buf; - return NULL; - } + goto err; /* * ReadPageInternal always returns at least the page header, so we can @@ -246,8 +242,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) { report_invalid_record(state, invalid record offset at %X/%X, (uint32) (RecPtr 32), (uint32) RecPtr); - *errormsg = state-errormsg_buf; - return NULL; + goto err; } if XLogPageHeader) state-readBuf)-xlp_info XLP_FIRST_IS_CONTRECORD) @@ -255,8 +250,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) { report_invalid_record(state, contrecord is requested by %X/%X, (uint32) (RecPtr 32), (uint32) RecPtr); - *errormsg = state-errormsg_buf; - return NULL; + goto err; } /* ReadPageInternal has verified the page header */ @@ -270,11 +264,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr, Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ)); if (readOff 0) - { - if
Re: [HACKERS] could not create directory ...: File exists
Stephen Frost sfr...@snowman.net writes: It turns out that createdb() currently only takes an AccessShareLock on pg_tablespace when scanning it with SnapshotNow, making it possible for a concurrent process to make some uninteresting modification to a tablespace (such as an ACL change) which will cause the heap scan in createdb() to see a given tablespace multiple times. copydir() will then, rightfully, complain that it's being asked to create a directory which already exists. Ugh. Still another problem with non-MVCC catalog scans. Given that this is during createdb(), I'm guessing we don't have any option but to switch the scan to using ShareLock, since there isn't a snapshot available to do an MVCC scan with (I'm guessing that there could be other issues trying to do that anyway). It seems that the only thing we actually use from each tuple is the OID. So there are other ways to fix it, of which probably the minimum-change one is to keep a list of already-processed tablespace OIDs and skip a tuple if we get a match in the list. This would be O(N^2) in the number of tablespaces, but I rather doubt that's a problem. [ thinks for a bit... ] Ugh, no, because the *other* risk you've got here is not seeing a row at all, which would be really bad. I'm not sure that transiently increasing the lock here is enough, either. The concurrent transactions you're worried about probably aren't holding their locks till commit, so you could get this lock and still see tuples with unstable committed-good states. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby conflict resolution handling
On 2013-01-17 10:19:23 -0500, Tom Lane wrote: Pavan Deolasee pavan.deola...@gmail.com writes: On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane t...@sss.pgh.pa.us wrote: ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that. I thought since FATAL will force the backend to exit, we don't care much about corrupted OpenSSL state. I even thought that's why we raise ERROR to FATAL so that the backend can start in a clean state. But clearly I'm missing a point here because you don't think that way. If we were to simply exit(1), leaving the kernel to close the client socket, it'd be safe enough because control would never have returned to OpenSSL. But this code doesn't do that. What we're looking at is that we've interrupted OpenSSL at some arbitrary point, and now we're going to make fresh calls to it to try to pump the FATAL error message out to the client. It seems fairly unlikely that that's safe. I'm not sure I credit Andres' worry of arbitrary code execution, but I do fear that OpenSSL could get confused to the point of freezing up, or even more likely that it would transmit garbage to the client, which rather defeats the purpose. I don't think its likely either, I seem to remember it copying arround function pointers though, so it seems possible with some bad luck. Don't see a nice fix. The COMMERROR approach (ie, don't try to send anything to the client, only the log) is not nice at all since the client would get the impression that the server crashed. On the other hand, anything else requires waiting till we get control back from OpenSSL, which might be a long time, and meanwhile we're still holding locks that prevent WAL recovery from proceeding. I think we can make openssl return pretty much immediately if we assume recv() can reliably interrupted by signals, possibly by setting the socket to nonblocking in the signal handler. We just need to tell openssl not to retry immediately and we should be fine. Given that quite some people use openssl with nonblocking sockets, that code path should be reasonably safe. That still requires ugliness around saving the error and reraising it after returning from openssl though... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Materialized views WIP patch
On 16 January 2013 17:25, Thom Brown t...@linux.com wrote: On 16 January 2013 17:20, Kevin Grittner kgri...@mail.com wrote: Thom Brown wrote: Some weirdness: postgres=# CREATE VIEW v_test2 AS SELECT 1 moo; CREATE VIEW postgres=# CREATE MATERIALIZED VIEW mv_test2 AS SELECT moo, 2*moo FROM v_test2 UNION ALL SELECT moo, 3*moo FROM v_test2; SELECT 2 postgres=# \d+ mv_test2 Materialized view public.mv_test2 Column | Type | Modifiers | Storage | Stats target | Description --+-+---+-+--+- moo | integer | | plain | | ?column? | integer | | plain | | View definition: SELECT *SELECT* 1.moo, *SELECT* 1.?column?; You are very good at coming up with these, Thom! Will investigate. Can you confirm that *selecting* from the MV works as you would expect; it is just the presentation in \d+ that's a problem? Yes, nothing wrong with using the MV, or refreshing it: postgres=# TABLE mv_test2; moo | ?column? -+-- 1 |2 1 |3 (2 rows) postgres=# SELECT * FROM mv_test2; moo | ?column? -+-- 1 |2 1 |3 (2 rows) postgres=# REFRESH MATERIALIZED VIEW mv_test2; REFRESH MATERIALIZED VIEW But a pg_dump of the MV has the same issue as the view definition: -- -- Name: mv_test2; Type: MATERIALIZED VIEW; Schema: public; Owner: thom; Tablespace: -- CREATE MATERIALIZED VIEW mv_test2 ( moo, ?column? ) AS SELECT *SELECT* 1.moo, *SELECT* 1.?column? WITH NO DATA; A separate issue is with psql tab-completion: postgres=# COMMENT ON MATERIALIZED VIEW ^IIS This should be offering MV names instead of prematurely providing the IS keyword. -- Thom
Re: [HACKERS] Event Triggers: adding information
On Thu, Jan 17, 2013 at 5:18 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Now, if that's what it takes, I'll spend time on it. In which exact order do you want to be reviewing and applying that series of patches? Let's agree on which things we even want to do first. Here's my take: - adds ddl_command_trace and ddl_command_end events I think ddl_command_end is OK and I'm willing to commit that if extracted as its own patch. I think ddl_command_trace is unnecessary syntactic sugar. - causes those events to be called not only for actual SQL commands but also for things that happen, at present, to go through the same code path I think this is a bad idea, not only because, as I said before, it exposes internal implementation details, but also because it's probably not safe. As Noah (I believe, or maybe Tom), observed in a previous post, we've gone to a lot of work to make sure that nothing you do inside a trigger can crash the server, corrupt the catalogs, trigger elog() messages, etc. That applies in spades to DDL. Despite Tom's skepticism, I'm pretty sure that it's safe for us to execute trigger function calls either completely before or completely after we execute the DDL command. It should be no different than executing them as separate commands. But doing something in the middle of a command is something else altogether. Consider, for example, that ALTER TABLE does a bunch of crap internally to itself, and then recurses into ProcessUtility to do some more stuff. This is a terrible design, but it's what we've got. Are you absolutely sure that the intermediate state that exists after that first bunch of stuff is done and before those other things are done is a safe place to execute arbitrary user-provided code? I'm not. And the more places we add that recurse into ProcessUtility, or indeed, add event triggers in general, the more places we're going to add new failure modes. Fixing that is part of the price of adding a new event trigger and I'm OK with that. But I'm not OK with baking into the code the assumption that any current or future callers of ProcessUtility() should be prepared for arbitrary code execution. The code wasn't written with that in mind, and it will not end well. - adds additional magic variables to PL/pgsql to expose new information not previously exposed I'm not sure that I agree with the implementation of this, but I think I'm OK with the concept. I will review it when it is submitted in a form not intertwined with other things. - adds CONTEXT as an additional event-trigger-filter variable, along with TAG I'm OK with adding new event-trigger-filter variables, provided that they are done without additional run-time overhead for people who are not using event triggers; I think it's nice that we can support that. But I think this particular proposal is DOA because nothing except TOPLEVEL is actually safe. As for the patch not being well-thought-out, I'm sticking by that, too. Alvaro's comments about lock escalations should be enough to tickle your alarm bells, but there are plenty of other problems here, too. Here's the comment in the code where the lock problems are discussed: /* * Fill in the EventTriggerTargetOid for a DROP command and a * ddl_command_start Event Trigger: the lookup didn't happen yet and we * still want to provide the user with that information when possible. * * We only do the lookup if the command contains a single element, which is * often forced to be the case as per the grammar (see the drop_type: * production for a list of cases where more than one object per command is * allowed). * * We take no exclusive lock here, the main command will have to do the * name lookup all over again and take appropriate locks down after the * User Defined Code runs anyway, and the commands executed by the User * Defined Code will take their own locks. We only lock the objects here so * that they don't disapear under us in another concurrent transaction, * hence using ShareUpdateExclusiveLock. XXX this seems prone to lock * escalation troubles. */ My thinking is that the user code can do anything, even run a DROP command against the object that we are preparing ourselves to be dropping. I don't think we can bypass doing the lookup twice. Sure, but there's also the issue of getting the right answer. For example, suppose you are planning to drop a function. You do a name lookup on the name and call the event trigger. Meanwhile, someone else renames the function and creates a new one with the same name. After the event trigger returns, the actual drop code latches onto the one with the new name. Now, the event trigger ran with OID A and the actual object dropped was one with OID B. It's possible that KaiGai's RENAME refactoring in 9.3 has strengthened the locking enough that this particular failure mode is no longer possible in that specific scenario, but I am fairly confident it would have been an
Re: [HACKERS] Event Triggers: adding information
On 2013-01-17 11:15:24 -0500, Robert Haas wrote: As a further example, suppose that in 9.4 (or 9.17) we add a command DROP TABLES IN SCHEMA fred WHERE name LIKE 'bob%'. Well, the logging trigger still just works (because it's only writing the statement, without caring about its contents). And the replication trigger still just works too (because it will still get a callback for every table that gets dropped, even though it knows nothing about the new syntax). That's very powerful. Of course, there will still be cases where things have to change, but you can minimize it by clean definitions. It pains me that I've evidently failed to communicate this concept clearly despite a year or more of trying. Does that make sense? Is there some way I can make this more clear? The difference seems very clear-cut to me, but evidently I'm not conveying it well. Well, I didn't manage to follow the topic with the level of detail I would have liked to/should have so its very well possible that I missed a proposal of how to implement that concept in a way you like? With some squinting you can actually see the proposed ddl_trace callsite as exactly that kind of replication trigger, but with a worse name... Without piggy-backing on the internal commands generated - which currently seems to be achieved by passing everything through ProcessUtility, but could work in a quite different way - and reconstructing sql from that I don't immediately see how you want to do this? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 17.01.2013 17:42, Andres Freund wrote: Ok, the attached patch seems to fix a) and b). c) above is bogus, as explained in a comment in the patch. I also noticed that the TLI check didn't mark the last source as failed. This looks fragile: /* * We only end up here without a message when XLogPageRead() failed * - in that case we already logged something. * In StandbyMode that only happens if we have been triggered, so * we shouldn't loop anymore in that case. */ if (errormsg == NULL) break; I don't like relying on the presence of an error message to control logic like that. Should we throw in an explicit CheckForStandbyTrigger() check in the condition of that loop? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote: On 17.01.2013 17:42, Andres Freund wrote: Ok, the attached patch seems to fix a) and b). c) above is bogus, as explained in a comment in the patch. I also noticed that the TLI check didn't mark the last source as failed. This looks fragile: /* * We only end up here without a message when XLogPageRead() failed * - in that case we already logged something. * In StandbyMode that only happens if we have been triggered, so * we shouldn't loop anymore in that case. */ if (errormsg == NULL) break; I don't like relying on the presence of an error message to control logic like that. Should we throw in an explicit CheckForStandbyTrigger() check in the condition of that loop? I agree, I wasn't too happy about that either. But in some sense its only propagating state from XLogReadPage which already has dealt with the error and decided its ok. Its the solution closest to what happened in the old implementation, thats why I thought it would be halfway to acceptable. Adding the CheckForStandbyTrigger() in the condition would mean promotion would happen before all the available records are processed and it would increase the amount of stat()s tremendously. So I don't really like that either. Don't really have a good idea :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latex longtable format
On Sat, Jan 12, 2013 at 07:09:31PM -0500, Bruce Momjian wrote: I have received several earnest requests over the years for LaTeX 'longtable' output, and I have just implemented it based on a sample LaTeX longtable output file. I have called it 'latex-longtable' and implemented all the behaviors of ordinary latex mode. One feature is that in latex-longtable mode, 'tableattr' allows control over the column widths --- that seemed to be very important to the users. One requested change I made to the ordinary latex output was to suppress the line under the table title if border = 0 (default is border = 1). Patch and sample output attached. I would like to apply this for PG 9.3. Modified patch applied. I didn't need to modify our existing 'latex' output format, except to add border=3 support. It was tempting to suggest renaming our 'latex' output format to 'latex-tabular', because that is what it uses, but 'tabular' is a package included by default in Latex, while 'longtable' requires two additional packages to be specified, so I resisted suggesting it. I did document that 'latex' uses 'tabular'. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 17.01.2013 18:42, Andres Freund wrote: On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote: On 17.01.2013 17:42, Andres Freund wrote: Ok, the attached patch seems to fix a) and b). c) above is bogus, as explained in a comment in the patch. I also noticed that the TLI check didn't mark the last source as failed. This looks fragile: /* * We only end up here without a message when XLogPageRead() failed * - in that case we already logged something. * In StandbyMode that only happens if we have been triggered, so * we shouldn't loop anymore in that case. */ if (errormsg == NULL) break; I don't like relying on the presence of an error message to control logic like that. Should we throw in an explicit CheckForStandbyTrigger() check in the condition of that loop? I agree, I wasn't too happy about that either. But in some sense its only propagating state from XLogReadPage which already has dealt with the error and decided its ok. Its the solution closest to what happened in the old implementation, thats why I thought it would be halfway to acceptable. Adding the CheckForStandbyTrigger() in the condition would mean promotion would happen before all the available records are processed and it would increase the amount of stat()s tremendously. So I don't really like that either. I was thinking of the attached. As long as we check for CheckForStandbyTrigger() after the record == NULL check, we won't perform extra stat() calls on successful reads, only when we're polling after reaching the end of valid WAL. That seems acceptable. If we want to avoid even that, we could move the static 'triggered' variable from CheckForStandbyTrigger() out of that function, and check that in the loop. - Heikki diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 70cfabc..ac2b26b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3180,7 +3180,8 @@ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index, * try to read a record just after the last one previously read. * * If no valid record is available, returns NULL, or fails if emode is PANIC. - * (emode must be either PANIC, LOG) + * (emode must be either PANIC, LOG). In standby mode, retries until a valid + * record is available. * * The record is copied into readRecordBuf, so that on successful return, * the returned record pointer always points there. @@ -3209,12 +3210,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, EndRecPtr = xlogreader-EndRecPtr; if (record == NULL) { - /* not all failures fill errormsg; report those that do */ - if (errormsg errormsg[0] != '\0') -ereport(emode_for_corrupt_record(emode, - RecPtr ? RecPtr : EndRecPtr), - (errmsg_internal(%s, errormsg) /* already translated */)); - lastSourceFailed = true; if (readFile = 0) @@ -3222,7 +3217,20 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, close(readFile); readFile = -1; } - break; + + /* + * We only end up here without a message when XLogPageRead() failed + * - in that case we already logged something. + * In StandbyMode that only happens if we have been triggered, so + * we shouldn't loop anymore in that case. + */ + if (errormsg) +ereport(emode_for_corrupt_record(emode, + RecPtr ? RecPtr : EndRecPtr), + (errmsg_internal(%s, errormsg) /* already translated */)); + + /* Give up, or retry if we're in standby mode. */ + continue; } /* @@ -3234,6 +3242,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, XLogSegNo segno; int32 offset; + lastSourceFailed = true; + XLByteToSeg(xlogreader-latestPagePtr, segno); offset = xlogreader-latestPagePtr % XLogSegSize; XLogFileName(fname, xlogreader-readPageTLI, segno); @@ -3243,9 +3253,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, xlogreader-latestPageTLI, fname, offset))); - return false; + record = NULL; + continue; } - } while (StandbyMode record == NULL); + } while (StandbyMode record == NULL !CheckForStandbyTrigger()); return record; } diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index ff871a3..75164f6 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -222,11 +222,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord); if (readOff 0) - { - if (state-errormsg_buf[0] != '\0') - *errormsg = state-errormsg_buf; - return NULL; - } + goto err; /* *
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote: On 17.01.2013 18:42, Andres Freund wrote: On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote: On 17.01.2013 17:42, Andres Freund wrote: Ok, the attached patch seems to fix a) and b). c) above is bogus, as explained in a comment in the patch. I also noticed that the TLI check didn't mark the last source as failed. This looks fragile: /* * We only end up here without a message when XLogPageRead() failed * - in that case we already logged something. * In StandbyMode that only happens if we have been triggered, so * we shouldn't loop anymore in that case. */ if (errormsg == NULL) break; I don't like relying on the presence of an error message to control logic like that. Should we throw in an explicit CheckForStandbyTrigger() check in the condition of that loop? I agree, I wasn't too happy about that either. But in some sense its only propagating state from XLogReadPage which already has dealt with the error and decided its ok. Its the solution closest to what happened in the old implementation, thats why I thought it would be halfway to acceptable. Adding the CheckForStandbyTrigger() in the condition would mean promotion would happen before all the available records are processed and it would increase the amount of stat()s tremendously. So I don't really like that either. I was thinking of the attached. As long as we check for CheckForStandbyTrigger() after the record == NULL check, we won't perform extra stat() calls on successful reads, only when we're polling after reaching the end of valid WAL. That seems acceptable. Looks good to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Robert Haas robertmh...@gmail.com writes: - adds ddl_command_trace and ddl_command_end events I think ddl_command_end is OK and I'm willing to commit that if extracted as its own patch. I think ddl_command_trace is unnecessary syntactic sugar. Ok. Will prepare a non controversial patch for ddl_command_end. After having played with the patch, I happen to quite like ddl_command_trace, but I won't try to protect it from its death any more than that. - causes those events to be called not only for actual SQL commands but also for things that happen, at present, to go through the same code path I think this is a bad idea, not only because, as I said before, it exposes internal implementation details, but also because it's probably not safe. As Noah (I believe, or maybe Tom), observed in a previous post, we've gone to a lot of work to make sure that nothing you do inside a trigger can crash the server, corrupt the catalogs, trigger elog() messages, etc. That applies in spades to DDL. I naively though that doing CommandCounterIncrement() before running the event trigger would go a long enough way towards making the operation safe when the current code is calling back into ProcessUtility(). Despite Tom's skepticism, I'm pretty sure that it's safe for us to execute trigger function calls either completely before or completely after we execute the DDL command. It should be no different than executing them as separate commands. But doing something in the middle of a command is something else altogether. Consider, for example, that ALTER TABLE does a bunch of crap internally to itself, and then recurses into ProcessUtility to do some more stuff. This is a terrible design, but it's what we've got. Are you absolutely sure that the intermediate state that exists after that first bunch of stuff is done and before those other things are done is a safe place to execute arbitrary user-provided code? I'm not. And the more places we add that recurse into ProcessUtility, or indeed, add event triggers in general, the more places we're going to add new failure modes. While I hear you, I completely fail to understand which magic is going to make your other idea safer than this one. Namely, calling an event trigger named sql_drop rather than ddl_command_start from the same place in the code is certainly not helping at all. If your answer to that is how to implement it (e.g. keeping a queue of events for which to fire Event Triggers once we reach a safe point in the code), then why couldn't we do exactly the same thing under the other name? Fixing that is part of the price of adding a new event trigger and I'm OK with that. But I'm not OK with baking into the code the assumption that any current or future callers of ProcessUtility() should be prepared for arbitrary code execution. The code wasn't written with that in mind, and it will not end well. Again, while I agree with the general principle, it happens that the calls to the arbitrary code are explicit in ProcessUtility(), so that if the place is known to be unsafe it's easy enough to refrain from calling the user code. We do that already for CREATE INDEX CONCURRENTLY at ddl_command_end. - adds additional magic variables to PL/pgsql to expose new information not previously exposed I'm not sure that I agree with the implementation of this, but I think I'm OK with the concept. I will review it when it is submitted in a form not intertwined with other things. Fair enough. - adds CONTEXT as an additional event-trigger-filter variable, along with TAG I'm OK with adding new event-trigger-filter variables, provided that they are done without additional run-time overhead for people who are not using event triggers; I think it's nice that we can support that. But I think this particular proposal is DOA because nothing except TOPLEVEL is actually safe. Yeah, let's first see about that. My thinking is that the user code can do anything, even run a DROP command against the object that we are preparing ourselves to be dropping. I don't think we can bypass doing the lookup twice. Sure, but there's also the issue of getting the right answer. For example, suppose you are planning to drop a function. You do a name lookup on the name and call the event trigger. Meanwhile, someone else renames the function and creates a new one with the same name. After the event trigger returns, the actual drop code latches onto the one with the new name. Now, the event trigger ran with OID A and the actual object dropped was one with OID B. It's possible that KaiGai's Is your scenario even possible when we take a ShareUpdateExclusiveLock on he object before running the Event Trigger, in order to protect against exactly that scenario, as said in the comment? Maybe another lock should be taken. Which one? There's another issue here, too, which is that this change reintroduces a failure mode that I spent a lot of time
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 17.01.2013 18:55, Andres Freund wrote: On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote: I was thinking of the attached. As long as we check for CheckForStandbyTrigger() after the record == NULL check, we won't perform extra stat() calls on successful reads, only when we're polling after reaching the end of valid WAL. That seems acceptable. Looks good to me. Ok, committed. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: Slave does not try anymore to reconnect to master with messages of the type: FATAL: could not connect to the primary server I also noticed that there is some delay until modifications on master are visible on slave. I think that bug has been introduced by commit 7fcbf6a. Before splitting xlog reading as a separate facility things worked correctly. There are also no delay problems before this commit. Heikki committed a fix for at least the promotion issue, I didn't notice any problem with an increased delay, could you check again if you still see it? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] could not create directory ...: File exists
* Tom Lane (t...@sss.pgh.pa.us) wrote: Ugh. Still another problem with non-MVCC catalog scans. Indeed. It seems that the only thing we actually use from each tuple is the OID. Yes, that's true. So there are other ways to fix it, of which probably the minimum-change one is to keep a list of already-processed tablespace OIDs and skip a tuple if we get a match in the list. This would be O(N^2) in the number of tablespaces, but I rather doubt that's a problem. I did consider this option also but it felt like a lot of additional code to write and I wasn't entirely convinced it'd be worth it. It's very frustrating that we can't simply get a list of *currently valid* tablespaces with a guarantee that no one else is going to mess with that list while we process it. That's what MVCC would give us, but we can't rely on that yet.. If we agree that keeping a list and then skipping over anything on the list already seen, I can go ahead and code that up. [ thinks for a bit... ] Ugh, no, because the *other* risk you've got here is not seeing a row at all, which would be really bad. I'm not sure that I see how that could happen..? I agree that it'd be really bad if it did though. Or are you thinking if we were to take a snapshot and then walk the table? I'm not sure that transiently increasing the lock here is enough, either. The concurrent transactions you're worried about probably aren't holding their locks till commit, so you could get this lock and still see tuples with unstable committed-good states. If there are other transactiosn which have open locks against the table, wouldn't that prevent my being able to acquire ShareLock on it? Or put another way, how would they not hold their locks till commit or rollback? We do only look at tablespaces which exist for the database we're copying, as I recall, so any newly created tablespaces shouldn't impact this. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: Slave does not try anymore to reconnect to master with messages of the type: FATAL: could not connect to the primary server I also noticed that there is some delay until modifications on master are visible on slave. I think that bug has been introduced by commit 7fcbf6a. Before splitting xlog reading as a separate facility things worked correctly. There are also no delay problems before this commit. Heikki committed a fix for at least the promotion issue, I didn't notice any problem with an increased delay, could you check again if you still see it? I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-18 03:05:47 +0900, Fujii Masao wrote: On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-01-17 13:47:41 +0900, Michael Paquier wrote: Slave does not try anymore to reconnect to master with messages of the type: FATAL: could not connect to the primary server I also noticed that there is some delay until modifications on master are visible on slave. I think that bug has been introduced by commit 7fcbf6a. Before splitting xlog reading as a separate facility things worked correctly. There are also no delay problems before this commit. Heikki committed a fix for at least the promotion issue, I didn't notice any problem with an increased delay, could you check again if you still see it? I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 That's after the commit or before? Because in passing I think I noticed/fixed a bug that could cause exactly that problem... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tuplesort memory usage: grow_memtuples
Peter Geoghegan pe...@2ndquadrant.com writes: On 8 December 2012 14:41, Andres Freund and...@2ndquadrant.com wrote: Is anybody planning to work on this? There hasn't been any activity since the beginning of the CF and it doesn't look like there is much work left? I took another look at this. Applied with some changes: * Use float8 arithmetic to prevent intermediate-result overflow, as I suggested last night. * Rearrange the subsequent checks so that they provide bulletproof clamping behavior on out-of-range newmemtupsize values; this allows dropping the very ad-hoc range limiting used in the previous patch (and in what I had last night). * Fix the check against availMem; it was supposed to be testing that the increment in the array size was within availMem. (In the original coding the increment was always the same as the original size, but not so much anymore.) * Allow the array size to grow to the MaxAllocSize limit, instead of punting if we'd exceed that. If we're attempting to use as much of work_mem as we possibly can, I don't see why that should be a reject case. * Improve the comments, including the existing one about the availMem check, since that evidently wasn't clear enough. * Copy the whole mess into tuplestore.c too. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH]Tablesample Submission
On 11/04/2012 07:22 PM, Qi Huang wrote: Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then.Thanks. Best RegardsHuang Qi VictorComputer Science of National University of Singapore Did you ever do the update of the patch? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote: Now that I look at the patch, I wonder if there is another fundamental issue with the patch. Since the patch removes WAL logging for the VM set operation, this can happen: Thank you. I think I was confused by this comment here: When we *set* a visibility map during VACUUM, we must write WAL. This may seem counterintuitive, since the bit is basically a hint: if it is clear, it may still be the case that every tuple on the page is visible to all transactions; we just don't know that for certain. The difficulty is that there are two bits which are typically set together: the PD_ALL_VISIBLE bit on the page itself, and the visibility map bit. If a crash occurs after the visibility map page makes it to disk and before the updated heap page makes it to disk, redo must set the bit on the heap page. Otherwise, the next insert, update, or delete on the heap page will fail to realize that the visibility map bit must be cleared, possibly causing index-only scans to return wrong answers. Which lead me to believe that I could just rip out the WAL-related code if PD_ALL_VISIBLE goes away, which is incorrect. But the incorrectness doesn't have to do with the WAL directly, it's because the VM page's LSN is not bumped past the LSN of the related heap page cleanup, so it can be written too early. Of course, the way to bump the LSN is to write WAL for the visibilitymap_set operation. But that would be a very simple WAL routine, rather than the complex one that exists without the patch. I suppose we could even try to bump the LSN without writing WAL somehow, but it doesn't seem worth reasoning through that (setting a VM bit is rare enough). Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] could not create directory ...: File exists
Stephen Frost sfr...@snowman.net writes: * Tom Lane (t...@sss.pgh.pa.us) wrote: [ thinks for a bit... ] Ugh, no, because the *other* risk you've got here is not seeing a row at all, which would be really bad. I'm not sure that I see how that could happen..? I agree that it'd be really bad if it did though. Or are you thinking if we were to take a snapshot and then walk the table? The case where you see a tuple twice is if an update drops a new version of a row beyond your seqscan, and then commits before you get to the new version. But if it drops the new version of the row *behind* your seqscan, and then commits before you get to the original tuple, you'll not see either row version as committed. Both of these cases are inherent risks in SnapshotNow scans. I'm not sure that transiently increasing the lock here is enough, either. The concurrent transactions you're worried about probably aren't holding their locks till commit, so you could get this lock and still see tuples with unstable committed-good states. If there are other transactiosn which have open locks against the table, wouldn't that prevent my being able to acquire ShareLock on it? What I'm worried about is that we very commonly release locks on catalogs as soon as we're done with the immediate operation --- not only read locks, but update locks as well. So there are lots of cases where locks are released before commit. It's possible that that never happens in operations applied to pg_tablespace, but I'd not want to bet on it. I wonder whether it's practical to look at the on-disk directories instead of relying on pg_tablespace for this particular purpose. Or maybe we should just write this off as a case we can't realistically fix before we have MVCC catalog scans. It seems that any other fix is going to be hopelessly ugly. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tuplesort memory usage: grow_memtuples
On 17 January 2013 18:22, Tom Lane t...@sss.pgh.pa.us wrote: Applied with some changes: Thank you. That feedback is useful. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] could not create directory ...: File exists
On 2013-01-17 13:46:44 -0500, Tom Lane wrote: Or maybe we should just write this off as a case we can't realistically fix before we have MVCC catalog scans. It seems that any other fix is going to be hopelessly ugly. ISTM we could just use a MVCC snapshot in this specific case? Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] could not create directory ...: File exists
* Tom Lane (t...@sss.pgh.pa.us) wrote: The case where you see a tuple twice is if an update drops a new version of a row beyond your seqscan, and then commits before you get to the new version. But if it drops the new version of the row *behind* your seqscan, and then commits before you get to the original tuple, you'll not see either row version as committed. Both of these cases are inherent risks in SnapshotNow scans. Ah, I see. If there are other transactiosn which have open locks against the table, wouldn't that prevent my being able to acquire ShareLock on it? What I'm worried about is that we very commonly release locks on catalogs as soon as we're done with the immediate operation --- not only read locks, but update locks as well. So there are lots of cases where locks are released before commit. It's possible that that never happens in operations applied to pg_tablespace, but I'd not want to bet on it. Fair enough. I wonder whether it's practical to look at the on-disk directories instead of relying on pg_tablespace for this particular purpose. So we'd traverse all of the tablespace directories and copy any directories which we come across which have the oid of the template database..? Sounds pretty reasonable, though I'm not sure that we'd want to back-patch a large change like that. Or maybe we should just write this off as a case we can't realistically fix before we have MVCC catalog scans. It seems that any other fix is going to be hopelessly ugly. I feel like we should be able to do better than what we have now, at least. Using ShareLock prevented the specific case that we were experiencing and is therefore MUCH better (for us, anyway) than released versions where we ran into the error on a regular basis. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] [PATCH]Tablesample Submission
On 17 January 2013 18:32, Josh Berkus j...@agliodbs.com wrote: On 11/04/2012 07:22 PM, Qi Huang wrote: Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then.Thanks. Best RegardsHuang Qi VictorComputer Science of National University of Singapore Did you ever do the update of the patch? We aren't just waiting for a rebase, we're waiting for Hitoshi's comments to be addressed. I would add to them by saying I am very uncomfortable with the idea of allowing a TABLESAMPLE clause on an UPDATE or a DELETE. If you really want that you can use a sub-select. Plus the patch contains zero documentation. So I can't see this going anywhere for 9.3. I've moved it to CF1 of 9.4 marked Waiting on Author -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
Tomas Vondra wrote: Hi, attached is a patch that improves performance when dropping multiple tables within a transaction. Instead of scanning the shared buffers for each table separately, the patch removes this and evicts all the tables in a single pass through shared buffers. Made some tweaks and pushed (added comments to new functions, ensure that we never try to palloc(0), renamed DropRelFileNodeAllBuffers to plural, made the use bsearch logic a bit simpler). Our system creates a lot of working tables (even 100.000) and we need to perform garbage collection (dropping obsolete tables) regularly. This often took ~ 1 hour, because we're using big AWS instances with lots of RAM (which tends to be slower than RAM on bare hw). After applying this patch and dropping tables in groups of 100, the gc runs in less than 4 minutes (i.e. a 15x speed-up). I'm curious -- why would you drop tables in groups of 100 instead of just doing the 100,000 in a single transaction? Maybe that's faster now, because you'd do a single scan of the buffer pool instead of 1000? (I'm assuming that in groups of means you do each group in a separate transaction) -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
On Thu, 2013-01-17 at 10:39 -0800, Jeff Davis wrote: On Thu, 2013-01-17 at 15:25 +0530, Pavan Deolasee wrote: Now that I look at the patch, I wonder if there is another fundamental issue with the patch. Since the patch removes WAL logging for the VM set operation, this can happen: Thank you. New patch attached with simple WAL logging. Regards, Jeff Davis rm-pd-all-visible-20130117.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] could not create directory ...: File exists
Stephen Frost sfr...@snowman.net writes: * Tom Lane (t...@sss.pgh.pa.us) wrote: Or maybe we should just write this off as a case we can't realistically fix before we have MVCC catalog scans. It seems that any other fix is going to be hopelessly ugly. I feel like we should be able to do better than what we have now, at least. Using ShareLock prevented the specific case that we were experiencing and is therefore MUCH better (for us, anyway) than released versions where we ran into the error on a regular basis. If it actually *reliably* fixed the problem, rather than just reducing the probabilities, that would mean that the updates your other sessions were doing didn't release RowExclusiveLock early. Have you dug into the code to see if that's true? (Or even more to the point, if it's still true in HEAD? I have no idea if all the recent refactoring has changed this behavior for GRANT.) My thought about a localized fix is similar to Andres' --- maybe we could take a snapshot and use a real MVCC scan. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
Alvaro Herrera alvhe...@2ndquadrant.com writes: Made some tweaks and pushed (added comments to new functions, ensure that we never try to palloc(0), renamed DropRelFileNodeAllBuffers to plural, made the use bsearch logic a bit simpler). FWIW, there's nothing particularly wrong with palloc(0) ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
On 17 January 2013 16:15, Robert Haas robertmh...@gmail.com wrote: As a further example, suppose that in 9.4 (or 9.17) we add a command DROP TABLES IN SCHEMA fred WHERE name LIKE 'bob%'. Well, the logging trigger still just works (because it's only writing the statement, without caring about its contents). And the replication trigger still just works too (because it will still get a callback for every table that gets dropped, even though it knows nothing about the new syntax). That's very powerful. Of course, there will still be cases where things have to change, but you can minimize it by clean definitions. It pains me that I've evidently failed to communicate this concept clearly despite a year or more of trying. Does that make sense? Is there some way I can make this more clear? The difference seems very clear-cut to me, but evidently I'm not conveying it well. That problem is going to happen in some way or another. It's not about avoiding the problem completely, but choosing a set of compromises. I'm yet to understand what your set of compromises is in detailed enough practical implementation terms (not just it's another set of events) and how it's a better compromise over all (I trust your judgement on that, I still want to understand). After all, the whole point of the Event Trigger patch is to expose some level of implementation details. I don't really agree with that. I think the point is to expose what the system is doing to the DBA. I'm OK with exposing the fact that creating a table with a serial column also creates a sequence. There is no problem with that - it's all good. What we DON'T want to do is set up a system where the fact that it creates a sequence is exposed because that happens to go through ProcessUtility() and the fact that it creates a constraint is not because that happens not to go through ProcessUtility(). That is not the sort of quality that our users have come to expect from PostgreSQL. The point, i.e. the main use case, is to replicate the DDL in a useful form. Discussions and reasoning need to focus on the main use case, not on weird futures or qualitative points. Yes, the discussion has gone on for years and it really should come to a useful conclusion sometime soon. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Teaching pg_receivexlog to follow timeline switches
Robert Haas escribió: Actually, I'm really glad to see all the work you've done to improve the way that some of these scenarios work and eliminate various bugs and other surprising failure modes over the last couple of months. It's great stuff. +1 -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] could not create directory ...: File exists
I wrote: Stephen Frost sfr...@snowman.net writes: I feel like we should be able to do better than what we have now, at least. Using ShareLock prevented the specific case that we were experiencing and is therefore MUCH better (for us, anyway) than released versions where we ran into the error on a regular basis. My thought about a localized fix is similar to Andres' --- maybe we could take a snapshot and use a real MVCC scan. BTW, it strikes me that the reason this particular SnapshotNow scan is a big problem for you is that, because we stop and copy a subdirectory after each tuple, the elapsed time for the seqscan is several orders of magnitude more than it is for most (perhaps all) other cases where SnapshotNow is used. So the window for trouble with concurrent updates is that much bigger. This seems to provide a reasonably principled argument why we might want to fix this case with a localized use of an MVCC scan before we have such a fix globally. Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY kind of annotation. We know we need the SnapshotNow scan fix. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 17.01.2013 20:08, Andres Freund wrote: On 2013-01-18 03:05:47 +0900, Fujii Masao wrote: I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 That's after the commit or before? Because in passing I think I noticed/fixed a bug that could cause exactly that problem... I think I broke that with the teach pg_receivexlog to cross timelines patch. Will take a look... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
On 17 January 2013 15:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 17.01.2013 16:53, Robert Haas wrote: On Thu, Jan 17, 2013 at 3:49 AM, Pavan Deolasee pavan.deola...@gmail.com wrote: May be you've already addressed that concern with the proven performance numbers, but I'm not sure. It would be nice to hear what Heikki's reasons were for adding PD_ALL_VISIBLE in the first place. The idea was to avoid clearing the bit in the VM page on every update, when the bit is known to not be set, ie. when the PD_ALL_VISIBLE flag is not set. I assumed the traffic and contention on the VM page would be a killer otherwise. I don't remember if I ever actually tested that though. Maybe I was worrying about nothing and hitting the VM page on every update is ok. Presumably we remember the state of the VM so we can skip the re-visit after every write? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
On Thu, 2013-01-17 at 17:14 +0200, Heikki Linnakangas wrote: I don't remember if I ever actually tested that though. Maybe I was worrying about nothing and hitting the VM page on every update is ok. I tried, but was unable to show really anything at all, even without keeping the VM page pinned. I think the bottleneck is elsewhere; although I am keeping the page pinned in this patch to prevent it from becoming a bottleneck. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH]Tablesample Submission
So I can't see this going anywhere for 9.3. I've moved it to CF1 of 9.4 marked Waiting on Author Agreed. I wish I'd noticed that it got lost earlier. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
Hi Jeff 2012/4/19 Jeff Davis pg...@j-davis.com: On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote: (...) This is just handwaving of course. I think some digging in the spatial-join literature would likely find ideas better than any of these. I will look in some more detail. The merge-like approach did seem to be represented in the paper referenced by Alexander (the external plane sweep), but it also refers to several methods based on partitioning. I'm beginning to think that more than one of these ideas has merit. Regards, Jeff Davis I'm perhaps really late in this discussion but I just was made aware of that via the tweet from Josh Berkus about PostgreSQL 9.3: Current Feature Status What is the reason to digg into spatial-joins when there is PostGIS being a bullet-proof and fast implementation? Yours, Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Simon Riggs si...@2ndquadrant.com writes: On 17 January 2013 16:15, Robert Haas robertmh...@gmail.com wrote: It pains me that I've evidently failed to communicate this concept clearly despite a year or more of trying. Does that make sense? Is there some way I can make this more clear? The difference seems very clear-cut to me, but evidently I'm not conveying it well. The point, i.e. the main use case, is to replicate the DDL in a useful form. Discussions and reasoning need to focus on the main use case, not on weird futures or qualitative points. I have to completely disagree with that. If we don't want PostgreSQL to soon subside into an unfixable morass, as I think Brooks puts it, we must *not* simply patch things in a way that expediently provides an approximation of some desired feature. We have to consider, and put substantial weight on, having a clean and maintainable system design. This is particularly the case if we're talking about a feature that is going to expose backend internals to users to any degree. We're either going to be constrained to not change those internals any more, or expect to break users' applications when we do change them, and neither result is very appealing. Especially not when we're talking about the structure of Postgres' DDL code, which can most charitably be described as it just grew, not as something that has any clear architecture to it. Now if we could quantify these things, that would be great, but AFAICS qualitative points is all we've got to go on. I think that we're not realistically going to be able to introduce event triggers in very many of the places we'd like to have them without first doing a lot of fundamental refactoring. And that is hard, dangerous, slow work that tends to break things in itself. As we've been repeatedly reminded in connection with KaiGai-san's refactoring patches. So my opinion is that most of what we'd like to have here is material for 9.4 or 9.5 or even further out. If we can get the event trigger catalog infrastructure and some *very* basic events, like end-of-toplevel-command, into place for 9.3, we'll have done well. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
On Thu, Jan 17, 2013 at 12:06 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Ok. Will prepare a non controversial patch for ddl_command_end. Thanks. I will make a forceful effort to review that in a timely fashion when it's posted. I think this is a bad idea, not only because, as I said before, it exposes internal implementation details, but also because it's probably not safe. As Noah (I believe, or maybe Tom), observed in a previous post, we've gone to a lot of work to make sure that nothing you do inside a trigger can crash the server, corrupt the catalogs, trigger elog() messages, etc. That applies in spades to DDL. I naively though that doing CommandCounterIncrement() before running the event trigger would go a long enough way towards making the operation safe when the current code is calling back into ProcessUtility(). It's something, but I'd be shocked if it covers all the bases. Let me try to give a concrete example of how I think another firing point could be made to work along the lines I'm suggesting. I'm not going to use DROP as an example because I think upon reflection that will be a particularly challenging case to make work correctly. Instead, let me use the example of ALTER. Goal: Every time an ALTER command is used on object *that actually exists*, we will call some user-defined function and pass the object type, the OID of the object, and some details about what sort of alteration the user has requested. Where shall we put the code to do this? Clearly, ProcessUtility is too early, because at that point we have not done the name lookup, locked the object, or checked permissions. So the OID of the target object is not and cannot be known at that point. We cannot do an extra name lookup at that point without taking a lock, because the answer might change, or, due to non-MVCC catalog access, the lookup might fail to find an object altogether. And we can't take a lock without checking permissions first. And the permissions-checking logic can't be done inside ProcessUtility(), because it's object-type specific and we don't want to duplicate it. So, instead, what we need to do is go into each function that implements ALTER, and modify it so that after it does the dance where we check permissions, take locks, and look up names, we call the trigger function. That's a bit of a nuisance, because we probably have a bunch of call sites to go fix, but in theory it should be doable. The problem of course, is that it's not intrinsically safe to call user-defined code there. If it drops the object and creates a new one, say, hilarity will probably ensue. But that should be a fixable problem. The only things we've done at this point are check permissions, lock, and look up names. I think we can assume that if the event trigger changes permissions, it's safe for the event trigger to ignore that. The trigger cannot have released the lock, so we're OK there. The only thing it can really have done that's problematic is change the results of the name lookup, say by dropping or renaming the object. But that's quite simple to protect against: after firing the trigger, we do a CommandCounterIncrement() and repeat the name lookup, and if we get a different answer, then we bail out with an error and tell the user they're getting far too cute. Then we're safe. Now, there is a further problem: all of the information about the ALTER statement that we want to provide to the trigger may not be available at that point. Simple cases like ALTER .. RENAME won't require anything more than that, but things like ALTER TABLE .. ADD FOREIGN KEY get tricky, because while at this point we have a solid handle on the identity of the relation to which we're adding the constraint, we do NOT yet have knowledge of or a lock on the TARGET relation, whose OID could still change on us up to much later in the computation. To get reliable information about *that*, we'll need to refactor the sequence of events inside ALTER TABLE. Hopefully you can see what I'm driving toward here. In a design like this, we can pretty much prove - after returning from the event trigger - that we're in a state no different from what would have been created by executing a series of commands - lock table, then SELECT event_trigger_func(), then the actual ALTER in sequence. Maybe there's a flaw in that analysis - that's what patch review is for - but it sounds pretty darn tight to me. I could write more, but I have to take the kids to dance class, and this email may be too long already. Does that help at all to clarify the lines along which I am thinking? Note that the above is clearly not ddl_command_start but something else, and the different firing point and additional safety cross-checks are what enable us to pass additional information to the trigger without fear of breaking either the trigger or the world. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent
Re: [HACKERS] [sepgsql 1/3] add name qualified creation label
2013/1/16 Robert Haas robertmh...@gmail.com: On Tue, Jan 15, 2013 at 3:02 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote: This patch adds sepgsql the feature of name qualified creation label. Background, on creation of a certain database object, sepgsql assigns a default security label according to the security policy that has a set of rules to determine a label of new object. Usually, a new object inherits its parent (e.g table is a parent of column) object's label, unless it has a particular type_transition rule in the policy. Type_transition rule allows to describe a particular security label as default label of new object towards a pair of client and parent object. For example, the below rule says columns constructed under the table labeled as sepgsql_table_t by client with staff_t will have staff_column_t, instead of table's label. TYPE_TRANSITION staff_t sepgsql_table_t:db_column staff_column_t; Recently, this rule was enhanced to take 5th argument for object name; that enables to special case handling exceptionally. It was originally designed to describe default security labels for files in /etc directory, because many application put its own configuration files here, thus, traditional type_transition rule was poor to describe all the needed defaults. On the other hand, we can port this concept of database system also. One example is temporary objects being constructed under the pg_temp schema. If we could assign a special default label on this, it allows unprivileged users (who cannot create persistent tables) to create temporary tables that has no risk of information leak to other users. Otherwise, we may be able to assign a special security label on system columns and so on. From the perspective of implementation on sepgsql side, all we need to do is replace old security_compute_create_raw() interface by new security_compute_create_name_raw(). If here is no name qualified type_transition rules, it performs as if existing API, so here is no backword compatible issue. This patch can be applied on the latest master branch. This looks OK on a quick once-over, but should it update the documentation somehow? Documentation does not take so much description for type_transition rules, so I just modified relevant description a bit to mention about type_transition rule may have argument of new object name optionally. In addition, I forgot to update minimum required version for libselinux; (it also takes change in configure script). These two are the point to be updated in documentation. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp sepgsql-v9.3-creation-label-with-name.v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Robert Haas robertmh...@gmail.com writes: Goal: Every time an ALTER command is used on object *that actually exists*, we will call some user-defined function and pass the object type, the OID of the object, and some details about what sort of alteration the user has requested. Ok, in current terms of the proposed patch, it's a ddl_command_end event trigger, and you can choose when to fire it in utility.c. The current place is choosen to be just after the AlterTable() call, because that sounded logical if you believe in CommandCounterIncrement. So, instead, what we need to do is go into each function that implements ALTER, and modify it so that after it does the dance where we check permissions, take locks, and look up names, we call the trigger function. That's a bit of a nuisance, because we probably have a bunch of call sites to go fix, but in theory it should be doable. The problem of course, is that it's not intrinsically safe to call user-defined code there. If it drops the object and creates a new one, say, hilarity will probably ensue. You're trying to find a dangerous point when to fire the trigger here, rather than trying to solve the problem you're describing. For the problem you're describing, either you want the Object Specific Information and the trigger fires at ddl_command_end, or you don't and you can register your trigger at ddl_command_start. If you want something else, well, it won't ship in 9.3. Now, there is a further problem: all of the information about the ALTER statement that we want to provide to the trigger may not be available at that point. Simple cases like ALTER .. RENAME won't require anything more than that, but things like ALTER TABLE .. ADD FOREIGN KEY get tricky, because while at this point we have a solid handle on the identity of the relation to which we're adding the constraint, we do NOT yet have knowledge of or a lock on the TARGET relation, whose OID could still change on us up to much later in the computation. To get reliable information about *that*, we'll need to refactor the sequence of events inside ALTER TABLE. The only valid answer here has already been given by Tom. You can only provide the information if it's available in the catalogs. So again, it's a ddl_command_end event. It can not happen before. Hopefully you can see what I'm driving toward here. In a design like this, we can pretty much prove - after returning from the event trigger - that we're in a state no different from what would have been created by executing a series of commands - lock table, then SELECT event_trigger_func(), then the actual ALTER in sequence. Maybe there's a flaw in that analysis - that's what patch review is for - but it sounds pretty darn tight to me. The only case when we need to do a lookup BEFORE actually running the command is when that command is a DROP, because that's the only timing when the information we want is still in the catalogs. So that's the only case where we do a double object lookup, and we take a ShareUpdateExclusiveLock lock on the object when doing so, so that we can't lose it from another concurrent activity. Note that the above is clearly not ddl_command_start but something else, and the different firing point and additional safety cross-checks are what enable us to pass additional information to the trigger without fear of breaking either the trigger or the world. That's the reason why we are only going to have ddl_command_start and ddl_command_end in 9.3, and also the reason why the extra information is only provided when this very information still exists in the catalogs at the moment when we want to expose it. And that's the reason why ddl_command_trace is attractive too: you won't ever get Object OID and Name and Schema from the event where the object does not exists, and maybe you want an easy way to tell the system you're interested into an event *with* the information, and be done, don't think. Again, I don't care much about keeping ddl_command_trace because it's not interesting much for implementing DDL support in logical replication and my other use cases, but still, I though I would explain it now that we're talking about it. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] HS locking broken in HEAD
Hi, While checking whether I could reproduce the replication delay reported by Michael Paquier I found this very nice tidbit: In a pretty trivial replication setup of only streaming replication I can currently easily reproduce this: standby# BEGIN;SELECT * FROM foo; BEGIN id | data +-- standby=# SELECT relation::regclass, locktype, mode FROM pg_locks; relation | locktype | mode --++- pg_locks | relation | AccessShareLock foo_pkey | relation | AccessShareLock foo | relation | AccessShareLock | virtualxid | ExclusiveLock | virtualxid | ExclusiveLock (5 rows) primary# DROP TABLE foo; DROP TABLE standby# SELECT relation::regclass, pid, locktype, mode FROM pg_locks; relation | pid | locktype |mode --+---++- pg_locks | 28068 | relation | AccessShareLock foo_pkey | 28068 | relation | AccessShareLock foo | 28068 | relation | AccessShareLock | 28068 | virtualxid | ExclusiveLock | 28048 | virtualxid | ExclusiveLock foo | 28048 | relation | AccessExclusiveLock (6 rows) standby# SELECT * FROM foo; ERROR: relation foo does not exist LINE 1: select * from foo; ^ Note the conflicting locks held on relation foo by 28048 and 28068. I don't immediately know which patch to blame here? Looks a bit like broken fastpath locking, but I don't immediately see anything in c00dc337b87 that should cause this? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 17.01.2013 21:57, Heikki Linnakangas wrote: On 17.01.2013 20:08, Andres Freund wrote: On 2013-01-18 03:05:47 +0900, Fujii Masao wrote: I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 That's after the commit or before? Because in passing I think I noticed/fixed a bug that could cause exactly that problem... I think I broke that with the teach pg_receivexlog to cross timelines patch. Will take a look... Ok, there was a couple of issues. First, as I suspected above, I added a new result set after copy has ended in START_STREAMING command, but forgot to teach walreceiver about it. Fixed that. Secondly, there's an interesting interaction between the new xlogreader code and streaming replication and timeline switches: The first call to the page_read callback in xlogreader is always made with size SizeOfXLogRecord, counting from the beginning of the page. That's bogus; there can be no WAL record in the very beginning of page, because of the page header, so I think that was supposed to be SizeXLogShortPHD. But that won't fix the issue. The problem is that XLogPageRead callback uses the page address and requested length to decide what timeline to stream from. For example, imagine that there's a timeline switch at offset 1000 in a page, and we have already read all the WAL up to that point. When xlogreader first asks to fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord), XLogPageRead deduces that that byte range is still on the old timeline, and starts streaming from the old timeline. Next, xlogreader needs the rest of the page, up to 1000 + SizeOfPageHeader, to read the first record it's actually interested in, XLogPageRead will return an error because that range is not on the timeline that's currently streamed. And we loop back to retry, and run into the same problem again. This interaction is a bit too subtle for my taste, but the straightforward fix is to just modify xlogreader so that the first read_page call requests all the bytes up to record we're actually interested in. That seems like a smart thing to do anyway. I committed a patch for that second issue too, but please take a look in case you come up with a better idea. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Tom Lane t...@sss.pgh.pa.us writes: I have to completely disagree with that. If we don't want PostgreSQL to soon subside into an unfixable morass, as I think Brooks puts it, we must *not* simply patch things in a way that expediently provides an approximation of some desired feature. We have to consider, and put substantial weight on, having a clean and maintainable system design. When in Ottawa next may, I will have to buy you a glass of your favorite wine and explain to you my main use case and vision. You will then realize that all this time that I've been spending on Event Triggers is a sideway to build something else entirely, because I know that it's the only way to get the feature I want in core PostgreSQL. So yes, I know about the clean and maintainable system design constraint. I've a lot to learn still, and I appreciate your help very much. Rest assured that the path you describe is the one I'm taking. I think that we're not realistically going to be able to introduce event triggers in very many of the places we'd like to have them without first doing a lot of fundamental refactoring. We're only talking about ddl_command_start and ddl_command_end now. The table_rewrite event is still on the horizon, but is not realistic to get in 9.3 anymore, I think :( So we're talking about adding some call points only in utility.c, only before or after a ddl command is run, including nested sub-commands. So my opinion is that most of what we'd like to have here is material for 9.4 or 9.5 or even further out. If we can get the event trigger catalog infrastructure and some *very* basic events, like end-of-toplevel-command, into place for 9.3, we'll have done well. My feeling is that I'm sending patches that only implement the *very* basic events here, and nothing more. Most of the heat of this thread came from a discussion about a feature that's very hard to design and that is not in my patch series to review. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ALTER command reworks
Kohei KaiGai escribió: This attached patch is the rebased one towards the latest master branch. Great, thanks. I played with it a bit and it looks almost done to me. The only issue I can find is that it lets you rename an aggregate by using ALTER FUNCTION, which is supposed to be forbidden. (Funnily enough, renaming a non-agg function with ALTER AGGREGATE does raise an error). Didn't immediately spot the right place to add a check. I think these two error cases ought to have regression tests of their own. I attach a version with my changes. I noticed that your proposed design also allows to unify ALTER code of OPERATOR CLASS / FAMILY; that takes index access method for its namespace in addition to name and namespace. Yeah, I had noticed that and was planning on implementing it later. Thanks for saving me the work. So, AlterObjectRename_internal and AlterObjectNamespace_internal have four of special case handling to check name / namespace conflict. Right. (I wonder if it would make sense to encapsulate all those checks in a single function for both to call.) The latest master lookups syscache within special case handing block as follows: [code] But, we already pulls a relevant tuple from syscache on top of AlterObjectNamespace_internal. So, I removed cache lookup code here. Silly me. Thanks. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services pgsql-v9.3-alter-reworks.3-rename.v10.patch.gz Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-17 23:49:22 +0200, Heikki Linnakangas wrote: On 17.01.2013 21:57, Heikki Linnakangas wrote: On 17.01.2013 20:08, Andres Freund wrote: On 2013-01-18 03:05:47 +0900, Fujii Masao wrote: I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 That's after the commit or before? Because in passing I think I noticed/fixed a bug that could cause exactly that problem... I think I broke that with the teach pg_receivexlog to cross timelines patch. Will take a look... Ok, there was a couple of issues. First, as I suspected above, I added a new result set after copy has ended in START_STREAMING command, but forgot to teach walreceiver about it. Fixed that. Secondly, there's an interesting interaction between the new xlogreader code and streaming replication and timeline switches: The first call to the page_read callback in xlogreader is always made with size SizeOfXLogRecord, counting from the beginning of the page. That's bogus; there can be no WAL record in the very beginning of page, because of the page header, so I think that was supposed to be SizeXLogShortPHD. But that won't fix the issue. Yea, thats clearly a typo. The problem is that XLogPageRead callback uses the page address and requested length to decide what timeline to stream from. For example, imagine that there's a timeline switch at offset 1000 in a page, and we have already read all the WAL up to that point. When xlogreader first asks to fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord), XLogPageRead deduces that that byte range is still on the old timeline, and starts streaming from the old timeline. Next, xlogreader needs the rest of the page, up to 1000 + SizeOfPageHeader, to read the first record it's actually interested in, XLogPageRead will return an error because that range is not on the timeline that's currently streamed. And we loop back to retry, and run into the same problem again. This interaction is a bit too subtle for my taste, but the straightforward fix is to just modify xlogreader so that the first read_page call requests all the bytes up to record we're actually interested in. That seems like a smart thing to do anyway. Yuck. Reading from targetRecOff onwards seems like a good idea, yes. But ISTM that the code isn't really safe now: Assume targetRecOff is 0 (which is valid) then we read up to SizeOfXLogRecord but assume that we can read both the xlog record as well as the page header. Then you also need to factor in that the page header can be differently big. And yes, its too subtle :( Do you want to fix that or shall I? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
On 17 January 2013 20:24, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 17 January 2013 16:15, Robert Haas robertmh...@gmail.com wrote: It pains me that I've evidently failed to communicate this concept clearly despite a year or more of trying. Does that make sense? Is there some way I can make this more clear? The difference seems very clear-cut to me, but evidently I'm not conveying it well. The point, i.e. the main use case, is to replicate the DDL in a useful form. Discussions and reasoning need to focus on the main use case, not on weird futures or qualitative points. I have to completely disagree with that. If we don't want PostgreSQL to soon subside into an unfixable morass, as I think Brooks puts it, we must *not* simply patch things in a way that expediently provides an approximation of some desired feature. We have to consider, and put substantial weight on, having a clean and maintainable system design. So why does focusing on a use case make this turn into an unfixable morass? The two things are completely unrelated. If the patch is anywhere close to an unfixable morass, let's just reject it now. But Robert's arguments were much more tenuous than that. My comments were in response to this I don't really agree with that. I think the point is to expose what the system is doing to the DBA. I'm OK with exposing the fact that creating a table with a serial column also creates a sequence. There is no problem with that - it's all good. What we DON'T want to do is set up a system where the fact that it creates a sequence is exposed because that happens to go through ProcessUtility() and the fact that it creates a constraint is not because that happens not to go through ProcessUtility(). That is not the sort of quality that our users have come to expect from PostgreSQL. The above functionality is sufficient to allow DDL replication. What else do we want to do that requires some other capability? Discussing things in terms of what we will actually use a feature for is how we know we've got it right. And if it does that, we don't need anything else. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Robert Haas robertmh...@gmail.com writes: Let me try to give a concrete example of how I think another firing point could be made to work along the lines I'm suggesting. [ snip description of how an event trigger might safely be fired just after identification and locking of the target object for an ALTER ] Reading that and reflecting on my gripe about the lack of any clear architecture for our DDL code, I had the beginnings of an idea. Perhaps it would improve matters if we refactored DDL processing so that there were separate parse analysis and execution phases, where parse analysis is (perhaps among other responsibilities) responsible for identifying and locking all objects to be used in the command. I note that locking the referenced tables is the responsibility of the parse analysis step in DML processing, so there's solid precedent for this. Also, we have some of this approach already for certain commands such as CREATE TABLE, cf parse_utilcmd.c. If we did that, then it'd be feasible to fire event triggers after the parse analysis step, and the rechecking that Robert describes could be encapsulated as redo the parse analysis and see if the result changed. It's not clear to me just how this ought to extend to the cascaded-DROP or inherited-table-ALTER cases, but hey, it's only the beginnings of an idea. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Dimitri Fontaine dimi...@2ndquadrant.fr writes: Tom Lane t...@sss.pgh.pa.us writes: I think that we're not realistically going to be able to introduce event triggers in very many of the places we'd like to have them without first doing a lot of fundamental refactoring. We're only talking about ddl_command_start and ddl_command_end now. The table_rewrite event is still on the horizon, but is not realistic to get in 9.3 anymore, I think :( So we're talking about adding some call points only in utility.c, only before or after a ddl command is run, including nested sub-commands. Well, that's already a problem, because as Robert keeps saying, what goes through utility.c and what doesn't is pretty random right at the moment, and we shouldn't expose that behavior to users for fear of not being able to change it later. I think we could possibly ship event triggers defined as start and end of a *top level* command and have them working reliably in 9.3. If you want users to be looking at subcommands, there is a lot more work to do there than we have any chance of getting done for 9.3 (if it is to ship in 2013, that is). Alternatively, if you want to get something into 9.3 that has not necessarily got a long-term-stable API, I'd be inclined to suggest that we forget about a SQL-level event trigger facility for now, and just put in some hook call points. It's pretty well established that we're willing to change the API seen by hook functions across major releases. TBH this might be the best thing anyway if your long-term goals have to do with replication, because it'd be a lot cheaper to get to the point where you could write the replication code and see if it all actually works. I'm fairly concerned that we might spend many man-months getting event triggers with definitions A,B,C into the system, only to find out later that what is really needed for logical replication is definitions D,E,F. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Simon Riggs si...@2ndquadrant.com writes: On 17 January 2013 20:24, Tom Lane t...@sss.pgh.pa.us wrote: My comments were in response to this I don't really agree with that. I think the point is to expose what the system is doing to the DBA. I'm OK with exposing the fact that creating a table with a serial column also creates a sequence. There is no problem with that - it's all good. What we DON'T want to do is set up a system where the fact that it creates a sequence is exposed because that happens to go through ProcessUtility() and the fact that it creates a constraint is not because that happens not to go through ProcessUtility(). That is not the sort of quality that our users have come to expect from PostgreSQL. The above functionality is sufficient to allow DDL replication. What else do we want to do that requires some other capability? I think you and Robert are talking past each other. Whether it is or is not sufficient for DDL replication is not what he is concerned about; what he's worried about is whether we can refactor the backend in future without causing a user-visible change in the events that event triggers see. I think that that is an entirely valid concern, and it's not going to go away on the strength of but this is enough to let us do replication. The other point I'd make here is what I just said to Dimitri: it's far from proven, AFAIK, that this actually *is* sufficient to allow DDL replication. I'd be a lot happier if we had a working proof of that before we lock down a SQL-level facility that we're going to have a hard time changing the details of. Maybe we should just introduce some C-level hooks and let you guys go off and code replication using the hooks, before we put in a lot of expensive infrastructure that might turn out to be Not Quite The Right Thing. After we *know* the hooks are in the right places and supply the right information, we can look at replacing them with some more formalized facility. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Tom Lane t...@sss.pgh.pa.us writes: Well, that's already a problem, because as Robert keeps saying, what goes through utility.c and what doesn't is pretty random right at the moment, and we shouldn't expose that behavior to users for fear of not being able to change it later. It didn't feel that random to me. In most cases, the rule is that if the system wants to execute a command that you could have typed as a user in the prompt, then it hacks together a parsenode and call ProcessUtility. The main exception to that rule is the CASCADE implementation, for reasons that you detailed in that thread earlier. I think we could possibly ship event triggers defined as start and end of a *top level* command and have them working reliably in 9.3. If you want users to be looking at subcommands, there is a lot more work to do there than we have any chance of getting done for 9.3 (if it is to ship in 2013, that is). By default, you only get top level in what I've been cooking. I think the other contexts are needed for the logical replication, and maybe it would be good enough if they are restricted to either only internal code or event triggers coded in C. What do you think? Alternatively, if you want to get something into 9.3 that has not necessarily got a long-term-stable API, I'd be inclined to suggest that we forget about a SQL-level event trigger facility for now, and just put in some hook call points. It's pretty well established that we're willing to change the API seen by hook functions across major releases. Or just a hook. That would want to have about the exact same amount of information as the Event Trigger anyway, so I'm thinking we could maybe do that the same way as the parsetree exposure? Another idea would be yet another GUC, ala allow_system_table_mods, so that only intrepid users can have at it. Well, as long as the logical replication use case is covered, I'm done here, so I want to hear from Simon and Andres on that (and maybe the Slony and Londiste guys, etc), and from you to triage what is possible and what is crazy. TBH this might be the best thing anyway if your long-term goals have to do with replication, because it'd be a lot cheaper to get to the point where you could write the replication code and see if it all actually works. I'm fairly concerned that we might spend many man-months getting event triggers with definitions A,B,C into the system, only to find out later that what is really needed for logical replication is definitions D,E,F. I'm worried about that too, and that's why I'm trying to remain general and quite transparent about the way the system actually works. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
On Thu, 2013-01-17 at 19:58 +, Simon Riggs wrote: Presumably we remember the state of the VM so we can skip the re-visit after every write? That was not a part of my patch, although I remember that you mentioned that previously and I thought it could be a good way to mitigate a problem if it ever came up. However, the tests I did didn't show any problem there. The tests were somewhat noisy, so perhaps I was doing something wrong, but it didn't appear that looking at the VM page for every update was a bottleneck. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing PD_ALL_VISIBLE
On Thu, 2013-01-17 at 09:53 -0500, Robert Haas wrote: The main question in my mind is whether there are any negative consequences to holding a VM buffer pin for that long without interruption. The usual consideration - namely, blocking vacuum - doesn't apply here because vacuum does not take a cleanup lock on the visibility map page, only the heap page, but I'm not sure if there are others. If the without interruption part becomes a practical problem, it seems fairly easy to fix: drop the pin and pick it up again once every K pages. Unless I'm missing something, this is a minor concern. The other part of the issue is cache pressure. I don't think I can say it better than what Tom already wrote: # I'd be worried about the case of a lot of sessions touching a lot of # unrelated tables. This change implies doubling the number of buffers # that are held pinned by any given query, and the distributed overhead # from that (eg, adding cycles to searches for free buffers) is what you # ought to be afraid of. I agree that we ought to be afraid of that. It's a legitimate concern, but I think being afraid goes to far (more below). A pgbench test isn't going to find a problem in this area because there you have a bunch of sessions all accessing the same small group of tables. To find a problem of the type above, you'd need lots of backends accessing lots of different, small tables. That's not the use case we usually benchmark, but I think there are a fair number of people doing things like that - for example, imagine a hosting provider or web application with many databases or schemas on a single instance. AFAICS, Jeff hasn't tried to test this scenario. The concern here is over a lot of different, small tables (e.g. multi-tenancy or something similar) as you say. If we're talking about nearly empty tables, that's easy to fix: just don't use the VM on tables less than N pages. You could say that small tables are really 1-10MB each, and you could have a zillion of those. I will try to create a worst-case here and see what numbers come out. Perhaps the extra time to look for a free buffer does add up. Test plan: 1. Take current patch (without skip VM check for small tables optimization mentioned above). 2. Create 500 tables each about 1MB. 3. VACUUM them all. 4. Start 500 connections (one for each table) 5. Time the running of a loop that executes a COUNT(*) on that connection's table 100 times. I think shared_buffers=64MB is probably appropriate. We want some memory pressure so that it has to find and evict pages to satisfy the queries. But we don't want it to be totally exhausted and unable to even pin a new page; that really doesn't tell us much except that max_connections is too high. Sound reasonable? Now, on the flip side, we should also be thinking about what we would gain from this patch, and what other ways there might be to achieve the same goals. One thing to keep in mind is that the current code to maintain a crash-safe PD_ALL_VISIBLE and VM bit is quite complex and doesn't play by the normal rules. If you want to talk about distributed costs, that has some very real distributed costs in terms of development effort. For instance, my checksums patch took me extra time to write (despite the patch being the simplest checksums design on the table) and will take others extra time to review. So, if the only things keeping that code in place are theoretical fears, let's take them one by one and see if they are real problems or not. All of which is to say that I think this patch is premature. If we adopt something like this, we're likely never going to revert back to the way we do it now, and whatever cache-pressure or other costs this approach carries will be hear to stay - so we had better think awfully carefully before we do that. What about this patch makes it hard to undo/rework in the future? Even if there were, this is exactly the sort of thing that should be committed at the beginning of a release cycle, not the end, so as to allow adequate time for discovery of unforeseen consequences before the code ends up out in the wild. I'm concerned that such a comment at this stage will cut review early, which could prevent also it from being committed early in 9.4. Of course, there's another issue here too, which is that as Pavan points out, the page throws crash-safety out the window My mistake. I believe that is already fixed, and certainly not a fundamental issue. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] HS locking broken in HEAD
On 2013-01-17 22:46:21 +0100, Andres Freund wrote: Hi, While checking whether I could reproduce the replication delay reported by Michael Paquier I found this very nice tidbit: In a pretty trivial replication setup of only streaming replication I can currently easily reproduce this: standby# BEGIN;SELECT * FROM foo; BEGIN id | data +-- standby=# SELECT relation::regclass, locktype, mode FROM pg_locks; relation | locktype | mode --++- pg_locks | relation | AccessShareLock foo_pkey | relation | AccessShareLock foo | relation | AccessShareLock | virtualxid | ExclusiveLock | virtualxid | ExclusiveLock (5 rows) primary# DROP TABLE foo; DROP TABLE standby# SELECT relation::regclass, pid, locktype, mode FROM pg_locks; relation | pid | locktype |mode --+---++- pg_locks | 28068 | relation | AccessShareLock foo_pkey | 28068 | relation | AccessShareLock foo | 28068 | relation | AccessShareLock | 28068 | virtualxid | ExclusiveLock | 28048 | virtualxid | ExclusiveLock foo | 28048 | relation | AccessExclusiveLock (6 rows) standby# SELECT * FROM foo; ERROR: relation foo does not exist LINE 1: select * from foo; ^ Note the conflicting locks held on relation foo by 28048 and 28068. I don't immediately know which patch to blame here? Looks a bit like broken fastpath locking, but I don't immediately see anything in c00dc337b87 that should cause this? Rather scarily this got broken in 96cc18eef6489cccefe351baa193f32f12018551. Yes, nearly half a year ago, including in 9.2.1+. How the heck could this go unnoticed so long? Not sure yet what the cause of this is. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
Dimitri Fontaine dimi...@2ndquadrant.fr writes: Tom Lane t...@sss.pgh.pa.us writes: Well, that's already a problem, because as Robert keeps saying, what goes through utility.c and what doesn't is pretty random right at the moment, and we shouldn't expose that behavior to users for fear of not being able to change it later. It didn't feel that random to me. In most cases, the rule is that if the system wants to execute a command that you could have typed as a user in the prompt, then it hacks together a parsenode and call ProcessUtility. Uh, no, there's no such rule. In some places it was convenient to do that so people did it --- but often it's easier to just call the appropriate function directly, especially if you have any special locking or permissions requirements. I don't think there's any great consistency to it. To be a little more concrete, I looked through backend/catalog and backend/commands, which ought to pretty much cover all the places where commands do things that could be considered subcommands. I find three uses of ProcessUtility: extension.c: uses ProcessUtility to handle non-DML commands read from an extension script. schemacmds.c: CreateSchemaCommand uses ProcessUtility for any subcommand of a CREATE SCHEMA statement. This is really just direct execution of something the user typed, it's not the system creating a subcommand. trigger.c: ConvertTriggerToFK uses ProcessUtility to execute a consed-up ALTER TABLE ADD FOREIGN KEY command in place of a series of legacy CREATE CONSTRAINT TRIGGER commands. Now this is the very definition of ugly, and shouldn't drive our design decisions --- but I think that any user event trigger would be darn surprised to find this being emitted, especially when the two preceding CREATE CONSTRAINT TRIGGERs hadn't done any such thing, and indeed hadn't created any constraint trigger. AFAICT, everything else in those directories is using direct calls and not going through utility.c. So the only other cases where a trigger in ProcessUtility would trap subcommands are where those subcommands are things that parse_utilcmd.c generated during expansion of a CREATE TABLE or such. And that whole area is something that I feel strongly is implementation detail we don't particularly want to expose to users. For instance, the fact that a UNIQUE clause in a CREATE TABLE works that way and not at some lower level is nothing but implementation artifact. [ pokes around a bit more... ] Having now actually looked at every call point of ProcessUtility in the current code, I find myself agreeing very much with Robert: we do *not* want to expose that exact set of events to users. Except for the original top-level commands, it's almost entirely implementation artifacts that can and will change from release to release. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] HS locking broken in HEAD
On 2013-01-17 23:56:16 +0100, Andres Freund wrote: On 2013-01-17 22:46:21 +0100, Andres Freund wrote: ^ Note the conflicting locks held on relation foo by 28048 and 28068. I don't immediately know which patch to blame here? Looks a bit like broken fastpath locking, but I don't immediately see anything in c00dc337b87 that should cause this? Rather scarily this got broken in 96cc18eef6489cccefe351baa193f32f12018551. Yes, nearly half a year ago, including in 9.2.1+. How the heck could this go unnoticed so long? That only made the bug more visible - the real bug is somewhere else. Which makes it even scarrier, the bug was in in the fast path locking patch from the start... It assumes conflicting fast path locks can only ever be in the same database as the the backend transfering the locks to itself. But thats obviously not true for the startup process which is in no specific database. I think it might also be a dangerous assumption for shared objects? A patch minimally addressing the is attached, but it only addresses part of the problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index f2cf5c6..b5240da 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -2458,8 +2458,13 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag * less clear that our backend is certain to have performed a memory * fencing operation since the other backend set proc-databaseId. So * for now, we test it after acquiring the LWLock just to be safe. + * + * But if we are the startup process we don't belong to a database but + * still need to lock objects in other databases, so we can do this + * optimization only in case we belong to a database. + * XXX: explain why this is safe for shared tables. */ - if (proc-databaseId != MyDatabaseId) + if (MyDatabaseId != InvalidOid proc-databaseId != MyDatabaseId) { LWLockRelease(proc-backendLock); continue; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao masao.fu...@gmail.com wrote: I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 I am seeing similar issues with master at 88228e6. This is easily reproducible by setting up 2 slaves under a master, then kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then you will notice that the timeline jump is not done. I don't know if Masao tried to put in sync the slave that reconnects to the promoted slave, but in this case slave2 stucks in potential state. That is due to timeline that has not changed on slave2 but better to let you know... The replication delays are still here. -- Michael Paquier http://michael.otacoo.com
Re: [HACKERS] could not create directory ...: File exists
* Tom Lane (t...@sss.pgh.pa.us) wrote: This seems to provide a reasonably principled argument why we might want to fix this case with a localized use of an MVCC scan before we have such a fix globally. I had discussed that idea a bit with Andres on IRC and my only concern was if there's some reason that acquiring a snapshot during createdb() would be problematic. It doesn't appear to currently and I wasn't sure if there'd be any issues. I'll start working on a patch for that. Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY kind of annotation. We know we need the SnapshotNow scan fix. Agreed. Thanks, Stephen signature.asc Description: Digital signature
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote: On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao masao.fu...@gmail.com wrote: I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 I am seeing similar issues with master at 88228e6. This is easily reproducible by setting up 2 slaves under a master, then kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then you will notice that the timeline jump is not done. Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got split off?). The replication delays are still here. That one is caused by this nice bug, courtesy of yours truly: diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 90ba32e..1174493 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8874,7 +8874,7 @@ retry: /* See if we need to retrieve more data */ if (readFile 0 || (readSource == XLOG_FROM_STREAM -receivedUpto = targetPagePtr + reqLen)) +receivedUpto targetPagePtr + reqLen)) { if (StandbyMode) { I didn't notice because I had a testscript inserting stuff continuously and it cause at most lagging by one record... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction
On 17.1.2013 20:19, Alvaro Herrera wrote: I'm curious -- why would you drop tables in groups of 100 instead of just doing the 100,000 in a single transaction? Maybe that's faster now, because you'd do a single scan of the buffer pool instead of 1000? (I'm assuming that in groups of means you do each group in a separate transaction) There's a limited number of locks, and each DROP acquires a lock (possibly more than one). Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] could not create directory ...: File exists
Stephen Frost sfr...@snowman.net writes: * Tom Lane (t...@sss.pgh.pa.us) wrote: This seems to provide a reasonably principled argument why we might want to fix this case with a localized use of an MVCC scan before we have such a fix globally. I had discussed that idea a bit with Andres on IRC and my only concern was if there's some reason that acquiring a snapshot during createdb() would be problematic. It doesn't appear to currently and I wasn't sure if there'd be any issues. Don't see what. The main reason we've not yet attempted a global fix is that the most straightforward way (take a new snapshot each time we start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE is so expensive that the cost of an extra snapshot there ain't gonna matter. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On Fri, Jan 18, 2013 at 8:48 AM, Andres Freund and...@2ndquadrant.comwrote: Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got split off?). Yes, it is reproducible before the xlog reader split. Just an additional report, the master jumps correctly to the new timeline. The replication delays are still here. That one is caused by this nice bug, courtesy of yours truly: diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 90ba32e..1174493 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8874,7 +8874,7 @@ retry: /* See if we need to retrieve more data */ if (readFile 0 || (readSource == XLOG_FROM_STREAM -receivedUpto = targetPagePtr + reqLen)) +receivedUpto targetPagePtr + reqLen)) { if (StandbyMode) { I didn't notice because I had a testscript inserting stuff continuously and it cause at most lagging by one record... This fix is indeed working. -- Michael Paquier http://michael.otacoo.com
[HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
On 2013-01-18 08:24:31 +0900, Michael Paquier wrote: On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao masao.fu...@gmail.com wrote: I encountered the problem that the timeline switch is not performed expectedly. I set up one master, one standby and one cascade standby. All the servers share the archive directory. restore_command is specified in the recovery.conf in those two standbys. I shut down the master, and then promoted the standby. In this case, the cascade standby should switch to new timeline and replication should be successfully restarted. But the timeline was never changed, and the following log messages were kept outputting. sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 sby2 LOG: restarted WAL streaming at 0/300 on timeline 1 sby2 LOG: replication terminated by primary server sby2 DETAIL: End of WAL reached on timeline 1 I am seeing similar issues with master at 88228e6. This is easily reproducible by setting up 2 slaves under a master, then kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then you will notice that the timeline jump is not done. I don't know if Masao tried to put in sync the slave that reconnects to the promoted slave, but in this case slave2 stucks in potential state. That is due to timeline that has not changed on slave2 but better to let you know... Ok, I know whats causing this now. Rather ugly. Whenever accessing a page in a segment we haven't accessed before we read the first page to do an extra bit of validation as the first page in a segment contains more information. Suppose timeline 1 ends at 0/6087088, xlog.c notices that WAL ends there, wants to read the new timeline, requests record 0/06087088. xlogreader wants to do its validation and goes back to the first page in the segment which triggers xlog.c to rerequest timeline1 to be transferred.. Heikki, any ideas? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: pgbench - aggregation of info written into log
On Thu, Jan 17, 2013 at 2:29 PM, Andrew Dunstan and...@dunslane.net wrote: On 01/17/2013 06:04 AM, Tomas Vondra wrote: The problem is I have access to absolutely no Windows machines, not mentioning the development tools (and that I have no clue about it). I vaguely remember there were people on this list doing Windows development on a virtual machine or something. Any interest in working on this / giving me some help? One of the items on my very long TODO list (see stuff elsewhere about committers not doing enough reviewing and committing) is to create a package that can easily be run to set Windows Postgres development environments for MSVC, Mingw and Cygwin on Amazon, GoGrid etc. Once I get that done I'll be a lot less sympathetic to I don't have access to Windows pleas. Windows does run in a VM very well, but if you're going to set it up on your own VM environment, (e.h. VirtualBox or KVM/qemu) you need your own legal copy to install there. If you don't already have one, that will set you back about $140.00 (for w7 Pro) in the USA. Note that that's significantly better than the situation with OSX, which you can't run at all except on Apple hardware. Yeah. I used to have an AMI with the VS environment preinstalled on Amazon, but I managed to fat finger things and delete it at some point and haven't really had time to rebuild it. Having a script that would download and install all the pre-requisites on such a box would be *great*. Then you could get up and going pretty quickly, and getting a Windows box up running for a few hours there is almost free, and you don't have to deal with licensing hassles. (Of course, the AMI method doesn't work all the way since you'd be distributing Visual Studio, but if we can have a script that auto-downloads-and-installs it as necessary we can get around that) So if my understating is correct, 1)Tomas Vondra commits to work on Windows support for 9.4, 2)on the assumption that one of Andrew Dunstan, Dave Page or Magnus Hagander will help him in Windows development. Ok? If so, I can commit the patch for 9.3 without Windows support. If not, I will move the patch to next CF (for 9.4). Please correct me if I am wrong. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch to add \watch to psql
I have adjusted this patch a little bit to take care of the review issues, along with just doing a bit of review myself. On Thu, Oct 25, 2012 at 2:25 AM, Will Leinweber w...@heroku.com wrote: Thanks for the reviews and comments. Responses inline: . On Sat, Oct 20, 2012 at 9:19 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: Maybe you should call it \repeat or something. I'm sure people would get around to using \watch that way eventually. :-) While I agree that clearing the screen would be nicer, I feel that watching the bottom of the screen instead of the top gets you 95% of the value of unix watch(1), and having the same name will greatly enhance discoverability. Perhaps later on if ncurses is linked for some other reason, this could take advantage of it then. That said, I'm not that strongly attached to the name one way or the other. The name \repeat has grown on me, but I haven't bothered renaming it for the time being. I think sameness with the familiar 'watch' program may not be such a big deal as I thought originally, but 'repeat' sounds a lot more like a kind of flow control for scripts, whereas \watch is more clearly for humans, which is the idea. On Wed, Oct 24, 2012 at 2:55 PM, Peter Eisentraut pete...@gmx.net wrote: This doesn't handle multiline queries: = \watch select 1 + ERROR: 42601: syntax error at end of input LINE 1: select + ^ I think to make it cooperate better with psql syntax, put the \watch at the end, as a replacement for \g, like I have implemented some kind of multi-line support. The rough part is in this part of the patch: + if (query_buf query_buf-len 0) + { + /* + * Check that the query in query_buf has been terminated. This + * is mostly consistent with behavior from commands like \g. + * The reason this is here is to prevent terrible things from + * occuring from submitting incomplete input of statements + * like: + * + * DELETE FROM foo + * \watch + * WHERE + * + * Wherein \watch will go ahead and send whatever has been + * submitted so far. So instead, insist that the user + * terminate the query with a semicolon to be safe. + */ + if (query_buf-data[query_buf-len - 1] == ';') What I found myself reaching for when giving up and writing this hack was a way to thread through the last lexer state of query_buf, which seems it could stand to be accrue a bit more information than being just a byte buffer. But this is the simplest possible thing, so I'll let others comment... -- fdr psql-watch-v2.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Event Triggers: adding information
On Thu, Jan 17, 2013 at 4:43 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Robert Haas robertmh...@gmail.com writes: Goal: Every time an ALTER command is used on object *that actually exists*, we will call some user-defined function and pass the object type, the OID of the object, and some details about what sort of alteration the user has requested. Ok, in current terms of the proposed patch, it's a ddl_command_end event trigger, and you can choose when to fire it in utility.c. The current place is choosen to be just after the AlterTable() call, /me scratches my head. Depends. What I described would fire before the command was executed, not after, so it's not quite the same thing, but it might be just as good for your purposes. It wasn't so much intended as we should do this exact thing as much as this is the kind of thought process that you need to go through if you want to safely run an event trigger in the middle of another command. Now, maybe we don't actually need to do that to serve your purposes. As already discussed, it seems like passing information to ddl_command_end is for most purposes sufficient for what you need - and we can do that without any special gymnastics, because that way it runs completely after the command, not in the middle. If I understand correctly, and I may not, there are basically two problems with that approach: (1) It's not exactly clear how to pass the information about the statement through to the ddl_command_end trigger. I know you've proposed a couple of things, but none of them seem completely satisfying. At least not to me. (2) If the method of transmission is the OIDs of the affected objects, which I believe is your current proposal, the ddl_command_end trigger can't go look those objects up in the catalog, because the catalog entries are already gone by that point. It's possible we could solve both of those problems without running event triggers in the middle of commands at all. In which case, my whole example is moot for now. But I think it would be smart to get ddl_command_end (without any additional information) committed first, and then argue about these details, so that we at least make some definable progress here. because that sounded logical if you believe in CommandCounterIncrement. I'm somewhat bemused by this comment. I don't think CommandCounterIncrement() is an article of faith. If you execute a command to completion, and do a CommandCounterIncrement(), then whatever you do next will look just like a new command, so it should be safe to run user-provided code there. But if you stop in the middle of a command, do a CommandCounterIncrement(), run some user-provided code, and then continue on, the CommandCounterIncrement() by itself protects you from nothing. If the code is not expecting arbitrary operations to be executed at that point, and you execute them, you get to keep both pieces. Indeed, there are some places in the code where inserting a CommandCounterIncrement() *by itself* could be unsafe. I don't believe that's a risk in ProcessUtility(), but that doesn't mean there aren't any risks in ProcessUtility(). So, instead, what we need to do is go into each function that implements ALTER, and modify it so that after it does the dance where we check permissions, take locks, and look up names, we call the trigger function. That's a bit of a nuisance, because we probably have a bunch of call sites to go fix, but in theory it should be doable. The problem of course, is that it's not intrinsically safe to call user-defined code there. If it drops the object and creates a new one, say, hilarity will probably ensue. You're trying to find a dangerous point when to fire the trigger here, Yes, I am. I'm not asking you to implement what I proposed there - I'm just giving an example of how to make a dangerous place safe. You've ALSO found a dangerous point when to fire the trigger. The difference is that my example dangerous point is probably fixable to be safe, whereas your actual dangerous point is probably unfixable, because there are many current paths of execution that go through that function in an extremely ad-hoc fashion, and there may be more in the future. ddl_command_start and ddl_command_end are safe enough *when restricted to toplevel commands*, but that's about as far as it goes. Perhaps there are other special cases that are also safe, but just throwing everything into the pot with no analysis and no future-proofing certainly isn't. Now, there is a further problem: all of the information about the ALTER statement that we want to provide to the trigger may not be available at that point. Simple cases like ALTER .. RENAME won't require anything more than that, but things like ALTER TABLE .. ADD FOREIGN KEY get tricky, because while at this point we have a solid handle on the identity of the relation to which we're adding the constraint, we do NOT yet have knowledge of or a lock on the