Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
On Mon, Nov 17, 2014 at 5:56 AM, Christian Ullrich ch...@chrullrich.net wrote:
> * Alvaro Herrera wrote:
>> Michael Paquier wrote:
>>> Btw, perhaps this diff should be pushed as a different patch as this is
>>> a rather different thing:
>>> -   if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED
>>> +   if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
>>>         !smgrexists(indexRelation->rd_smgr, INIT_FORKNUM)
>>> As this is a correctness fix, it does not seem necessary to back-patch
>>> it: the parent relation always has the same persistence as its indexes.
>> There was an argument for doing it this way that only applies if this
>> patch went in, but I can't remember now what it was. Anyway I pushed the
>> patch after some slight additional changes.
> Thanks!
> The buildfarm says -DCLOBBER_CACHE_ALWAYS does not like this patch. The
> complaint comes from jaguarundi, like here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2014-11-16%2015%3A33%3A23

As Tom mentioned, adding a new parameter to set the persistence through
RelationSetNewRelfilenode works:
http://www.postgresql.org/message-id/27238.1416073...@sss.pgh.pa.us
Patch, actually tested with CLOBBER_CACHE_ALWAYS, attached.

Regards,
--
Michael

From a1a7e4182bd1b5d66260220f040aee4a35e91c75 Mon Sep 17 00:00:00 2001
From: Michael Paquier mpaqu...@otacoo.com
Date: Mon, 17 Nov 2014 17:04:05 +
Subject: [PATCH] Fix CLOBBER_CACHE_ALWAYS broken by 85b506b

The previous commit, which removed ATChangeIndexesPersistence and replaced
it with a lower-level API able to define a new relpersistence for a
relation when reindexing it, failed with CLOBBER_CACHE_ALWAYS enabled.
This patch fixes that by setting the relpersistence of the newly-created
relation after its relfilenode is created, through a new parameter of
RelationSetNewRelfilenode that sets the persistence of the relation.
---
 src/backend/catalog/index.c        | 5 +----
 src/backend/commands/sequence.c    | 3 ++-
 src/backend/commands/tablecmds.c   | 6 ++++--
 src/backend/utils/cache/relcache.c | 6 +++---
 src/include/utils/relcache.h       | 4 +++-
 5 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b57aa95..825235a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3191,12 +3191,9 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence)
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
-	/* Set the relpersistence of the new index */
-	iRel->rd_rel->relpersistence = persistence;
-
 	/* We'll build a new physical relation for the index */
 	RelationSetNewRelfilenode(iRel, InvalidTransactionId,
-							  InvalidMultiXactId);
+							  InvalidMultiXactId, persistence);
 
 	/* Initialize the index and rebuild */
 	/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e5f7765..4f7f586 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -304,7 +304,8 @@ ResetSequence(Oid seq_relid)
 	 * Same with relminmxid, since a sequence will never contain multixacts.
 	 */
 	RelationSetNewRelfilenode(seq_rel, InvalidTransactionId,
-							  InvalidMultiXactId);
+							  InvalidMultiXactId,
+							  seq_rel->rd_rel->relpersistence);
 
 	/*
 	 * Insert the modified tuple into the new storage file.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 093224f..c57751e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1196,7 +1196,8 @@ ExecuteTruncate(TruncateStmt *stmt)
 			 * as the relfilenode value. The old storage file is scheduled for
 			 * deletion at commit.
 			 */
-			RelationSetNewRelfilenode(rel, RecentXmin, minmulti);
+			RelationSetNewRelfilenode(rel, RecentXmin, minmulti,
+									  rel->rd_rel->relpersistence);
 
 			if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
 				heap_create_init_fork(rel);
@@ -1209,7 +1210,8 @@ ExecuteTruncate(TruncateStmt *stmt)
 			if (OidIsValid(toast_relid))
 			{
 				rel = relation_open(toast_relid, AccessExclusiveLock);
-				RelationSetNewRelfilenode(rel, RecentXmin, minmulti);
+				RelationSetNewRelfilenode(rel, RecentXmin, minmulti,
+										  rel->rd_rel->relpersistence);
 				if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
 					heap_create_init_fork(rel);
 				heap_close(rel, NoLock);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2250c56..fa75771 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3008,7 +3008,7 @@ RelationBuildLocalRelation(const char *relname,
  */
 void
 RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid,
-						  MultiXactId minmulti)
+						  MultiXactId minmulti, char relpersistence)
 {
 	Oid			newrelfilenode;
Re: [HACKERS] postgres_fdw behaves oddly
Hi Fujita-san,

Here are comments for the postgres_fdw-syscol patch.

Sanity
The patch applies and compiles cleanly. The server regression tests and the
regressions in contrib/postgres_fdw and contrib/file_fdw run cleanly.

Code
---
1. Instead of the single-line comment "System columns can't be sent to
remote.", it might be better to explain why system columns can't be sent
to the remote side.
2. The comment in deparseVar is a single-line comment, so it should start
and end on the same line, i.e. /* and */ should be on the same line.
3. Since there is already a testcase which triggered this particular
change, it would be good to add it to the postgres_fdw regression tests.

On Mon, Nov 17, 2014 at 4:06 PM, Ashutosh Bapat
ashutosh.ba...@enterprisedb.com wrote:
> Hi Fujita-san,
> Here are my comments about the patch fscan_reltargetlist.patch
>
> Sanity
> The patch applies and compiles cleanly. Regressions in the test/regress
> folder and in postgres_fdw and file_fdw are clean.
>
> 1. This isn't your change, but we might be able to get rid of the
> assignment
>  2071     /* Are any system columns requested from rel? */
>  2072     scan_plan->fsSystemCol = false;
> since make_foreignscan() already does that. But I will leave it up to you
> whether to remove this assignment or not.
> 2. Instead of using rel->reltargetlist, we should use the tlist passed by
> the caller. This is the tlist expected from the Plan node. For foreign
> scans it will be the same as rel->reltargetlist. Using tlist would make
> the function consistent with the create_*scan_plan functions.
>
> Otherwise, the patch looks good.
>
> On Wed, Nov 12, 2014 at 12:55 PM, Etsuro Fujita
> fujita.ets...@lab.ntt.co.jp wrote:
>> (2014/11/11 18:45), Etsuro Fujita wrote:
>>> (2014/11/10 20:05), Ashutosh Bapat wrote:
>>>> Since two separate issues, 1. using reltargetlist instead of
>>>> attr_needed and 2. system columns usage in FDW, are being tackled
>>>> here, we should separate the patch into two, one for each of the
>>>> issues.
>>> Agreed. Will split the patch into two.
>>
>> Here are the split patches:
>> fscan-reltargetlist.patch - patch for #1
>> postgres_fdw-syscol.patch - patch for #2
>>
>> Thanks,
>>
>> Best regards,
>> Etsuro Fujita
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] WAL format and API changes (9.5)
On 11/14/2014 10:31 AM, Michael Paquier wrote:
> 5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)?
> +		XLogBeginInsert();
> +		XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
> +		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT);

See the comment of the xl_btree_newroot struct. It explains the record
format of the BTREE_NEWROOT record type:

 * Backup Blk 0: new root page (2 tuples as payload, if splitting old root)
 * Backup Blk 1: left child (if splitting an old root)
 * Backup Blk 2: metapage

When _bt_getroot creates a new root, there is no old root, but the same
record type is used in _bt_newroot, which uses block ID 1 to refer to the
old root page.

(Thanks for the comments, again! I'll post a new version soon)

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review of Refactoring code for sync node detection
On 16 November 2014 12:07, Michael Paquier michael.paqu...@gmail.com wrote:
> Let's work the multiple node thing once we have a better spec of how to
> do it, visibly using a dedicated micro-language within s_s_names.

Hmm, please make sure that is a new post. That is easily something I could
disagree with, even though I support the need for more functionality.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Review of Refactoring code for sync node detection
On Mon, Nov 17, 2014 at 10:00 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On 16 November 2014 12:07, Michael Paquier michael.paqu...@gmail.com wrote:
>> Let's work the multiple node thing once we have a better spec of how to
>> do it, visibly using a dedicated micro-language within s_s_names.
> Hmm, please make sure that is a new post. That is easily something I
> could disagree with, even though I support the need for more
> functionality.

Sure. I am not really on that yet though :)
--
Michael
Re: [HACKERS] Idle transaction cancel/timeout and SSL revisited
Andres Freund and...@2ndquadrant.com writes:
> On 2014-11-15 00:11:36 +0300, Alex Shulgin wrote:
>> After reading up through archives on the two $subj related TODO items
>> I'm under the impression that the patches[1,2] didn't make it mainly
>> because of the risk of breaking SSL internals if we try to longjmp out
>> of the signal handler in the middle of a blocking SSL read and/or if we
>> try to report that final "transaction/connection aborted" notice to the
>> client.
> I've written a patch that allows that - check
> http://archives.postgresql.org/message-id/20140927225421.GE5423%40alap3.anarazel.de
> I feel pretty confident that that's the way to go. I just need some time
> to polish it.
>> What if instead of trying to escape from the signal handler we would set
>> a flag and use it to simulate EOF after the recv() is restarted due to
>> EINTR? From the backend perspective it should look like the client has
>> just hung up.
> That's pretty much what the above does. Except that it actually deals
> with blocking syscalls by not using them and relying on latches/select
> instead.

Yay, that's nice, because my EOF approach is sort of fishy.

I've checked the patches: they apply and compile on top of current HEAD
(one failed hunk in pqcomm.c), so I can help with testing if needed.

There is a (short) list of items that caught my attention. I will post it
in reply to your patches thread.

--
Alex
Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
Michael Paquier wrote:
> The complaint comes from jaguarundi, like here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2014-11-16%2015%3A33%3A23
> As Tom mentioned, adding a new parameter to set the persistence through
> RelationSetNewRelfilenode works. Patch, actually tested with
> CLOBBER_CACHE_ALWAYS, attached.

I wrote an identical patch on Saturday and watched it pass the
CLOBBER_CACHE_ALWAYS test on Sunday. But the reason I didn't push is that
I couldn't understand *why* there is a problem here.

I imagine that the source of the issue is supposed to be a relcache
invalidation that takes place (in the original code) after reindex_index
changes relpersistence, and before RelationSetNewRelfilenode() creates the
filenode. But at what point does that code absorb invalidation messages?
Or is there a completely different mechanism that causes the problem? If
so, what?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
Alvaro Herrera wrote:
> I wrote an identical patch on Saturday and watched it pass the
> CLOBBER_CACHE_ALWAYS test on Sunday. But the reason I didn't push is
> that I couldn't understand *why* there is a problem here.
> I imagine that the source of the issue is supposed to be a relcache
> invalidation that takes place (in the original code) after reindex_index
> changes relpersistence, and before RelationSetNewRelfilenode() creates
> the filenode. But at what point does that code absorb invalidation
> messages? Or is there a completely different mechanism that causes the
> problem? If so, what?

Ah, it's the anti-collision stuff in GetNewOid(), isn't it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WAL format and API changes (9.5)
On 11/17/2014 03:22 PM, Heikki Linnakangas wrote:
> Here is an updated version of the patch. It's rebased over current
> master, and fixes a bunch of issues you and Michael pointed out, and a
> ton of other cleanup.

That patch had two extra, unrelated patches applied to it: one to use the
Intel CRC instruction for the CRC calculation, and another to speed up
PageRepairFragmentation by replacing the pg_qsort() call with a custom
sort algorithm. Sorry about that. (The latter is something that showed up
very high in a perf profile of replaying the WAL generated by pgbench -
I'll start a new thread about that.)

Here is the same patch without those extra things mixed in.

- Heikki

wal-format-and-api-changes-11.patch.gz
Description: application/gzip
Re: [HACKERS] Proposal : REINDEX SCHEMA
On Thu, Nov 6, 2014 at 8:00 PM, Fujii Masao masao.fu...@gmail.com wrote:
> On Fri, Oct 24, 2014 at 3:41 AM, Fabrízio de Royes Mello
> fabriziome...@gmail.com wrote:
>> On Wed, Oct 22, 2014 at 9:21 PM, Alvaro Herrera
>> alvhe...@2ndquadrant.com wrote:
>>> Sawada Masahiko wrote:
>>>> Thank you for reviewing. I agree 2) - 5). Attached patch is the latest
>>>> version, modified as above. Also, I noticed I had forgotten to add the
>>>> patch regarding the documentation of reindexdb.
>>> Please don't use pg_catalog in the regression test. That way we will
>>> need to update the expected file whenever a new catalog is added, which
>>> seems pointless. Maybe create a schema with a couple of tables
>>> specifically for this, instead.
>> Attached new regression test.
> Hunk #1 FAILED at 1.
> 1 out of 2 hunks FAILED -- saving rejects to file
> src/bin/scripts/t/090_reindexdb.pl.rej
> I tried to apply the 001 patch after applying the 000 one, but it did not
> apply cleanly.
> At least to me, it's more intuitive to use SCHEMA instead of ALL IN
> SCHEMA here, because we already use DATABASE instead of ALL IN DATABASE.

Thank you for reporting.

Um, that's one way of looking at it, I think. It's not quite new syntax,
and we have not used an ALL IN clause in the REINDEX command until now. If
we added an SQL command to reindex all tables in a tablespace as new
syntax, we might name it REINDEX TABLE ALL IN TABLESPACE.

Anyway, I created a patch for the REINDEX SCHEMA version, and attached it.
The problem with the patch failing to apply has also been solved.

Regards,

---
Sawada Masahiko

001_Add_schema_option_to_reindexdb_v4.patch
Description: Binary data
000_reindex_schema_v5.patch
Description: Binary data
Re: [HACKERS] snapshot too large error when initializing logical replication (9.4)
Andres Freund wrote:
> Hi,
> On 2014-10-25 18:09:36 -0400, Steve Singer wrote:
>> I sometimes get the error "snapshot too large" from my logical
>> replication walsender process in response to a CREATE_REPLICATION_SLOT.
> Yes. That's possible if 'too much' was going on until a consistent point
> was reached. I think we can just use a much larger size for the array if
> necessary. I've attached a patch for this. Could you try whether that
> helps? I don't have a testcase handy that reproduces the problem.

You haven't pushed this, have you?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] snapshot too large error when initializing logical replication (9.4)
On 2014-11-17 11:51:54 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>> Hi,
>> On 2014-10-25 18:09:36 -0400, Steve Singer wrote:
>>> I sometimes get the error "snapshot too large" from my logical
>>> replication walsender process in response to a CREATE_REPLICATION_SLOT.
>> Yes. That's possible if 'too much' was going on until a consistent
>> point was reached. I think we can just use a much larger size for the
>> array if necessary. I've attached a patch for this. Could you try
>> whether that helps? I don't have a testcase handy that reproduces the
>> problem.
> You haven't pushed this, have you?

No, but it's on my todo list.

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2
On Sat, Oct 25, 2014 at 1:50 AM, Amit Kapila amit.kapil...@gmail.com wrote:
> On Fri, Oct 24, 2014 at 4:05 PM, Andres Freund and...@2ndquadrant.com
> wrote:
>> On 2014-10-24 15:59:30 +0530, Amit Kapila wrote:
>>> and w.r.t. performance it can lead to an extra function call, a few
>>> checks, and I think in some cases it can even acquire/release a
>>> spinlock.
>> I fail to see how that could be the case.
> Won't it happen in case the first backend sets releaseOK to true and
> another backend which tries to wake up waiters on the lock acquires the
> spinlock and tries to release the waiters.
>> Sure, that can happen. And again, this is code that's only executed
>> around a couple syscalls. And the cacheline will be touched around
>> there *anyway*.
> Sure, but I think syscalls are required only in case we need to wake any
> waiter.
>> It won't wake up a waiter if there's none on the list.
> Yeah, but still it will acquire/release the spinlock.
>> And it'd be a pretty pointless behaviour, leading to uselessly
>> increased contention. The only time it'd make sense for X to be woken
>> up is when it gets run faster than the S processes.
> Do we get any major benefit by changing the logic of waking up waiters?
>> Yes.
> I think one downside I could see of the new strategy is that the chance
> of an exclusive waiter taking more time before getting woken up is
> increased, as now it will bypass exclusive waiters in the queue.
>> Note that that *already* happens for any *new* shared locker that comes
>> in. It doesn't really make sense to have shared lockers queued behind
>> the exclusive locker if others just go in front of it anyway.
> Yeah, but I think it is difficult to avoid that behaviour, as even when
> it wakes an exclusive locker, some new shared locker can come in and
> acquire the lock before the exclusive locker. I think it is difficult to
> say what the best waking strategy is, as the priority of exclusive
> lockers is not clearly defined in the case of LWLocks.

Andres, where are we with this patch?

1. You're going to commit it, but haven't gotten around to it yet.
2. You're going to modify it some more and repost, but haven't gotten
   around to it yet.
3. You're willing to see it modified if somebody else does the work, but
   are out of time to spend on it yourself.
4. Something else?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Escaping from blocked send() reprised.
Andres Freund and...@2ndquadrant.com writes:
> I've invested some more time in this:
> 0002 now makes sense on its own and doesn't change anything around the
> interrupt handling. Oh, and it compiles without 0003.

In this patch, the #endif appears to be misplaced in PostgresMain:

+	if (MyProcPort != NULL)
+	{
+#ifdef WIN32
+		pgwin32_noblock = true;
+#else
+		if (!pg_set_noblock(MyProcPort->sock))
+			ereport(COMMERROR,
+					(errmsg("could not set socket to nonblocking mode: %m")));
+	}
+#endif
+
+	pqinitmask();

> 0003 Sinval/notify processing got simplified further. There really isn't
> any need for DisableNotifyInterrupt/DisableCatchupInterrupt anymore.
> Also begin_client_read/client_read_ended don't make much sense anymore.
> Instead introduce ProcessClientReadInterrupt (which wants a better
> name).
> There's also a very WIP 0004 that allows secure_read/write to be
> interrupted when ProcDiePending is set. All of that happens via the
> latch mechanism, nothing happens inside signal handlers. So I do think
> it's quite an improvement over what's been discussed in this thread.
> But it (and the other approaches) do noticeably increase the likelihood
> of clients not getting the error message if the client isn't actually
> dead. The likelihood of write() being blocked *temporarily* due to
> normal bandwidth constraints is quite high when you consider COPY FROM
> and similar. Right now we'll wait till we can put the error message into
> the socket afaics.
> 1-3 need some serious comment work, but I think the approach is
> basically sound. I'm much, much less sure about allowing send() to be
> interrupted.

After re-reading these I don't see the rest of the items I wanted to
inquire about anymore, so it just makes more sense now.

One thing I did try is sending a NOTICE to the client when in
ProcessInterrupts() and DoingCommandRead is true. I think[1] it was
expected to be delivered instantly, but actually the client (psql) only
displays it after sending the next statement.

While I'm reading up on the FE/BE protocol, someone might want to share
his wisdom on this subject. My guess: psql blocks on the readline/libedit
call and can't effectively poll the server socket before getting complete
input from the user.

--
Alex

[1] http://www.postgresql.org/message-id/1262173040.19367.5015.camel@ebony
    "AFAIK, NOTICE was suggested because it can be sent at any time,
    whereas ERRORs are only associated with statements."
Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2
On 2014-11-17 10:21:04 -0500, Robert Haas wrote:
> Andres, where are we with this patch?
> 1. You're going to commit it, but haven't gotten around to it yet.
> 2. You're going to modify it some more and repost, but haven't gotten
> around to it yet.
> 3. You're willing to see it modified if somebody else does the work, but
> are out of time to spend on it yourself.
> 4. Something else?

I'm working on it. Amit had found a hang on PPC that I couldn't reproduce
on x86. Since then I've reproduced it and I think yesterday I found the
problem. Unfortunately it always took a couple hours to trigger...

I've also made some, in my opinion, cleanups to the patch since then.
Those have the nice side effect of making the size of struct LWLock
smaller, but that wasn't actually the intended effect.

I'll repost once I've verified the problem is fixed and I've updated all
the commentary.

The current problem is that I seem to have found a problem that's also
reproducible with master :(. After a couple of hours a
  pgbench -h /tmp -p 5440 scale3000 -M prepared -P 5 -c 180 -j 60 -T 2 -S
against a -c max_connections=200 -c shared_buffers=4GB cluster seems to
hang on PPC. With all the backends waiting in buffer mapping locks. I'm
now making sure it's really master and not my patch causing the problem -
it's just not trivial with 180 processes involved.

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] On partitioning
On Thu, Nov 13, 2014 at 9:12 PM, Stephen Frost sfr...@snowman.net wrote:
>>> I'm not a fan of using pg_class - there are a number of columns in
>>> there which I would *not* wish to be allowed to be different per
>>> partition (starting with relowner and relacl...). Making those NULL
>>> would be just as bad (probably worse, really, since we'd also need to
>>> add new columns to pg_class to indicate the partitioning...) as having
>>> a sparsely populated new catalog table.
>> I think you are, again, confused as to what we're discussing. Nobody,
>> including Alvaro, has proposed a design where the individual partitions
>> don't have pg_class entries of some kind. What we're talking about is
>> where to store the metadata for partition exclusion and tuple routing.
> This discussion has gone a few rounds before and, yes, I was just
> jumping into the middle of this particular round, but I'm pretty sure
> I'm not the first to point out that storing the individual partition
> information in pg_class isn't ideal, since there are pieces that we
> don't actually want to be different per partition, as I outlined
> previously. Perhaps what that means is we should actually go the other
> way and move *those* columns into a new catalog instead. Consider this
> (totally off-the-cuff):
>
> pg_relation (pg_tables? pg_heaps?)
>   oid
>   relname
>   relnamespace
>   reltype
>   reloftype
>   relowner
>   relam (?)
>   relhas*
>   relisshared
>   relpersistence
>   relkind (?)
>   relnatts
>   relchecks
>   relacl
>   reloptions
>   relhowpartitioned (?)
>
> pg_class
>   pg_relation.oid
>   relfilenode
>   reltablespace
>   relpages
>   reltuples
>   reltoastrelid
>   reltoastidxid
>   relfrozenxid
>   relhasindexes (?)
>   relpartitioninfo (whatever this ends up being)
>
> The general idea being to separate the user-facing notion of a table
> from the underlying implementation, with the implementation allowing
> multiple sets of files to be used for each table. It's certainly not for
> the faint of heart, but we saw what happened with our inheritance
> structure allowing different permissions on the child tables - we ended
> up creating a pretty grotty hack to deal with it (going through the
> parent bypasses the permissions). That's the best solution for that
> situation, but it's far from ideal, and it'd be nice to avoid the same
> risk with partitioning. Additionally, if each partition is in pg_class,
> how are we handling name conflicts? Why do individual partitions even
> need to have a name? Do we allow queries against them directly? etc.

There's certainly something to this, but "not for the faint of heart"
sounds like an understatement. One of the good things about inheritance is
that, if the system doesn't automatically do the right thing, there's
usually an escape hatch. If the INSERT trigger you use for tuple routing
is too slow, you can insert directly into the target partition. If your
query doesn't realize that it can prune away all the partitions but one,
or takes too long to do it, you can query directly against that partition.
These aren't beautiful things, and I'm sure we're all united in wanting a
mechanism that will reduce the need to do them, but we need to make sure
that we are removing the need for the escape hatch, and not just cementing
it shut. In other words, I don't think there is a problem with people
querying child tables directly; the problem is that people are forced to
do so in order to get good performance. We shouldn't remove the ability
for people to do that unless we're extremely certain we've fixed the
problem that leads them to wish to do so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2
On Mon, Nov 17, 2014 at 10:31 AM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-11-17 10:21:04 -0500, Robert Haas wrote:
>> Andres, where are we with this patch?
>> 1. You're going to commit it, but haven't gotten around to it yet.
>> 2. You're going to modify it some more and repost, but haven't gotten
>> around to it yet.
>> 3. You're willing to see it modified if somebody else does the work,
>> but are out of time to spend on it yourself.
>> 4. Something else?
> I'm working on it. Amit had found a hang on PPC that I couldn't
> reproduce on x86. Since then I've reproduced it and I think yesterday I
> found the problem. Unfortunately it always took a couple hours to
> trigger...
> I've also made some, in my opinion, cleanups to the patch since then.
> Those have the nice side effect of making the size of struct LWLock
> smaller, but that wasn't actually the intended effect.
> I'll repost once I've verified the problem is fixed and I've updated all
> the commentary.
> The current problem is that I seem to have found a problem that's also
> reproducible with master :(. After a couple of hours a
>   pgbench -h /tmp -p 5440 scale3000 -M prepared -P 5 -c 180 -j 60 -T 2 -S
> against a -c max_connections=200 -c shared_buffers=4GB cluster seems to
> hang on PPC. With all the backends waiting in buffer mapping locks. I'm
> now making sure it's really master and not my patch causing the problem
> - it's just not trivial with 180 processes involved.

Ah, OK. Thanks for the update.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] logical decoding - reading a user catalog table
On 11/16/2014 04:49 PM, Steve Singer wrote:
> I installed things following the above steps on a different system than
> my usual development laptop and I have been unable to reproduce the
> error so far (on that system).

But I am still able to reproduce it on occasion on my normal development
laptop. After continuously running the test I was eventually able to
reproduce the error on the other system as well.

Greetings,

Andres Freund
Re: [HACKERS] logical decoding - reading a user catalog table
On 2014-11-17 10:33:52 -0500, Steve Singer wrote:
> On 11/16/2014 04:49 PM, Steve Singer wrote:
>> I installed things following the above steps on a different system than
>> my usual development laptop and I have been unable to reproduce the
>> error so far (on that system).
> But I am still able to reproduce it on occasion on my normal development
> laptop.

Thanks for the testcase! I'll try to reproduce/fix.

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical decoding - reading a user catalog table
On 2014-11-13 22:23:02 -0500, Steve Singer wrote:
> On 11/13/2014 02:44 PM, Andres Freund wrote:
>> I've pushed a fix for a bug that could possibly also cause this.
>> Although it'd be odd that it always hits the user catalog table. Except
>> if your tests mostly modify the slony tables, but do not do much DDL
>> otherwise?
> The test I was running doesn't do DDL, so only the user catalog tables
> would have changes being tracked. I still sometimes get the error. When
> I get some time on the weekend I'll work on some detailed instructions
> on how to grab and set up the various components to replicate this test.
>
> Also, since updating (to 2c267e47afa4f9a7c) I've seen an assertion
> failure in a normal client connection, not the walsender:
>
> #3  0x006b4978 in GetSerializableTransactionSnapshotInt (
>     snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
>     sourcexid=sourcexid@entry=0) at predicate.c:1738
> #4  0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
>     at predicate.c:1517
> #5  GetSerializableTransactionSnapshot (
>     snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
> #6  0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200
> #7  0x006c0e35 in exec_simple_query (
>     query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
>     ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
>     \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
>     coalesce(ev_provider_xid,\"...") at postgres.c:959
> #8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
>
> I have no idea if this has anything to do with your recent changes or
> some other change. I haven't so far been able to replicate that since
> the first time I saw it.

That crash is decidedly odd. Any chance you still have the full backtrace
around? This is in the SSI code... I'm not immediately seeing how this
could be related to logical decoding. It can't even be an imported
snapshot, because the exported snapshot is marked repeatable read. Are you
actually using serializable transactions? If so, how and why?

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] BRIN page type identifier
The special area in a BRIN page looks like this: /* special space on all BRIN pages stores a type identifier */ #define BRIN_PAGETYPE_META 0xF091 #define BRIN_PAGETYPE_REVMAP 0xF092 #define BRIN_PAGETYPE_REGULAR 0xF093 ... typedef struct BrinSpecialSpace { uint16 flags; uint16 type; } BrinSpecialSpace; I believe this is supposed to follow the usual convention that the last two bytes of a page can be used to identify the page type. SP-GiST uses 0xFF82, while GIN uses values 0x00XX. However, because the special size is MAXALIGNed, the 'type' field is not the last 2 bytes on the page, as intended. I'd suggest just adding char padding[6] in BrinSpecialSpace, before 'flags'. That'll waste 4 bytes on 32-bit systems, but that seems acceptable. Also, the comment in GinPageOpaqueData needs updating: /* * Page opaque data in an inverted index page. * * Note: GIN does not include a page ID word as do the other index types. * This is OK because the opaque data is only 8 bytes and so can be reliably * distinguished by size. Revisit this if the size ever increases. * Further note: as of 9.2, SP-GiST also uses 8-byte special space. This is * still OK, as long as GIN isn't using all of the high-order bits in its * flags word, because that way the flags word cannot match the page ID used * by SP-GiST. */ BRIN now also uses 8-byte special space. While you're at it, might want to move that comment somewhere else, as it's really about a convention among all page types, not just GIN. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unintended restart after recovery error
On Thu, Nov 13, 2014 at 10:59 PM, Fujii Masao masao.fu...@gmail.com wrote: 442231d7f71764b8c628044e7ce2225f9aa43b6 introduced the latter rule for hot-standby case. Maybe *during crash recovery* (i.e., hot standby should not be enabled) it's better to treat the crash of startup process as a catastrophic crash. Maybe, but why, specifically? If the startup process failed internally, it's probably because it hit an error during the replay of some WAL record. So if we restart it, it will back up to the previous checkpoint or restartpoint, replay the same WAL records as before, and die again in the same spot. We don't want it to sit there and do that forever in an infinite loop, so it makes sense to kill the whole server. But if the startup process was killed off because the checkpointer croaked, that logic doesn't necessarily apply. There's no reason to assume that the replay of a particular WAL record was what killed the checkpointer; in fact, it seems like the odds are against it. So it seems right to fall back to our general principle of restarting the server and hoping that's enough to get things back on line. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Thu, Nov 13, 2014 at 7:27 PM, Simon Riggs si...@2ndquadrant.com wrote: On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote: On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs si...@2ndquadrant.com wrote: * only functions marked as CONTAINS NO SQL We don't really know what proisparallel is, but we do know what CONTAINS NO SQL means and can easily check for it. Plus I already have a patch for this, slightly bitrotted. Interestingly, I have a fairly solid idea of what proisparallel is, but I have no clear idea what CONTAINS NO SQL is or why it's relevant. I would imagine that srandom() contains no SQL under any reasonable definition of what that means, but it ain't parallel-safe. What is wrong in generating random numbers in parallel? If they're random, nothing, provided that you use a different seed in each backend. But if the seed has been set using srandom(), then we must generate the particular sequence of random numbers associated with that seed. If we don't, the behavior is incompatible with what happens apart from parallel mode. But I'm sure many volatile functions would be annoying to support, so CONTAINS NO SQL and STABLE/IMMUTABLE seems OK for the first thing. It might be OK to assume that immutable functions are fine, but not all stable functions look like they'll be safe, or not without additional effort: e.g. inet_client_addr(), pg_backend_pid(), pg_cursor(), pg_event_trigger_dropped_objects(). Or even, more mundanely, generate_series(), unless we ensure all calls are in the same backend. On the other hand, there are quite a few stable functions that it seems like we would want to allow in parallel queries, like to_char(), concat(), and textanycat(). I still don't know what CONTAINS NO SQL means. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] REINDEX CONCURRENTLY 2.0
On Fri, Nov 14, 2014 at 11:47 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-13 11:41:18 -0500, Robert Haas wrote: On Wed, Nov 12, 2014 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote: But I think it won't work realistically. We have a *lot* of infrastructure that refers to indexes using it's primary key. I don't think we want to touch all those places to also disambiguate on some other factor. All the relevant APIs are either just passing around oids or relcache entries. I'm not quite following this. The whole point is to AVOID having two indexes. You have one index which may at times have two sets of physical storage. Right. But how are we going to refer to these different relfilenodes? All the indexing infrastructure just uses oids and/or Relation pointers to refer to the index. How would you hand down the knowledge which of the relfilenodes is supposed to be used in some callchain? If you've got a Relation, you don't need someone to tell you which physical storage to use; you can figure that out for yourself by looking at the Relation. If you've got an OID, you're probably going to go conjure up a Relation, and then you can do the same thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 17 November 2014 07:31, Jeff Davis pg...@j-davis.com wrote: On Sat, 2014-11-15 at 21:36 +, Simon Riggs wrote: Do I understand correctly that we are trying to account for exact memory usage at palloc/pfree time? Why?? Not palloc chunks, only tracking at the level of allocated blocks (that we allocate with malloc). It was a surprise to me that accounting at that level would have any measurable impact, but Robert found a reasonable case on a POWER machine that degraded a couple percent. I wasn't able to reproduce it consistently on x86. Surprise to me also. Robert's tests showed a deviation of 0.4 sec after a restart. ISTM that we wouldn't see that every time. AFAIK the whole purpose of the memory allocator is to reduce the number of system calls, so if we are doing so many malloc() calls as to be noticeable just for accounting then something is wrong. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] double counting of lines in psql
This tiny change fixes what I think is a longstanding bug in psql. It causes the first line of every cell to be counted twice, whereas it should in fact be excluded from extra_lines / extra_row_output_lines. The bug appears to date back to commit 43ee2282 in 2008. Changing it appears to make my proposed pager_min_lines feature work as expected. So, should it be backpatched? It's a behaviour change, albeit that the existing behaviour is a bug, and will cause the pager to be invoked on output that is way too short (by about half a screen's height, I think). cheers andrew diff --git a/src/bin/psql/print.c b/src/bin/psql/print.c index 2e158b8..c93f744 100644 --- a/src/bin/psql/print.c +++ b/src/bin/psql/print.c @@ -837,7 +837,7 @@ print_aligned_text(const printTableContent *cont, FILE *fout) { unsigned int extra_lines; - extra_lines = (width - 1) / width_wrap[i] + nl_lines; + extra_lines = ((width - 1) / width_wrap[i]) + (nl_lines - 1); if (extra_lines > extra_row_output_lines) extra_row_output_lines = extra_lines; } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 17 November 2014 07:31, Jeff Davis pg...@j-davis.com wrote: To calculate the total memory used, I included a function MemoryContextMemAllocated() that walks the memory context and its children recursively. If we assume that we want the results accurate to within 5%, then we can work out a good sampling rate to achieve that accuracy. We would probably need a counting mechanism that didn't slow down O(N) as our allocations get bigger. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I'd like to throw community folks a question. Did someone have a discussion to the challenge of aggregate push-down across relations join in the past? It potentially reduces number of rows to be joined. If we already had, I'd like to check up the discussion at that time. Yes, I was looking at aggregate pushdown. I think it needs the same changes to aggregates discussed upthread. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] ltree::text not immutable?
Re: Tom Lane 2014-11-05 29132.1415144...@sss.pgh.pa.us I wrote: An alternative that just occurred to me is to put the no-volatile- I/O-functions check into CREATE TYPE, but make it just WARNING not ERROR. That would be nearly as good as an ERROR in terms of nudging people who'd accidentally omitted a volatility marking from their custom types. But we could leave chkpass as-is and wait to see if anyone complains hey, why am I getting this warning?. If we don't hear anyone complaining, maybe that means we can get away with changing the type's behavior in 9.6 or later. Attached is a complete patch along these lines. As I suggested earlier, this just makes the relevant changes in ltree--1.0.sql and pg_trgm--1.1.sql without bumping their extension version numbers, since it doesn't seem important enough to justify a version bump. I propose that we could back-patch the immutability-additions in ltree and pg_trgm, since they won't hurt anything and might make life a little easier for future adopters of those modules. The WARNING additions should only go into HEAD though. In HEAD, there's this WARNING now: WARNING: type input function chkpass_in should not be volatile From 66c029c842629958b3ae0d389f24ea3407225723: A fly in the ointment is that chkpass_in actually is volatile, because of its use of random() to generate a fresh salt when presented with a not-yet-encrypted password. This is bad because of the general assumption that I/O functions aren't volatile: the consequence is that records or arrays containing chkpass elements may have input behavior a bit different from a bare chkpass column. But there seems no way to fix this without breaking existing usage patterns for chkpass, and the consequences of the inconsistency don't seem bad enough to justify that. So for the moment, just document it in a comment. 
IMHO built-in functions (even if only in contrib) shouldn't be raising warnings - the user can't do anything about the warnings anyway, so they shouldn't get notified in the first place. (Caught by Debian's postgresql-common testsuite) Christoph -- c...@df7cb.de | http://www.df7cb.de/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 17 November 2014 16:01, Robert Haas robertmh...@gmail.com wrote: I still don't know what CONTAINS NO SQL means. It's a function marking that would indicate we aren't allowed to take snapshots or run SQL. I think you should publish the scheme for marking functions as safe for parallelism, so we can judge that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] ltree::text not immutable?
Christoph Berg c...@df7cb.de writes: In HEAD, there's this WARNING now: WARNING: type input function chkpass_in should not be volatile Yeah, that's intentional. IMHO built-in functions (even if only in contrib) shouldn't be raising warnings - the user can't do anything about the warnings anyway, so they shouldn't get notified in the first place. The point is to find out how many people care ... (Catched by Debian's postgresql-common testsuite) ... and by people, I do not mean test suites. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding - reading a user catalog table
On 11/17/2014 10:37 AM, Andres Freund wrote: On 2014-11-13 22:23:02 -0500, Steve Singer wrote: Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure in a normal client connection, not the walsender #3 0x006b4978 in GetSerializableTransactionSnapshotInt ( snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>, sourcexid=sourcexid@entry=0) at predicate.c:1738 #4 0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>) at predicate.c:1517 #5 GetSerializableTransactionSnapshot ( snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598 #6 0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200 #7 0x006c0e35 in exec_simple_query ( query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp, ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot), \"pg_catalog\".txid_snapshot_xmax(ev_snapshot), coalesce(ev_provider_xid,\"... at postgres.c:959 #8 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50, I have no idea if this has anything to do with your recent changes or some other change. I haven't so far been able to replicate that since the first time I saw it. That crash is decidedly odd. Any chance you still have the full backtrace around? Yes I still have the core file This is in the SSI code... I'm not immediately seeing how this could be related to logical decoding. It can't even be an imported snapshot, because the exported snapshot is marked repeatable read. Are you actually using serializable transactions? If so, how and why? Yes, the test client that performs the 'simulated workload' does some serializable transactions. It connects as a normal client via JDBC and does a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") and then does some DML. Why? Because it seemed like a good thing to include in the test suite. You're right, this might have nothing to do with logical decoding.
I haven't been able to reproduce again either, I can't even say if this problem was introduced to 9.4 in the past month or if it has been around much longer and I just haven't hit it before. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical decoding - reading a user catalog table
Hi, Kevin: CCed you, because it doesn't really look like a logical decoding related issue. On 2014-11-17 11:25:40 -0500, Steve Singer wrote: On 11/17/2014 10:37 AM, Andres Freund wrote: On 2014-11-13 22:23:02 -0500, Steve Singer wrote: Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure in a normal client connection, not the walsender #3 0x006b4978 in GetSerializableTransactionSnapshotInt ( snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>, sourcexid=sourcexid@entry=0) at predicate.c:1738 #4 0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>) at predicate.c:1517 #5 GetSerializableTransactionSnapshot ( snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598 #6 0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200 #7 0x006c0e35 in exec_simple_query ( query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp, ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot), \"pg_catalog\".txid_snapshot_xmax(ev_snapshot), coalesce(ev_provider_xid,\"... at postgres.c:959 #8 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50, I have no idea if this has anything to do with your recent changes or some other change. I haven't so far been able to replicate that since the first time I saw it. That crash is decidedly odd. Any chance you still have the full backtrace around? Yes I still have the core file Cool, could you show the full thing? Or did you just snip it because it's just the Assert/ExceptionalCondition thing? Could you print *snapshot in frame #3? This is in the SSI code... I'm not immediately seeing how this could be related to logical decoding. It can't even be an imported snapshot, because the exported snapshot is marked repeatable read. Are you actually using serializable transactions? If so, how and why? Yes, the test client that performs the 'simulated workload' does some serializable transactions.
It connects as a normal client via JDBC and does a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") and then does some DML. Why? Because it seemed like a good thing to include in the test suite. Yes, it's a good reason as the above backtrace proves ;). I'm just trying to understand under which circumstances it happens. The above backtrace looks like it can only be triggered if you do a BEGIN TRANSACTION SERIALIZABLE DEFERRABLE READ ONLY; Is that something you do? You're right, this might have nothing to do with logical decoding. I haven't been able to reproduce again either, I can't even say if this problem was introduced to 9.4 in the past month or if it has been around much longer and I just haven't hit it before. It's not hard to imagine that the safe/deferred snapshot code isn't all that well exercised. I don't think many people use it yet. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] ltree::text not immutable?
Re: Tom Lane 2014-11-17 6903.1416241...@sss.pgh.pa.us Christoph Berg c...@df7cb.de writes: In HEAD, there's this WARNING now: WARNING: type input function chkpass_in should not be volatile Yeah, that's intentional. IMHO built-in functions (even if only in contrib) shouldn't be raising warnings - the user can't do anything about the warnings anyway, so they shouldn't get notified in the first place. The point is to find out how many people care ... Is the point of this to figure out whether to fix this properly, or to revert to the old code? The current status isn't something that should be released with 9.5. (Caught by Debian's postgresql-common testsuite) ... and by people, I do not mean test suites. Well atm this breaks the building of 9.5 packages. This means we are not going to find out if anyone cares by this way :) Christoph -- c...@df7cb.de | http://www.df7cb.de/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 15.11.2014 22:36, Simon Riggs wrote: On 16 October 2014 02:26, Jeff Davis pg...@j-davis.com wrote: The inheritance is awkward anyway, though. If you create a tracked context as a child of an already-tracked context, allocations in the newer one won't count against the original. I don't see a way around that without introducing even more performance problems. This seems to have reached impasse, which is a shame. Let me throw out a few questions to see if that might help. Do I understand correctly that we are trying to account for exact memory usage at palloc/pfree time? Why?? Surely we just want to keep track of the big chunks? Does this need to be exact? How exact does it need to be? What do you mean by big chunks? Blocks, or really over-sized chunks (over 8kB)? I assume blocks, and that's what the proposed patches are doing. Or alternatively, can't we just sample the allocations to reduce the overhead? That might work, but I'm afraid it's trickier than it might seem. Are there some assumptions we can make about micro-contexts never needing to be tracked at all? What is a micro-context? Jeff seems not too bothered by inheritance, whereas Tomas thinks its essential. Perhaps there is a middle ground, depending upon the role of the context? Having a hierarchy of memory contexts and an accounting completely ignoring that hierarchy seems a bit strange to me. Maybe we can impose some restrictions, for example: (a) a single context with tracking explicitly requested (in the whole hierarchy) (b) when tracking is requested for a context, the parent context must not have tracking enabled Either of these restrictions would prevent a situation where a context has to update accounting for two parent contexts. That'd allow updating a single place (because each context has a single parent context with tracking requested). But I'm not sure that those restrictions are acceptable, that we can enforce them reliably, or how frequently that'd break external code (e.g. 
extensions with custom aggregates etc.). Because this is part of the public API, including the option to enable accounting for a memory context. Imagine you write an extension with a custom aggregate (or whatever) and it works just fine. Then we tweak something in the executor, requesting accounting for another context, and the extension suddenly breaks. Or maybe we don't change anything, and the aggregate works fine with Group Aggregate, but breaks whenever the plan switches to Hash Aggregate (because that's using accounting). Or maybe the user just redirects the tracking somewhere else (e.g. by supplying a pointer to the v6 of the patch), excluding the context from the accounting entirely. So yeah, I think the inheritance is an important aspect, at least for a generic accounting infrastructure. I'd bet that if we choose to do that, there'll be a good deal of various WTF moments and foot-shooting in the near future ... On 23/8 I proposed to use a separate hierarchy of accounting info, parallel to the memory context hierarchy (but only for contexts with tracking explicitly requested) http://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz but no response so far. I do see ~1% regression on Robert's benchmark (on x86), but it's quite noisy. Not sure about PPC, which is what Robert used. I don't see how to make it faster without sacrificing full inheritance support ... Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
Hi, On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: *** a/src/include/nodes/memnodes.h --- b/src/include/nodes/memnodes.h *** *** 60,65 typedef struct MemoryContextData --- 60,66 MemoryContext nextchild;/* next child of same parent */ char *name; /* context name (just for debugging) */ boolisReset;/* T = no space alloced since last reset */ + uint64 mem_allocated; /* track memory allocated for this context */ #ifdef USE_ASSERT_CHECKING boolallowInCritSection; /* allow palloc in critical section */ #endif That's quite possibly one culprit for the slowdown. Right now one AllocSetContext struct fits precisely into three cachelines. After your change it doesn't anymore. Consider not counting memory in bytes but blocks and adding it directly after the NodeTag. That'd leave the size the same as right now. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Nov 10, 2014 at 3:33 PM, Peter Geoghegan p...@heroku.com wrote: * Documentation clean-up - as I mentioned, I tried to address Simon's concerns here. Also, as you'd expect, the documentation has been fixed up to reflect the new syntax. I'll need to take a pass at updating the UPSERT Wiki page soon, too. I should mention that I've updated the Wiki to reflect the current, post-v1.4 state of affairs. That remains the best place to get a relatively quick overview of where things stand with open items, discussion of the patch, etc: https://wiki.postgresql.org/wiki/UPSERT -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 17.11.2014 08:31, Jeff Davis wrote: On Sat, 2014-11-15 at 21:36 +, Simon Riggs wrote: Do I understand correctly that we are trying to account for exact memory usage at palloc/pfree time? Why?? Not palloc chunks, only tracking at the level of allocated blocks (that we allocate with malloc). It was a surprise to me that accounting at that level would have any measurable impact, but Robert found a reasonable case on a POWER machine that degraded a couple percent. I wasn't able to reproduce it consistently on x86. Or alternatively, can't we just sample the allocations to reduce the overhead? Not sure quite what you mean by sample, but it sounds like something along those lines would work. Maybe. It might also cause unexpected volatility / nondeterminism in the query execution. Imagine a Hash Aggregate, where a small number of the requests are considerably larger than the rest. If you happen to sample only a few of those large requests, the whole hash aggregate happens in-memory, otherwise it starts batching. The users will observe this as random variations in the runtimes - same query / parameters, idle machine, sudden changes to the plan ... Maybe I'm too wary, but I guess there are use cases where the latency uniformity is a concern. Attached is a patch that does something very simple: only tracks blocks held in the current context, with no inheritance or anything like it. This reduces it to a couple arithmetic instructions added to the alloc/dealloc path, so hopefully that removes the general performance concern raised by Robert[1]. To calculate the total memory used, I included a function MemoryContextMemAllocated() that walks the memory context and its children recursively. Of course, I was originally trying to avoid that, because it moves the problem to HashAgg. For each group, it will need to execute MemoryContextMemAllocated() to see if work_mem has been exceeded. 
It will have to visit a couple contexts, or perhaps many (in the case of array_agg, which creates one per group), which could be a measurable regression for HashAgg. :-( But if that does turn out to be a problem, I think it's solvable. First, I could micro-optimize the function by making it iterative rather than recursive, to save on function call overhead. Second, I could offer a way to prevent the HTAB from creating its own context, which would be one less context to visit. And if those don't work, perhaps I could resort to a sampling method of some kind, as you allude to above. Do you plan to try this approach with your hashagg patch, and do some measurements? I'd expect the performance hit to be significant, but I'd like to see some numbers. kind regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] ltree::text not immutable?
Christoph Berg c...@df7cb.de writes: Re: Tom Lane 2014-11-17 6903.1416241...@sss.pgh.pa.us The point is to find out how many people care ... Is the point of this to figure out whether to fix this properly, or to revert to the old code? The current status isn't something that should be released with 9.5. The warning will not be reverted, if that's what you mean. The problem is that chkpass_in is broken by design. We need to find some actual users of the type and see what they think is the least painful solution. (Or, if we find out there aren't any, maybe we'll just flush the whole module ...) I will not promise that it will change by 9.5. It may take some time to collect information, and 9.4 will not have been in wide use for all that long before 9.5 freezes. Well atm this breaks the building of 9.5 packages. This means we are not going to find out if anyone cares by this way :) You will need to adjust your test. This is by no means the first time that we've intentionally shipped contrib modules that generate warnings; there was that messy business with the => operator ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PgBench's \setrandom could be better
On Sat, Nov 15, 2014 at 5:51 PM, David Rowley dgrowle...@gmail.com wrote: With pgbench I can do: \setrandom qty 24 25 then throw a :qty into the query and have pgbench replace that with either 24 or 25. The problem is that this does not work for anything apart from integer types. I've got a pending patch here - which I don't think anyone has reviewed - that adds an expression syntax to pgbench: https://commitfest.postgresql.org/action/patch_view?id=1581 Once that patch goes in, I think we should generalize it so that the \set syntax can use not only operators but also functions, like abs(). We could also generalize it to work on data types other than integers, such as strings or (like you're suggesting here) dates, so that you could have a random_date() function. In general I'm pretty skeptical about continuing to expand pgbench by adding more backslash commands and complicating the syntax of the ones we've already got. We can certainly do it that way, but it's kind of a mess. I'd rather have a more general framework to build on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback
On Thu, Nov 13, 2014 at 12:38 PM, Fujii Masao masao.fu...@gmail.com wrote: Thanks for reviewing the patch! On Thu, Nov 13, 2014 at 4:05 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Fujii Masao wrote: --- 127,152 When this option is used, <application>pg_receivexlog</application> will report a flush position to the server, indicating when each segment has been synchronized to disk so that the server can remove that segment if it ! is not otherwise needed. The <literal>--synchronous</literal> option must ! be specified when making <application>pg_receivexlog</application> run as ! a synchronous standby using a replication slot. Otherwise WAL data ! cannot be flushed frequently enough for this to work correctly. </para> </listitem> </varlistentry> Whitespace damage here. Fixed. + printf(_(" --synchronous flush transaction log in real time\n")); "in real time" sounds odd. How about "flush transaction log immediately after writing", or maybe "have transaction log writes be synchronous". The former sounds better to me. So I chose it. --- 781,791 now = feGetCurrentTimestamp(); /* ! * Issue sync command as soon as there are WAL data which ! * has not been flushed yet if synchronous option is true. */ if (lastFlushPosition < blockpos && walfile != -1 && synchronous) I'd put the synchronous condition first in the if(), and start the comment with it rather than putting it at the end. Both seem weird. Fixed, i.e., moved the synchronous condition first in the if()'s test and also moved "if synchronous option is true" to the front of the comment. progname, current_walfile_name, strerror(errno)); goto error; } lastFlushPosition = blockpos; ! ! /* ! * Send feedback so that the server sees the latest WAL locations ! * immediately if synchronous option is true. ! */ ! if (!sendFeedback(conn, blockpos, now, false)) ! goto error; ! last_status = now; I'm not clear about this comment .. why does it say "if synchronous option is true" when it's not checking the condition?
I added that comment because the code lives within the if() block that checks the synchronous condition. But it seems confusing. Just removed that part from the comment. Attached is the updated version of the patch. I've just pushed this. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review of Refactoring code for sync node detection
On Mon, Nov 17, 2014 at 2:20 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Sun, Nov 16, 2014 at 9:07 PM, Michael Paquier michael.paqu...@gmail.com wrote: I'll send an updated patch tomorrow. Here are updated versions. I divided the patch into two for clarity: the first one is the actual refactoring patch, renaming SyncRepGetSynchronousNode to SyncRepGetSynchronousStandby (+alpha, like updating synchronous to sync in the comments as you mentioned) so that the namings have no conflicts. The second one updates the syncrep code, including WAL sender and WAL receiver, and its surroundings to use the term node instead of standby. This brings into the code the idea that a node using replication APIs is not necessarily a standby, making it more generic. This can be applied on top of the refactoring patch. If any other folks (Heikki or Fujii-san?) have comments about this idea, feel free. I've not read the patches yet. But while I was reading the code around sync node detection, I thought that it might be a good idea to treat the node with priority 1 as special. That is, if we find the active node with the priority 1, we can immediately go out of the sync-node-detection loop. In many cases we can expect that the sync standby has the priority 1. If yes, treating priority 1 as special is a good way to go, I think. Thought? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Sat, Nov 15, 2014 at 7:36 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch incorporates this feedback. The only user-visible difference between this revision and the previous revision is that it's quite possible for two suggestion to originate from the same RTE (there is exactly one change in the regression test's expected output as compared to the last revision for this reason. The regression tests are otherwise unchanged). It's still not possible to see more than 2 suggestions under any circumstances, no matter where they might have originated from, which I think is appropriate -- we continue to not present any HINT in the event of 3 or more equidistant matches. I think that the restructuring required to pass around a state variable has resulted in somewhat clearer code. Cool! I'm grumpy about the distanceName() function. That seems too generic. If we're going to keep this as it is, I suggest something like computeRTEColumnDistance(). But see below. On a related note, I'm also grumpy about this comment: +/* Charge half as much per deletion as per insertion or per substitution */ +return varstr_levenshtein_less_equal(actual, len, match, match_len, + 2, 1, 2, max); The purpose of a code comment is to articulate WHY we did something, rather than simply to restate what the code quite obviously does. I haven't heard a compelling argument for why this should be 2, 1, 2 rather than the default 1, 1, 1; and I'm inclined to do the latter unless you can make some very good argument for this combination of weights. And if you can make such an argument, then there should be comments so that the next person to come along and look at this code doesn't go, huh, that's whacky, and change it. + int location, FuzzyAttrMatchState *rtestate) I suggest calling this fuzzystate rather than rtestate; it's not the state of the RTE, but the state of the fuzzy matching. 
Within the scanRTEForColumn block, we have a rather large chunk of code protected by if (rtestate), which contains the only call to distanceName(). I suggest that we move all of this logic to a separate, static function, and merge distanceName into it. I also suggest testing against NULL explicitly instead of implicitly. So this block of code would end up as something like: if (fuzzystate != NULL) updateFuzzyAttrMatchState(rte, attcolname, colname, fuzzystate); In searchRangeTableForCol, I'm fairly certain that you've changed the behavior by adding a check for if (rte-rtekind == RTE_JOIN) before the call to scanRTEForColumn(). Why not instead put this check into updateFuzzyAttrMatchState? Then you can be sure you're not changing the behavior in any other case. On a similar note, I think the dropped-column test should happen early as well, probably again in updateFuzzyAttrMatchState(). There's little point in adding a suggestion only to throw it away again. +/* + * Charge extra (for inexact matches only) when an alias was + * specified that differs from what might have been used to + * correctly qualify this RTE's closest column + */ +if (wrongalias) +rtestate.distance += 3; I don't understand what situation this is catering to. Can you explain? It seems to account for a good deal of complexity. ERROR: column oid does not exist LINE 1: select oid 0, * from altwithoid; ^ +HINT: Perhaps you meant to reference the column altwithoid.col. That seems like a stretch. I think I suggested before using a distance threshold of at most 3 or half the word length, whichever is less. For a three-letter column name that means not suggesting anything if more than one character is different. What you implemented here is close to that, yet somehow we've got a suggestion slipping through that has 2 out of 3 characters different. I'm not quite sure I see how that's getting through, but I think it shouldn't. 
ERROR: column fullname.text does not exist LINE 1: select fullname.text from fullname; ^ +HINT: Perhaps you meant to reference the column fullname.last. Same problem, only worse! They've only got one letter of four in common. ERROR: column xmin does not exist LINE 1: select xmin, * from fooview; ^ +HINT: Perhaps you meant to reference the column fooview.x. Basically the same problem again. I think the distance threshold in this case should be half the shorter column name, i.e. 0. Your new test cases include no negative test cases; that is, cases where the machinery declines to suggest a hint because of, say, 3 equally good possibilities. They probably should have something like that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
Re: [HACKERS] New Event Trigger: table_rewrite
On Sun, Nov 16, 2014 at 5:51 AM, Simon Riggs si...@2ndquadrant.com wrote: On 16 November 2014 06:59, Michael Paquier michael.paqu...@gmail.com wrote: 1) This patch is authorizing VACUUM and CLUSTER to use the event triggers ddl_command_start and ddl_command_end, but aren't those commands actually not DDLs but control commands? I could go either way on that. I'm happy to remove those from this commit. Yeah, this patch definitely shouldn't change the set of commands to which existing event triggers apply as a side-effect. There's no reason new DDL commands need to apply to the same set of operations as existing DDL commands, but the existing ones shouldn't be changed without specific discussion and agreement. It seems pretty weird, also, that the event trigger will fire after we've taken AccessExclusiveLock when you cluster a particular relation, and before we've taken AccessExclusiveLock when you cluster database-wide. That's more or less an implementation artifact of the current code that we're exposing to the user for, really, no good reason. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 17.11.2014 18:04, Andres Freund wrote: Hi, On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: *** a/src/include/nodes/memnodes.h --- b/src/include/nodes/memnodes.h *** *** 60,65 typedef struct MemoryContextData --- 60,66 MemoryContext nextchild; /* next child of same parent */ char *name; /* context name (just for debugging) */ bool isReset; /* T = no space alloced since last reset */ + uint64 mem_allocated; /* track memory allocated for this context */ #ifdef USE_ASSERT_CHECKING bool allowInCritSection; /* allow palloc in critical section */ #endif That's quite possibly one culprit for the slowdown. Right now one AllocSetContext struct fits precisely into three cachelines. After your change it doesn't anymore. I'm no PPC64 expert, but I thought the cache lines are 128 B on that platform, since at least Power6? Also, if I'm counting right, the MemoryContextData structure is 56B without the 'mem_allocated' field (and without the allowInCritSection), and 64B with it (at that particular place). sizeof() seems to confirm that. (But I'm on x86, so maybe the alignment on PPC64 is different?). Consider not counting memory in bytes but blocks and adding it directly after the NodeTag. That'd leave the size the same as right now. I suppose you mean kbytes, because the block size is not fixed. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote: On 17.11.2014 18:04, Andres Freund wrote: Hi, On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: *** a/src/include/nodes/memnodes.h --- b/src/include/nodes/memnodes.h *** *** 60,65 typedef struct MemoryContextData --- 60,66 MemoryContext nextchild;/* next child of same parent */ char *name; /* context name (just for debugging) */ boolisReset;/* T = no space alloced since last reset */ + uint64 mem_allocated; /* track memory allocated for this context */ #ifdef USE_ASSERT_CHECKING boolallowInCritSection; /* allow palloc in critical section */ #endif That's quite possibly one culprit for the slowdown. Right now one AllocSetContext struct fits precisely into three cachelines. After your change it doesn't anymore. I'm no PPC64 expert, but I thought the cache lines are 128 B on that platform, since at least Power6? Hm, might be true. Also, if I'm counting right, the MemoryContextData structure is 56B without the 'mem_allocated' field (and without the allowInCritSection), and 64B with it (at that particular place). sizeof() seems to confirm that. (But I'm on x86, so maybe the alignment on PPC64 is different?). The MemoryContextData struct is embedded into AllocSetContext. Consider not counting memory in bytes but blocks and adding it directly after the NodeTag. That'd leave the size the same as right now. I suppose you mean kbytes, because the block size is not fixed. Some coarser size than bytes. Don't really care about the granularity. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 18 November 2014 05:19, Simon Riggs si...@2ndquadrant.com wrote: On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I'd like to throw community folks a question. Did someone have a discussion about the challenge of aggregate push-down across relation joins in the past? It potentially reduces the number of rows to be joined. If we already did, I'd like to check up on the discussion at that time. Yes, I was looking at aggregate pushdown. I think it needs the same changes to aggregates discussed upthread. I have something that I've been working on locally. It's far from ready, but it does work in very simple cases, and shows a nice performance boost. I'll start another thread soon and copy you both in. Perhaps we can share some ideas. Regards David Rowley
Re: [HACKERS] [GENERAL] Performance issue with libpq prepared queries on 9.3 and 9.4
On Thu, Nov 13, 2014 at 7:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: One thing that occurs to me is that if the generic plan estimate comes out much cheaper than the custom one, maybe we should assume that the generic's cost estimate is bogus. Right offhand I can't think of a reason for a custom plan to look worse than a generic one, unless there's a statistical quirk like this one. That's an interesting idea, but what do we do after deciding that it's bogus? The generic plan really can't be cheaper than the custom plan, but it could be the same price, or as close as makes no difference. I have long thought that maybe the custom vs. generic plan decision should have something to do with whether the parameters are MCVs of the table columns they're being compared against. This case is interesting because it demonstrates that querying for a *non*-MCV can require a switch to a custom plan, which is something I don't think I've seen before. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Thu, Nov 13, 2014 at 4:57 PM, Jeff Davis pg...@j-davis.com wrote: On Thu, Nov 13, 2014 at 11:26 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Nov 13, 2014 at 3:38 AM, Jeff Davis pg...@j-davis.com wrote: If two backends both have an exclusive lock on the relation for a join operation, that implies that they need to do their own synchronization, because obviously the lock manager is not doing it for them. This doesn't make sense to me. Why would they need to synchronize access to a relation in order to join it? Unfortunate typo: that was supposed to be joint operation, just meaning that they are working together for something (e.g. CLUSTER, VACUUM FULL as you suggest). Sorry for the confusion. Ah, well, that's different. I meant that the backends need to divide up the work somehow. And if each operator needs to divide up the work before operating, that means we need to change every operator. I think I see your point, which it just so happens Amit articulated to me in different words while we were chatting about this problem this morning. We want to avoid waiving the mutual exclusion provided by the lock manager only to end up re-implementing something very much like the current lock manager to arbitrate resource contention among backends within a parallel group. However, we also don't want the mutual exclusion that the lock manager provides to serve as a barrier to implementing useful parallelism; that is, if the parallel backends want to manage conflicts themselves instead of letting the lock manager do it, that should be possible. In http://www.postgresql.org/message-id/ca+tgmoygplojo+lg1bebos8gdjwjtq0qdmxsyj4ihfyj11t...@mail.gmail.com I propose a new approach. The basic idea is that any heavyweight lock that we hold at the start of a parallel phase can't represent a bona fide conflict between the activities of two different backends, so we arrange to waive conflicts over those locks. 
The problem (as Andres points out) is that those locks could be subsequently re-taken by operations that are not parallel-safe, and the new locks would be granted locally regardless of the shared lock table because the current backend would already hold them. However, I think it's fine to make it the job of the parallelism infrastructure to plug that hole, rather than trying to enforce it in the lock manager. There are LOTS of things that need to be prohibited in parallel mode and only allowed once they've been made sufficiently safe, and in the first version, that's certainly going to include - among other things - any operation that writes data. I think it's OK to make it the job of the people who want to relax those prohibitions to deal with making it safe to do so. To reiterate the basic problem here, if we do nothing at all about the lock manager, a parallel backend can stall trying to grab an AccessExclusiveLock that the user backend already holds, either because the user backend holds an AccessExclusiveLock as well or because some other process is waiting for one; either way, we'll deadlock and not notice. We have to do *something* about that. I'd like to arrive at some consensus on how to move forward here as soon as possible, because the clock is ticking here. http://www.postgresql.org/message-id/ca+tgmoygplojo+lg1bebos8gdjwjtq0qdmxsyj4ihfyj11t...@mail.gmail.com articulates what I currently believe to be the best plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On 17.11.2014 19:46, Andres Freund wrote: On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote: On 17.11.2014 18:04, Andres Freund wrote: Hi, On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: *** a/src/include/nodes/memnodes.h --- b/src/include/nodes/memnodes.h *** *** 60,65 typedef struct MemoryContextData --- 60,66 MemoryContext nextchild;/* next child of same parent */ char *name; /* context name (just for debugging) */ boolisReset;/* T = no space alloced since last reset */ + uint64 mem_allocated; /* track memory allocated for this context */ #ifdef USE_ASSERT_CHECKING boolallowInCritSection; /* allow palloc in critical section */ #endif That's quite possibly one culprit for the slowdown. Right now one AllocSetContext struct fits precisely into three cachelines. After your change it doesn't anymore. I'm no PPC64 expert, but I thought the cache lines are 128 B on that platform, since at least Power6? Hm, might be true. Also, if I'm counting right, the MemoryContextData structure is 56B without the 'mem_allocated' field (and without the allowInCritSection), and 64B with it (at that particular place). sizeof() seems to confirm that. (But I'm on x86, so maybe the alignment on PPC64 is different?). The MemoryContextData struct is embedded into AllocSetContext. Oh, right. That makes is slightly more complicated, though, because AllocSetContext adds 6 x 8B fields plus an in-line array of freelist pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the additional field. There might be some difference because of alignment, but I still don't see how that one additional field might impact cachelines? But if we separated the freelist, that might actually make it faster, at least for calls that don't touch the freelist at all, no? Because most of the palloc() calls will be handled from the current block. 
I tried this on the patch I sent on 23/8 (with the MemoryAccounting hierarchy parallel to the contexts), and I got this (running 10x the reindex, without restarts or such): without the patch - ... 1723933 KB used: CPU 1.35s/4.33u sec elapsed 10.38 sec ... 1723933 KB used: CPU 1.39s/4.19u sec elapsed 9.89 sec ... 1723933 KB used: CPU 1.39s/4.26u sec elapsed 9.90 sec ... 1723933 KB used: CPU 1.45s/4.22u sec elapsed 10.15 sec ... 1723933 KB used: CPU 1.36s/4.20u sec elapsed 9.84 sec ... 1723933 KB used: CPU 1.32s/4.20u sec elapsed 9.96 sec ... 1723933 KB used: CPU 1.50s/4.23u sec elapsed 9.91 sec ... 1723933 KB used: CPU 1.47s/4.24u sec elapsed 9.90 sec ... 1723933 KB used: CPU 1.38s/4.23u sec elapsed 10.09 sec ... 1723933 KB used: CPU 1.37s/4.19u sec elapsed 9.84 sec with the patch (no tracking enabled) ... 1723933 KB used: CPU 1.40s/4.28u sec elapsed 10.79 sec ... 1723933 KB used: CPU 1.39s/4.32u sec elapsed 10.14 sec ... 1723933 KB used: CPU 1.40s/4.16u sec elapsed 9.99 sec ... 1723933 KB used: CPU 1.32s/4.21u sec elapsed 10.16 sec ... 1723933 KB used: CPU 1.37s/4.34u sec elapsed 10.03 sec ... 1723933 KB used: CPU 1.35s/4.22u sec elapsed 9.97 sec ... 1723933 KB used: CPU 1.43s/4.20u sec elapsed 9.99 sec ... 1723933 KB used: CPU 1.39s/4.17u sec elapsed 9.71 sec ... 1723933 KB used: CPU 1.44s/4.16u sec elapsed 9.91 sec ... 1723933 KB used: CPU 1.40s/4.25u sec elapsed 9.99 sec So it's about 9,986 vs 10,068 for averages, and 9,905 vs 9,99 for a median. I.e. ~0.8% difference. That's on x86, though. I wonder what'd be the effect on the PPC machine. Patch attached - it's not perfect, though. For example it should free the freelists at some point, but I feel a bit lazy today. 
Tomas diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index 743455e..d2ecb5b 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -175,7 +175,7 @@ typedef struct AllocSetContext MemoryContextData header; /* Standard memory-context fields */ /* Info about storage allocated in this context: */ AllocBlock blocks; /* head of list of blocks in this set */ - AllocChunk freelist[ALLOCSET_NUM_FREELISTS]; /* free chunk lists */ + AllocChunk *freelist; /* free chunk lists */ /* Allocation parameters for this context: */ Size initBlockSize; /* initial block size */ Size maxBlockSize; /* maximum block size */ @@ -242,6 +242,8 @@ typedef struct AllocChunkData #define AllocChunkGetPointer(chk) \ ((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ)) +static void update_allocation(MemoryContext context, int64 size); + /* * These functions implement the MemoryContext API for AllocSet contexts. */ @@ -250,7 +252,7 @@ static void AllocSetFree(MemoryContext context, void *pointer); static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size); static void AllocSetInit(MemoryContext
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas robertmh...@gmail.com wrote: I'm grumpy about the distanceName() function. That seems too generic. If we're going to keep this as it is, I suggest something like computeRTEColumnDistance(). But see below. Fair point. On a related note, I'm also grumpy about this comment: +/* Charge half as much per deletion as per insertion or per substitution */ +return varstr_levenshtein_less_equal(actual, len, match, match_len, + 2, 1, 2, max); The purpose of a code comment is to articulate WHY we did something, rather than simply to restate what the code quite obviously does. I haven't heard a compelling argument for why this should be 2, 1, 2 rather than the default 1, 1, 1; and I'm inclined to do the latter unless you can make some very good argument for this combination of weights. And if you can make such an argument, then there should be comments so that the next person to come along and look at this code doesn't go, "huh, that's whacky", and change it. Okay. I agree that that deserves a comment. The actual argument for this formulation is that it just seems to work better that way. For example:

postgres=# \d orderlines
    Table "public.orderlines"
   Column    |   Type   | Modifiers
-------------+----------+-----------
 orderlineid | integer  | not null
 orderid     | integer  | not null
 prod_id     | integer  | not null
 quantity    | smallint | not null
 orderdate   | date     | not null

postgres=# select qty from orderlines ;
ERROR: 42703: column "qty" does not exist
LINE 1: select qty from orderlines ;
               ^
HINT: Perhaps you meant to reference the column "orderlines.quantity".

The point is that the fact that the user-supplied qty string has so many fewer characters than what was obviously intended - quantity - deserves to be weighed less. If you change the costing to weigh character deletion as being equal to substitution/addition, this example breaks. I also think it's pretty common to have noise words in every attribute (e.g.
every column in the orderlines table matches orderlines_*), which might otherwise mess things up by overcharging for deletion. Having extra characters in the correctly spelled column name seems legitimately less significant to me. Or, in other words: having actual characters from the misspelling match the correct spelling (and having actual characters given not fail to match) is most important. What was given by the user is more important than what was not given but should have been, which is not generally true for uses of Levenshtein distance. I reached this conclusion through trying out the patch with a couple of real schemas, and seeing what works best. It's hard to express that idea tersely, in a comment, but I guess I'll try. + int location, FuzzyAttrMatchState *rtestate) I suggest calling this fuzzystate rather than rtestate; it's not the state of the RTE, but the state of the fuzzy matching. The idea here was to differentiate this state from the overall range table state (in general, FuzzyAttrMatchState may be one or the other). But okay. Within the scanRTEForColumn block, we have a rather large chunk of code protected by if (rtestate), which contains the only call to distanceName(). I suggest that we move all of this logic to a separate, static function, and merge distanceName into it. I also suggest testing against NULL explicitly instead of implicitly. So this block of code would end up as something like: if (fuzzystate != NULL) updateFuzzyAttrMatchState(rte, attcolname, colname, fuzzystate); Okay. In searchRangeTableForCol, I'm fairly certain that you've changed the behavior by adding a check for if (rte-rtekind == RTE_JOIN) before the call to scanRTEForColumn(). Why not instead put this check into updateFuzzyAttrMatchState? Then you can be sure you're not changing the behavior in any other case. 
I thought that I had avoided changing things (beyond what was advertised as changed in relation to this most recent revision) because I also changed things WRT multiple matches per RTE. It's fuzzy. Anyway, yeah, I could do it there instead. On a similar note, I think the dropped-column test should happen early as well, probably again in updateFuzzyAttrMatchState(). There's little point in adding a suggestion only to throw it away again. Agreed. +/* + * Charge extra (for inexact matches only) when an alias was + * specified that differs from what might have been used to + * correctly qualify this RTE's closest column + */ +if (wrongalias) +rtestate.distance += 3; I don't understand what situation this is catering to. Can you explain? It seems to account for a good deal of complexity. Two cases: 1. Distinguishing between the case where there was an exact match to a column that isn't visible
Re: [HACKERS] [TODO] Track number of files ready to be archived in pg_stat_archiver
On 21 August 2014 09:17, Julien Rouhaud julien.rouh...@dalibo.com wrote: Track number of WAL files ready to be archived in pg_stat_archiver Would it be OK to ask what the purpose of this TODO item is? pg_stat_archiver already has columns for last_archived_wal and last_failed_wal, so you can already work out how many files there must be between then and now. Perhaps that can be added directly to the view, to assist the user in calculating it. Reading the directory itself to count the files is unnecessary, except as a diagnostic. Please don't take it that a TODO item is generally accepted as something that makes sense. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
What happened to this patch? I'm going over something that could use the concept of "clean some stuff up when reading this page, but only if we're already writing" or similar. I see some cases were presented that had a performance decrease. Did we get any numbers for the increase in performance in some other interesting cases? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Performance issue with libpq prepared queries on 9.3 and 9.4
Robert Haas robertmh...@gmail.com writes: On Thu, Nov 13, 2014 at 7:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: One thing that occurs to me is that if the generic plan estimate comes out much cheaper than the custom one, maybe we should assume that the generic's cost estimate is bogus. Right offhand I can't think of a reason for a custom plan to look worse than a generic one, unless there's a statistical quirk like this one. That's an interesting idea, but what do we do after deciding that it's bogus? Keep using custom plans. It's possible that the estimate that's in error is the custom one, but that's not the way to bet IMO, since the custom plan estimate is based on better information. The generic plan really can't be cheaper than the custom plan, but it could be the same price, or as close as makes no difference. Right, and what we want to do is use the generic plan as long as it's close to the same cost (close enough to not justify replanning effort). The trick here is to not be fooled by estimation errors. Can we assume that generic cost < custom cost is always an estimation error? Another idea that occurred to me is to run a planning cycle in which the actual parameter values are made available to the planner, but as estimates not hard constants (this facility already exists, it's just not being used by plancache.c). This would yield cost estimates that are more safely comparable to the custom plan. But I'm not sure that we'd want to expend yet another planning cycle to do this, nor am I sure that we'd want to use such a plan as The Generic Plan. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Performance issue with libpq prepared queries on 9.3 and 9.4
One interesting option would be kicking in an extra, more expensive planning cycle after the Nth run of the query; in general a lot of these planned queries run 1000s of times, so if you add some extra cost at run 100 it may not be prohibitive cost-wise. On Tue, Nov 18, 2014 at 8:27 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Nov 13, 2014 at 7:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: One thing that occurs to me is that if the generic plan estimate comes out much cheaper than the custom one, maybe we should assume that the generic's cost estimate is bogus. Right offhand I can't think of a reason for a custom plan to look worse than a generic one, unless there's a statistical quirk like this one. That's an interesting idea, but what do we do after deciding that it's bogus? Keep using custom plans. It's possible that the estimate that's in error is the custom one, but that's not the way to bet IMO, since the custom plan estimate is based on better information. The generic plan really can't be cheaper than the custom plan, but it could be the same price, or as close as makes no difference. Right, and what we want to do is use the generic plan as long as it's close to the same cost (close enough to not justify replanning effort). The trick here is to not be fooled by estimation errors. Can we assume that generic cost < custom cost is always an estimation error? Another idea that occurred to me is to run a planning cycle in which the actual parameter values are made available to the planner, but as estimates not hard constants (this facility already exists, it's just not being used by plancache.c). This would yield cost estimates that are more safely comparable to the custom plan. But I'm not sure that we'd want to expend yet another planning cycle to do this, nor am I sure that we'd want to use such a plan as The Generic Plan. 
regards, tom lane
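The plan-choice heuristic being debated above can be sketched as a pure decision function. This is only an illustrative model of the ideas in the thread, not the actual plancache.c code; the function name, the 0.5 "bogus estimate" threshold, and the explicit replan-cost parameter are all assumptions made up for the example.

```c
#include <stdbool.h>

/*
 * Illustrative sketch: decide whether to keep building custom plans.
 * generic_cost is the estimated cost of the cached generic plan;
 * avg_custom_cost is the running average cost of custom plans built
 * with actual parameter values; replan_cost models the per-execution
 * planning overhead that a custom plan must pay.
 */
static bool
choose_custom_plan(double generic_cost, double avg_custom_cost,
				   double replan_cost)
{
	/*
	 * Per Tom's point in the thread: the generic plan really can't be
	 * cheaper than the custom plan, since the custom estimate is based
	 * on better information.  So a generic estimate far below the
	 * custom one is treated as bogus, and we keep using custom plans.
	 * The 0.5 factor is an arbitrary illustrative threshold.
	 */
	if (generic_cost < avg_custom_cost * 0.5)
		return true;

	/*
	 * Otherwise, reuse the generic plan whenever it is close enough in
	 * cost that the replanning effort is not justified.
	 */
	return generic_cost > avg_custom_cost + replan_cost;
}
```

Usage note: with this shape, a generic estimate of 10 against a custom average of 100 is distrusted (custom plans keep winning), while a generic estimate of 102 against 100 plus a replan cost of 5 is "close enough" and the generic plan is reused.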
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 17 November 2014 21:09, Alvaro Herrera alvhe...@2ndquadrant.com wrote: What happened to this patch? I'm going over something that could use the concept of "clean some stuff up when reading this page, but only if we're already writing" or similar. I see some cases were presented that had a performance decrease. Did we get any numbers for the increase in performance in some other interesting cases? It's not dead; it just needs more work. Maybe for next CF, or you can pick it up now. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Better support of exported snapshots with pg_dump
On 28 October 2014 11:34, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Oct 28, 2014 at 8:08 PM, Petr Jelinek p...@2ndquadrant.com wrote: Hi, On 28/10/14 03:15, Michael Paquier wrote: Updated patch with those comments addressed is attached. Great, I have no further comments so I consider this patch ready for committer (and will mark it so momentarily). OK thanks! Just as a note - there is the issue you raised yourself about the less than perfect error message, but I really don't think it's worth the trouble to have special handling for it as the message coming from the server is clear enough IMHO. Yeah thinking the same. Let's see what others think (btw, if this code gets committed, be sure to mark Simon as a co-author). Committed. Thanks very much for pushing forwards with this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] proposal: plpgsql - Assert statement
Great, looks good to me, marking as ready for committer. What is wrong with using IF? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Fwd: ability to return number of rows inserted into child partition tables request
Hello. I was trying to get postgres to return the correct number of rows inserted for batch inserts to a partitioned table [using the triggers as suggested here http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html results in it always returning 0 by default]. What I ideally wanted it to do is to be able to insert into just the child partitions, and return the number of rows inserted. It seems the state of the art is either to return the NEW row from the insert trigger [which causes it to also be saved to the parent master table], then define an extra trigger to remove the row from the parent table. So 2 inserts and 1 delete for an insert. [1] Or you can use an unconditional rule and it will return the number of rows updated [however, in this case, since we're using partitioning, I think we need multiple rules, one for each child table]. It is possible for a view to use a trigger and still return the number of rows updated, which provides another work around. (See bottom of [1].) Is there some more elegant way here? It seems odd that partitioned tables basically cannot, without a *lot* of massaging, return the number of rows updated; am I missing something or do I understand ok? [Today this requires people to put in lots of work arounds, like *not* checking the number of rows returned for batch inserts, etc., which is potentially dangerous as well.] Is there, for instance, some work around, like a way to manually cause the count of the number of rows affected by the command to be incremented here? Or possibly conditional rules could be made able to return the output string with number of rows affected (feature request)? I guess this has come up before, FWIW. http://grokbase.com/t/postgresql/pgsql-general/0863bjzths/insert-into-master-table-0-rows-affected-hibernate-problems One way of fixing this would be to allow do instead rules on normal tables, instead of only on views (otherwise we are forced to use a rule, correct me if I'm wrong). 
I'd wager there would be other viable options as well. Thanks! -roger- [1] http://stackoverflow.com/questions/83093/hibernate-insert-batch-with-partitioned-postgresql
[HACKERS] vacuumdb: Help text for --analyze-only.
Hello there, I observe that the help text of vacuumdb for --analyze, --analyze-only, and --analyze-in-stages could do with a little clarification in order to be self-documenting and thus improve the user experience of vacuumdb. The problem is that the sole addition of the word "only" to an otherwise identical text for --analyze and --analyze-only seems rather obscure. My suggestion follows. Best regards, Mats Erik Andersson diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c index 86e6ab3..a07f081 100644 --- a/src/bin/scripts/vacuumdb.c +++ b/src/bin/scripts/vacuumdb.c @@ -435,9 +435,9 @@ help(const char *progname) printf(_("  -v, --verbose                   write a lot of output\n")); printf(_("  -V, --version                   output version information, then exit\n")); printf(_("  -z, --analyze                   update optimizer statistics\n")); - printf(_("  -Z, --analyze-only              only update optimizer statistics\n")); - printf(_("  --analyze-in-stages             only update optimizer statistics, in multiple\n" "                                  stages for faster results\n")); + printf(_("  -Z, --analyze-only              only update optimizer statistics; no vacuum\n")); + printf(_("  --analyze-in-stages             only update statistics, but in multiple\n" "                                  stages for faster results; no vacuum\n")); printf(_("  -?, --help                      show this help, then exit\n")); printf(_("\nConnection options:\n")); printf(_("  -h, --host=HOSTNAME      database server host or socket directory\n"));
Re: [HACKERS] logical decoding - reading a user catalog table
On 11/17/2014 11:34 AM, Andres Freund wrote: Hi, Kevin: CCed you, because it doesn't really look like a logical decoding related issue. On 2014-11-17 11:25:40 -0500, Steve Singer wrote: On 11/17/2014 10:37 AM, Andres Freund wrote: On 2014-11-13 22:23:02 -0500, Steve Singer wrote: Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure in a normal client connection, not the walsender #3 0x006b4978 in GetSerializableTransactionSnapshotInt ( snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>, sourcexid=sourcexid@entry=0) at predicate.c:1738 #4 0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>) at predicate.c:1517 #5 GetSerializableTransactionSnapshot ( snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598 #6 0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200 #7 0x006c0e35 in exec_simple_query ( query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp, ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot), \"pg_catalog\".txid_snapshot_xmax(ev_snapshot), coalesce(ev_provider_xid,\"...) at postgres.c:959 #8 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50, I have no idea if this has anything to do with your recent changes or some other change. I haven't so far been able to replicate that since the first time I saw it. That crash is decidedly odd. Any chance you still have the full backtrace around? Yes I still have the core file Cool, could you show the full thing? Or did you just snip it because it's just the Assert/ExceptionalCondition thing? I just snipped the exception stuff. 
Here is the full thing (gdb) backtrace #0 0x7f6fb22e8295 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7f6fb22eb438 in __GI_abort () at abort.c:90 #2 0x007a08e7 in ExceptionalCondition ( conditionName=conditionName@entry=0x918e88 "!(TransactionIdFollows(snapshot->xmin, PredXact->SxactGlobalXmin))", errorType=errorType@entry=0x7da390 "FailedAssertion", fileName=fileName@entry=0x9182c3 "predicate.c", lineNumber=lineNumber@entry=1738) at assert.c:54 #3 0x006b4978 in GetSerializableTransactionSnapshotInt ( snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>, sourcexid=sourcexid@entry=0) at predicate.c:1738 #4 0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>) at predicate.c:1517 #5 GetSerializableTransactionSnapshot ( snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598 #6 0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200 #7 0x006c0e35 in exec_simple_query ( query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp, ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot), \"pg_catalog\".txid_snapshot_xmax(ev_snapshot), coalesce(ev_provider_xid,\"...) at postgres.c:959 #8 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50, ---Type <return> to continue, or q <return> to quit--- dbname=0x1f5ab30 "test2", username=<optimized out>) at postgres.c:4016 #9 0x0046a08e in BackendRun (port=0x1f83010) at postmaster.c:4123 #10 BackendStartup (port=0x1f83010) at postmaster.c:3797 #11 ServerLoop () at postmaster.c:1576 #12 0x00665eae in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1f59d00) at postmaster.c:1223 #13 0x0046ab58 in main (argc=3, argv=0x1f59d00) at main.c:227 Could you print *snapshot in frame #3? (gdb) p *snapshot $1 = {satisfies = 0x7d0330 <HeapTupleSatisfiesMVCC>, xmin = 412635, xmax = 412637, xip = 0x1f86e40, xcnt = 1, subxcnt = 0, subxip = 0x1fd2190, suboverflowed = 0 '\000', takenDuringRecovery = 0 '\000', copied = 0 '\000', curcid = 0, active_count = 0, regd_count = 0} (gdb) This is in the SSI code... 
I'm not immediately seeing how this could be related to logical decoding. It can't even be an imported snapshot, because the exported snapshot is marked repeatable read. Are you actually using serializable transactions? If so, how and why? Yes, the test client that performs the 'simulated workload' does some serializable transactions. It connects as a normal client via JDBC and does a prepareStatement(SET TRANSACTION ISOLATION LEVEL SERIALIZABLE) and then does some DML. Why? because it seemed like a good thing to include in the test suite. Yes, it's a good reason as the above backtrace proves ;). I'm just trying to understand under which circumstances it happens. The above backtrace looks like it can only be triggered if you do a BEGIN TRANSACTION SERIALIZABLE DEFERRABLE READ ONLY; Is that something you do? The slony remote listener connection does this when it selects from sl_event. We switched to this shortly after 9.1 came out.
Re: [HACKERS] Better support of exported snapshots with pg_dump
On Tue, Nov 18, 2014 at 7:13 AM, Simon Riggs si...@2ndquadrant.com wrote: Committed. Thanks very much for pushing forwards with this. Thanks. -- Michael
Re: [HACKERS] Review of Refactoring code for sync node detection
On Tue, Nov 18, 2014 at 3:03 AM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Nov 17, 2014 at 2:20 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Sun, Nov 16, 2014 at 9:07 PM, Michael Paquier michael.paqu...@gmail.com wrote: I'll send an updated patch tomorrow. Here are updated versions. I divided the patch into two for clarity, the first one is the actual refactoring patch, renaming SyncRepGetSynchronousNode to SyncRepGetSynchronousStandby (+alpha, like updating synchronous to sync in the comments as you mentioned) so that the namings have no conflicts. The second one updates the syncrep code, including WAL sender and WAL receiver, and its surroundings to use the term node instead of standby. This brings in the code the idea that a node using replication APIs is not necessarily a standby, making it more generic. This can be applied on top of the refactoring patch. If any other folks (Heikki or Fujii-san?) have comments about this idea feel free. I've not read the patches yet. But while I was reading the code around sync node detection, I thought that it might be a good idea to treat the node with priority 1 as special. That is, if we find the active node with priority 1, we can immediately go out of the sync-node-detection loop. In many cases we can expect that the sync standby has priority 1. If yes, treating priority 1 as special is a good way to do it, I think. Thoughts? That would really save some cycles. Now we still need to fetch the priority numbers of all nodes for pg_stat_get_wal_senders, so in this case scanning all the entries in the WAL sender array is necessary. This optimization is orthogonal to the refactoring patch btw, no? -- Michael
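The early-exit idea Fujii-san suggests can be sketched in miniature. This is not the actual syncrep.c code; the struct, the field names, and the array-based scan are simplified stand-ins for the real WAL sender shared-memory array.

```c
#define INVALID_PRIORITY 0		/* async or not a sync candidate */

/* Simplified stand-in for a WAL sender slot */
typedef struct WalSenderInfo
{
	int			priority;		/* 1 = highest sync priority */
	int			is_active;		/* has reported a flush position */
} WalSenderInfo;

/*
 * Illustrative sketch: find the active node with the lowest priority
 * number, breaking out as soon as priority 1 is seen, since no better
 * candidate can exist.  Returns the array index, or -1 if no sync
 * candidate is active.
 */
static int
find_sync_node(const WalSenderInfo *senders, int n)
{
	int			best = -1;
	int			best_priority = 0;

	for (int i = 0; i < n; i++)
	{
		int			prio = senders[i].priority;

		if (!senders[i].is_active || prio == INVALID_PRIORITY)
			continue;
		if (best == -1 || prio < best_priority)
		{
			best = i;
			best_priority = prio;
			if (prio == 1)
				break;			/* cannot do better than priority 1 */
		}
	}
	return best;
}
```

As Michael notes, this shortcut only helps callers that want just the sync node; pg_stat_get_wal_senders still has to visit every slot to report each node's priority.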
Re: [HACKERS] proposal: plpgsql - Assert statement
On 11/17/14, 4:58 PM, Simon Riggs wrote: Great, looks good to me, marking as ready for committer. What is wrong with using IF? It's a hell of a lot wordier. I've previously created a more sophisticated assert framework to allow more control over things, but ended up also using it just for simple sanity checking because it was much nicer than typing IF THEN RAISE ERROR END IF. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
[HACKERS] Obsolete debug #define in pg_config_manual.h
Attached patch removes obsolete debugging #define in pg_config_manual.h (RTDEBUG). This was last used in commit 2a8d3d, from 2005. Apparently RTDEBUG is a contraction of R-Tree Debug. -- Peter Geoghegan diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h index 9e25ce0..265dae0 100644 --- a/src/include/pg_config_manual.h +++ b/src/include/pg_config_manual.h @@ -321,7 +321,6 @@ */ /* #define HEAPDEBUGALL */ /* #define ACLDEBUG */ -/* #define RTDEBUG */ /* * Automatic configuration file name for ALTER SYSTEM. 
Re: [HACKERS] vacuumdb: Help text for --analyze-only.
Mats Erik Andersson wrote: Hello there, I observe that the help text of vacuumdb for --analyze, --analyze-only, and --analyze-in-stages could do with a little clarification in order to be self-documenting and thus improve the user experience of vacuumdb. The problem is that the sole addition of the word "only" to an otherwise identical text for --analyze and --analyze-only seems rather obscure. My suggestion follows. -1 for the changes as proposed. I'm not particularly convinced this is bad since the help text rightly presumes the knowledge that the vacuum process also performs analyze, and all the flags do is toggle between the three possible combinations of these two actions - and the special in-stages version. The word "only" quickly communicates that the vacuum does not occur - only the secondary analyze. Help is not typically meant to be self-documenting; it presumes one has read the docs and/or man page for the relevant program and concepts and simply needs a quick reminder as to syntax and capabilities. All that said, I would drop the word "only" and add "without vacuuming" instead if you feel that "only" is too vague. I do not like the phrase "no vacuum"... David J.
Re: [HACKERS] Index scan optimization
On 17 November 2014 16:05, Simon Riggs wrote: I confirm 5% improvement in both short and long index scans, on the least beneficial datatype. Looks likely to be a very positive win overall. Thanks a lot for testing and confirmation. The language used in the comments is not clear enough. I'll do my best to improve that, then look to commit this in about 5 hours. Thanks a lot for support. Please let me know if I also need to add something. Thanks and Regards, Kumar Rajeev Rastogi
Re: [HACKERS] proposal: plpgsql - Assert statement
2014-11-17 23:58 GMT+01:00 Simon Riggs si...@2ndquadrant.com: Great, looks good to me, marking as ready for committer. What is wrong with using IF? It significantly increases code length and decreases readability when you make intensive use of the pattern IF THEN RAISE END IF, checking every parameter, checking every result. RAISE ... WHEN ... is shorter, with the full power of the RAISE statement and possibilities for future enhancement. Regards Pavel -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 18 November 2014 05:19, Simon Riggs si...@2ndquadrant.com wrote: On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I'd like to throw community folks a question. Did someone have a discussion on the challenge of aggregate push-down across relation joins in the past? It potentially reduces the number of rows to be joined. If we already had one, I'd like to check up on the discussion at that time. Yes, I was looking at aggregate pushdown. I think it needs the same changes to aggregates discussed upthread. I have something that I've been working on locally. It's far from ready, but it does work in very simple cases, and shows a nice performance boost. I'll start another thread soon and copy you both in. Perhaps we can share some ideas. Great, it's exactly the kind of functionality that would be valuable to have in core. -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] [TODO] Track number of files ready to be archived in pg_stat_archiver
On Tue, Nov 18, 2014 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote: On 21 August 2014 09:17, Julien Rouhaud julien.rouh...@dalibo.com wrote: Track number of WAL files ready to be archived in pg_stat_archiver Would it be OK to ask what the purpose of this TODO item is? pg_stat_archiver already has a column for last_archived_wal and last_failed_wal, so you can already work out how many files there must be between then and now. Perhaps that can be added directly to the view, to assist the user in calculating it. Reading the directory itself to count the files is unnecessary, except as a diagnostic. Not sure if this holds true in a node freshly started from a base backup with a set of WAL files, or with files manually copied by an operator. Please don't take its presence as a TODO item to mean it is generally accepted that this makes sense. On systems where WAL archiving is slower than WAL generation at peak time, the DBA may want to know how long the queue of WAL files waiting to be archived is. That's IMO something we simply forgot in the first implementation of pg_stat_archiver, and the most direct way to know that is to count the .ready files in archive_status. -- Michael
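The "count the .ready files" approach Michael describes (and later argues should live in the archiver, not in pg_stat_get_archiver()) amounts to a directory scan like the one below. This is an illustrative sketch, not the pgarch.c code; the directory path is a parameter here, whereas a real server would look in pg_xlog/archive_status.

```c
#include <dirent.h>
#include <string.h>

/*
 * Count files whose names end in ".ready" in the given directory.
 * Returns the count, or -1 if the directory cannot be opened.
 * Sketch only: the real archiver also validates the base file name.
 */
static long
count_ready_files(const char *dirpath)
{
	DIR		   *dir = opendir(dirpath);
	struct dirent *de;
	long		count = 0;

	if (dir == NULL)
		return -1;

	while ((de = readdir(dir)) != NULL)
	{
		size_t		len = strlen(de->d_name);

		/* match a trailing ".ready" suffix */
		if (len > 6 && strcmp(de->d_name + len - 6, ".ready") == 0)
			count++;
	}
	closedir(dir);
	return count;
}
```

The cost concern in the follow-up message is visible here: the scan is O(number of status files), so calling it from a monitoring view on a backlogged archiver could be expensive, which is why the patch instead has the archiver track the count incrementally.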
Re: [HACKERS] [TODO] Track number of files ready to be archived in pg_stat_archiver
On Wed, Oct 22, 2014 at 12:50 AM, Brightwell, Adam adam.brightw...@crunchydatasolutions.com wrote: Though, I would think that the general desire would be to keep the patch relevant ONLY to the necessary changes. I would not qualify making those types of changes as relevant, IMHO. I do think this is potential for cleanup, however, I would suspect that would be best done in a separate patch. But again, I'd defer to a committer whether such changes are even necessary/acceptable. I have been looking at this patch, and I think that it is a mistake to count the .ready files present in archive_status when calling pg_stat_get_archiver(). If there are many files waiting to be archived, this penalizes the run time of this function, and the application behind it relying on those results, not to mention that the loop used to count the .ready files is actually a copy of what is in pgarch.c. Hence I think that we should simply count them in pgarch_readyXlog, and then return a value back to pgarch_ArchiverCopyLoop, a value that could be decremented by 1 each time a file is successfully archived to keep the stats as precise as possible, and to let the user see useful information while the archiver process is within a single loop of pgarch_ArchiverCopyLoop. This way, we just need to extend PgStat_MsgArchiver with a new counter to track this number. The attached patch, based on v2 sent previously, does so. Thoughts? -- Michael From e12a1aff3f1b423da5277cccf2a76ec09318567a Mon Sep 17 00:00:00 2001 From: Michael Paquier mich...@otacoo.com Date: Tue, 18 Nov 2014 16:30:23 +0900 Subject: [PATCH] Track number of files marked as ready for archiving in pg_stat_archiver This number of files is directly tracked by the archiver process, which then reports the number it finds to the stats machinery. Note that when the archiver marks a file as successfully archived, it decrements by one the number of files waiting to be archived, giving more precise information to the user. 
--- doc/src/sgml/monitoring.sgml | 5 + src/backend/catalog/system_views.sql | 1 + src/backend/postmaster/pgarch.c | 33 +++-- src/backend/postmaster/pgstat.c | 6 +- src/backend/utils/adt/pgstatfuncs.c | 21 +++-- src/include/catalog/pg_proc.h | 2 +- src/include/pgstat.h | 5 - src/test/regress/expected/rules.out | 3 ++- 8 files changed, 52 insertions(+), 24 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index b29e5e6..4f4ac73 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -870,6 +870,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser <entry>Time of the last failed archival operation</entry> </row> <row> + <entry><structfield>ready_count</></entry> + <entry><type>bigint</type></entry> + <entry>Number of files waiting to be archived</entry> + </row> + <row> <entry><structfield>stats_reset</></entry> <entry><type>timestamp with time zone</type></entry> <entry>Time at which these statistics were last reset</entry> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index a819952..195769c 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -729,6 +729,7 @@ CREATE VIEW pg_stat_archiver AS s.failed_count, s.last_failed_wal, s.last_failed_time, + s.ready_count, s.stats_reset FROM pg_stat_get_archiver() s; diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 6a5c5b0..7f5b813 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -100,7 +100,7 @@ static void pgarch_waken_stop(SIGNAL_ARGS); static void pgarch_MainLoop(void); static void pgarch_ArchiverCopyLoop(void); static bool pgarch_archiveXlog(char *xlog); -static bool pgarch_readyXlog(char *xlog); +static int64 pgarch_readyXlog(char *xlog); static void pgarch_archiveDone(char *xlog); @@ -440,6 +440,7 @@ static void pgarch_ArchiverCopyLoop(void) { char xlog[MAX_XFN_CHARS + 1]; + int64 ready_count; /* * loop through all xlogs with archive_status of 
.ready and archive @@ -447,7 +448,7 @@ pgarch_ArchiverCopyLoop(void) * some backend will add files onto the list of those that need archiving * while we are still copying earlier archives */ - while (pgarch_readyXlog(xlog)) + while ((ready_count = pgarch_readyXlog(xlog)) != 0) { int failures = 0; @@ -488,11 +489,16 @@ pgarch_ArchiverCopyLoop(void) pgarch_archiveDone(xlog); /* + * File has been archived, reducing by one the entries waiting + * to be archived. + */ +ready_count--; + +/* * Tell the collector about the WAL file that we successfully * archived */ -pgstat_send_archiver(xlog, false); - +pgstat_send_archiver(xlog, ready_count, false); break;
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Mon, 2014-11-17 at 18:04 +0100, Andres Freund wrote: That's quite possibly one culprit for the slowdown. Right now one AllocSetContext struct fits precisely into three cachelines. After your change it doesn't anymore. Thank you! I made the same mistake as Tomas: I saw that MemoryContextData went to 64 bytes and thought that should be fine while ignoring AllocSetContext. I did some tests in the past between a version of my patch that made MemoryContextData 72 bytes and one that made it 64 bytes, but I may have neglected to test against the original 56 byte version. I'm still not able to test to see if this is the real problem, but it seems best to eliminate it anyway. Consider not counting memory in bytes but blocks and adding it directly after the NodeTag. That'd leave the size the same as right now. I can also just move isReset there, and keep mem_allocated as a uint64. That way, if I find later that I want to track the aggregated value for the child contexts as well, I can split it into two uint32s. I'll hold off on any such optimizations until I see some numbers from HashAgg though. Attached new version. Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 438,451 AllocSetContextCreate(MemoryContext parent, Size initBlockSize, Size maxBlockSize) { ! AllocSet context; /* Do the type-independent part of context creation */ ! context = (AllocSet) MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! &AllocSetMethods, ! parent, ! name); /* * Make sure alloc parameters are reasonable, and save them. --- 438,454 Size initBlockSize, Size maxBlockSize) { ! AllocSet set; ! MemoryContext context; /* Do the type-independent part of context creation */ ! context = MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! &AllocSetMethods, ! parent, ! name); ! ! set = (AllocSet) context; /* * Make sure alloc parameters are reasonable, and save them. 
*** *** 459,467 AllocSetContextCreate(MemoryContext parent, if (maxBlockSize < initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! context->initBlockSize = initBlockSize; ! context->maxBlockSize = maxBlockSize; ! context->nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be --- 462,470 if (maxBlockSize < initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! set->initBlockSize = initBlockSize; ! set->maxBlockSize = maxBlockSize; ! set->nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be *** *** 477,486 AllocSetContextCreate(MemoryContext parent, * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! context->allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (context->allocChunkLimit + ALLOC_CHUNKHDRSZ) > (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION)) ! context->allocChunkLimit >>= 1; /* * Grab always-allocated space, if requested --- 480,489 * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! set->allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (set->allocChunkLimit + ALLOC_CHUNKHDRSZ) > (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION)) ! set->allocChunkLimit >>= 1; /* * Grab always-allocated space, if requested *** *** 500,519 AllocSetContextCreate(MemoryContext parent, errdetail("Failed while creating memory context \"%s\".", name))); } ! block->aset = context; block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block->endptr = ((char *) block) + blksize; ! block->next = context->blocks; ! context->blocks = block; /* Mark block as not to be released at reset time */ ! 
context->keeper = block; /* Mark unallocated space NOACCESS; leave the block header alone. */ VALGRIND_MAKE_MEM_NOACCESS(block->freeptr, blksize - ALLOC_BLOCKHDRSZ); } ! return (MemoryContext) context; } /* --- 503,525 errdetail("Failed while creating memory context \"%s\".", name))); } ! ! context->mem_allocated += blksize; ! ! block->aset = set; block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block->endptr = ((char *) block) + blksize; ! block->next = set->blocks; ! set->blocks = block; /* Mark block as not to be released at reset time */ ! set->keeper = block; /* Mark unallocated space
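The accounting the patch above adds can be modeled in miniature: the context bumps a counter whenever it obtains a block from malloc and decrements it when the block is returned, so the total is maintained incrementally instead of walking every block at query time. This is a toy illustration only; the struct and function names are made up, and the real AllocSetContext code has far more machinery (freelists, keeper blocks, chunk headers).

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for the mem_allocated field added to MemoryContextData */
typedef struct ToyContext
{
	uint64_t	mem_allocated;	/* bytes currently obtained from malloc */
} ToyContext;

/* Allocate a block and account for it in the context. */
static void *
toy_alloc_block(ToyContext *cxt, size_t blksize)
{
	void	   *block = malloc(blksize);

	if (block != NULL)
		cxt->mem_allocated += blksize;
	return block;
}

/* Free a block and reverse the accounting. */
static void
toy_free_block(ToyContext *cxt, void *block, size_t blksize)
{
	free(block);
	cxt->mem_allocated -= blksize;
}
```

The cacheline discussion in the message is about exactly this field: adding an 8-byte counter to a struct that previously fit in three cachelines can cost measurable performance, which motivates the suggestions to count blocks instead of bytes or to repack the fields.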
Re: [HACKERS] Obsolete debug #define in pg_config_manual.h
On 11/18/2014 03:24 AM, Peter Geoghegan wrote: Attached patch removes obsolete debugging #define in pg_config_manual.h (RTDEBUG). This was last used in commit 2a8d3d, from 2005. Apparently RTDEBUG is a contraction of R-Tree Debug. Removed, thanks. - Heikki