Re: [HACKERS] Redesigning checkpoint_segments
On 2015-02-22 21:24:56 -0500, Robert Haas wrote: On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek p...@2ndquadrant.com wrote: I am wondering a bit about interaction with wal_keep_segments. One thing is that wal_keep_segments is still specified in number of segments and not size units, maybe it would be worth to change it also? And the other thing is that, if set, the wal_keep_segments is the real max_wal_size from the user perspective (not from perspective of the algorithm in this patch, but user does not really care about that) which is somewhat weird given the naming. It seems like wal_keep_segments is more closely related to wal_*min*_size. The idea of both settings is that each is a minimum amount of WAL we want to keep around for some purpose. But they're not quite the same, I guess, because wal_min_size just forces us to keep that many files around - they can be overwritten whenever. wal_keep_segments is an amount of actual WAL data we want to keep around. Would it make sense to require that wal_keep_segments = wal_min_size? I don't think so. Right now checkpoint_segments is a useful tool to relatively effectively control the amount of WAL that needs to be replayed in the event of a crash. wal_keep_segments in contrast doesn't have much to do with the normal working of the system, except that it delays recycling of WAL segments a bit. With a condition like above, how would you set up things that you have 50k segments around for replication (say a good days worth), but that your will never have to replay more than ~800 segments (i.e. something like checkpoint_segments = 800)? Am I missing something? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
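To make the scenario concrete under the new GUC names discussed in this thread, a rough postgresql.conf sketch (16MB WAL segments assumed; the numbers are purely illustrative):

    # keep ~50k segments (~800GB) around for standbys / archiving purposes
    wal_keep_segments = 50000
    # but bound the WAL that crash recovery may have to replay to roughly
    # the old checkpoint_segments = 800 behaviour (~12.8GB)
    max_wal_size = 13GB
    min_wal_size = 1GB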
Re: [HACKERS] Abbreviated keys for Numeric
Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes: Tomas Interesting, but I think Gavin was asking about how much Tomas variation was there for each tested case (e.g. query executed on Tomas the same code / dataset). And in those cases the padding / Tomas alignment won't change, and thus the effects you describe won't Tomas influence the results, no? My point is exactly the fact that since the result is not affected, this variation between runs of the same code is of no real relevance to the question of whether a given change in performance can properly be attributed to a patch. Put it this way: suppose I have a test that when run repeatedly with no code changes takes 6.10s (s=0.025s), and I apply a patch that changes that to 6.26s (s=0.025s). Did the patch have an impact on performance? Now suppose that instead of applying the patch I insert random amounts of padding in an unused function and find that my same test now takes a mean of 6.20s (s=0.058s) when I take the best timing for each padding size and calculate stats across sizes. Now it looks obvious that the actual code of the patch probably wasn't responsible for any change... The numbers used here aren't theoretical; they are obtained by testing a single query - select * from d_flt order by v offset 1000 where d_flt contains 5 million float8 values - over 990 times with 33 different random padding sizes (uniform in 0-32767). Here's a scatter plot, with 3 runs of each padding size so you can see the repeatability: http://tinyurl.com/op9qg8a -- Andrew. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
Amit and I had a long discussion about this on Friday while in Boston together. I previously argued that the master and the slave should be executing the same node, ParallelSeqScan. However, Amit argued persuasively that what the master is doing is really pretty different from what the worker is doing, and that they really ought to be running two different nodes. This led us to cast about for a better design, and we came up with something that I think will be much better. The basic idea is to introduce a new node called Funnel. A Funnel node will have a left child but no right child, and its job will be to fire up a given number of workers. Each worker will execute the plan which is the left child of the funnel. The funnel node itself will pull tuples from all of those workers, and can also (if there are no tuples available from any worker) execute the plan itself. So a parallel sequential scan will look something like this: Funnel Workers: 4 - Partial Heap Scan on xyz What this is saying is that each worker is going to scan part of the heap for xyz; together, they will scan the whole thing. What is the best way to determine the number of workers? Fixed number is an idea. It may also make sense to add a new common field to Path node to introduce how much portion of the node execution can be parallelized, or unavailable to run in parallel. Not on the plan time, we may be able to determine the number according to the number of concurrent workers and number of CPU cores. The neat thing about this way of separating things out is that we can eventually write code to push more stuff down into the funnel. For example, consider this: Nested Loop - Seq Scan on foo - Index Scan on bar Index Cond: bar.x = foo.x Now, if a parallel sequential scan is cheaper than a regular sequential scan, we can instead do this: Nested Loop - Funnel - Partial Heap Scan on foo - Index Scan on bara Index Cond: bar.x = foo.x The problem with this is that the nested loop/index scan is happening entirely in the master. But we can have logic that fixes that by knowing that a nested loop can be pushed through a funnel, yielding this: Funnel - Nested Loop - Partial Heap Scan on foo - Index Scan on bar Index Cond: bar.x = foo.x Now that's pretty neat. One can also imagine doing this with aggregates. Consider: I guess the planner enhancement shall exist around add_paths_to_joinrel(). In case when any underlying join paths that support multi-node execution, the new portion will add Funnel node with these join paths. Just my thought. HashAggregate - Funnel - Partial Heap Scan on foo Filter: x = 1 Here, we can't just push the HashAggregate through the filter, but given infrastructure for we could convert that to something like this: HashAggregateFinish - Funnel - HashAggregatePartial - Partial Heap Scan on foo Filter: x = 1 That'd be swell. You can see that something like this will also work for breaking off an entire plan tree and shoving it down into a worker. The master can't participate in the computation in that case, but it's otherwise the same idea. I believe the entire vision we've discussed around combining aggregate function thread is above, although people primarily considers to apply this feature on aggregate push-down across join. One key infrastructure may be a capability to define the combining function of aggregates. It informs the planner given aggregation support two stage execution. In addition to this, we may need to have a planner enhancement to inject the partial aggregate node during path construction. 
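To make the two-stage aggregation idea concrete, here is a plain-SQL illustration of what a partial/final split of avg() would compute - this only mimics the arithmetic, it is not the proposed executor machinery, and the table contents are invented:

    -- each row of per_worker stands in for one HashAggregatePartial result
    WITH per_worker(partial_sum, partial_count) AS (
        VALUES (150.0, 10),   -- worker 1's chunk of foo
               (230.0, 20),   -- worker 2's chunk
               (120.0, 15)    -- worker 3's chunk
    )
    -- the Funnel / HashAggregateFinish step combines the transition states
    SELECT sum(partial_sum) / sum(partial_count) AS avg_x
    FROM per_worker;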
Probably, we have to set a flag to inform later stage (that will construct Agg plan) the underlying scan/join node takes partial aggregation, thus, final aggregation has to expect state data, instead of usual arguments for row-by-row. Also, I think HashJoin with very large outer relation but unbalanced much small inner is a good candidate to distribute multiple nodes. Even if multi-node HashJoin has to read the small inner relation N-times, separation of very large outer relation will make gain. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSL renegotiation
On 02/22/2015 02:05 PM, Andres Freund wrote: On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote: I honestly wonder why postgres uses renegotiation at all. The motivation that cryptanalysis is easier as more data is sent seems quite far-fetched. I don't think so. There's a fair number of algorithms that can/could be attacked much more easily with lots of data available. Especially if you can guess/know/control some of the data. Additionally renegotiating regularly helps to constrain a possible key leakage to a certain amount of time. With backend connections often being alive for weeks at a time that's not a bad thing. Renegotiation will be removed from future TLS versions because it is considered unnecessary with modern ciphers: https://github.com/tlswg/tls13-spec/issues/38 If ciphers require rekeying, that mechanism will be provided at the TLS layer in the future. I think you could remove renegotiation from PostgreSQL as long as you offer something better than RC4 in the TLS handshake. -- Florian Weimer / Red Hat Product Security -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_dump gets attributes from tables in extensions
On Mon, Feb 23, 2015 at 5:57 PM, Rushabh Lathia rushabh.lat...@gmail.com wrote: Thanks to the handy test case, I was able to replicate the scenario in my local environment. And yes, the tbinfo->dobj.ext_member check in getTableAttrs() does fix the issue. Looking more into the pg_dump code, what I found is that generally PG doesn't use a tbinfo->dobj.ext_member check to ignore an object. Mostly we do this kind of check using tbinfo->dobj.dump (look at dumpTable() for reference). Do you have any particular reason for choosing dobj.ext_member over dobj.dump? Hm. I have been pondering this code more and I am dropping the patch, as adding a check based on either the flag tracking dumpable objects or ext_member visibly breaks the logic that binary upgrade and tables in extensions rely on. Particularly this portion in getExtensionMembership() makes me think so:

    /*
     * Normally, mark the member object as not to be dumped.  But in
     * binary upgrades, we still dump the members individually, since the
     * idea is to exactly reproduce the database contents rather than
     * replace the extension contents with something different.
     */
    if (!dopt->binary_upgrade)
        dobj->dump = false;
    else
        dobj->dump = refdobj->dump;

Regards, -- Michael
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Fri, Feb 20, 2015 at 9:33 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote: - * Wait for more WAL to arrive. Time out after 5 seconds, - * like when polling the archive, to react to a trigger - * file promptly. + * Wait for more WAL to arrive. Time out after the amount of + * time specified by wal_retrieve_retry_interval, like + * when polling the archive, to react to a trigger file promptly. */ WaitLatch(XLogCtl-recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT, - 5000L); + wal_retrieve_retry_interval * 1000L); This change can prevent the startup process from reacting to a trigger file. Imagine the case where the large interval is set and the user want to promote the standby by using the trigger file instead of pg_ctl promote. I think that the sleep time should be 5s if the interval is set to more than 5s. Thought? I disagree here. It is interesting to accelerate the check of WAL availability from a source in some cases for replication, but the opposite is true as well as mentioned by Alexey at the beginning of the thread to reduce the number of requests when requesting WAL archives from an external storage type AWS. Hence a correct solution would be to check periodically for the trigger file with a maximum one-time wait of 5s to ensure backward-compatible behavior. We could reduce it to 1s or something like that as well. You seem to have misunderstood the code in question. Or I'm missing something. The timeout of the WaitLatch is just the interval to check for the trigger file while waiting for more WAL to arrive from streaming replication. Not related to the retry time to restore WAL from the archive. [Re-reading the code...] Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a maximum of 5s then. I also noticed in previous patch that the wait was maximized to 5s. To begin with, a loop should have been used if it was a sleep, but as now patch uses a latch this limit does not make much sense... Patch updated is attached. On second thought, the interval to check the trigger file is very different from the wait time to retry to retrieve WAL. So it seems strange and even confusing to control them by one parameter. If we really want to make the interval for the trigger file configurable, we should invent new GUC for it. But I don't think that it's worth doing that. If someone wants to react the trigger file more promptly for fast promotion, he or she basically can use pg_ctl promote command, instead. Thought? Hm, OK. Attached is the updated version of the patch. I changed the parameter so that it doesn't affect the interval of checking the trigger file. -static pg_time_t last_fail_time = 0; -pg_time_tnow; +TimestampTz now = GetCurrentTimestamp(); +TimestampTzlast_fail_time = now; I reverted the code here as it was. I don't think GetCurrentTimestamp() needs to be called for each WaitForWALToBecomeAvailable(). +WaitLatch(XLogCtl-recoveryWakeupLatch, + WL_LATCH_SET | WL_TIMEOUT, + wait_time / 1000); Don't we need to specify WL_POSTMASTER_DEATH flag here? Added. Yeah, I am wondering though why this has not been added after 89fd72cb though. 
+{wal_retrieve_retry_interval, PGC_SIGHUP, WAL_SETTINGS, WAL_SETTINGS should be REPLICATION_STANDBY? Changed. Sure. So I pushed the patch. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
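For reference, the committed parameter is a standby-side postgresql.conf setting; a hedged usage sketch for the slow-archive case Alexey raised earlier in the thread:

    # wait longer between restore_command attempts when WAL is fetched from
    # slow or pay-per-request archive storage; the default is 5s
    wal_retrieve_retry_interval = 30s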
[HACKERS] ERROR: cannot GetMultiXactIdMembers() during recovery
Hi, Andres asked me on IRC to report this here. Since we upgraded our standby servers to 9.1.15 (though the master is still running 9.1.14), we've seen the error in $SUBJECT a number of times. I managed to reproduce it today by running the same query over and over again, and attached is the back trace. Let me know if you need any additional information. .m #0 GetMultiXactIdMembers (multi=56513428, xids=0x7fff9e3691e8) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/transam/multixact.c:923 #1 0x7f029a917dd4 in MultiXactIdIsRunning (multi=optimized out) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/transam/multixact.c:386 #2 0x7f029abad79d in HeapTupleSatisfiesVacuum (tuple=0x7f0145f2c060, OldestXmin=2418699920, buffer=111945) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/utils/time/tqual.c:1184 #3 0x7f029a8f08f5 in index_getnext (scan=scan@entry=0x7f029b66ac78, direction=direction@entry=ForwardScanDirection) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/index/indexam.c:644 #4 0x7f029a9fe646 in IndexNext (node=node@entry=0x7f029b669550) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeIndexscan.c:78 #5 0x7f029a9f41dc in ExecScanFetch (recheckMtd=0x7f029a9fe5c0 IndexRecheck, accessMtd=0x7f029a9fe600 IndexNext, node=0x7f029b669550) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:82 #6 ExecScan (node=node@entry=0x7f029b669550, accessMtd=accessMtd@entry=0x7f029a9fe600 IndexNext, recheckMtd=recheckMtd@entry=0x7f029a9fe5c0 IndexRecheck) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:167 #7 0x7f029a9fe72b in ExecIndexScan (node=node@entry=0x7f029b669550) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeIndexscan.c:146 #8 0x7f029a9ec968 in ExecProcNode (node=node@entry=0x7f029b669550) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:398 #9 0x7f029a9fd3e1 in MultiExecHash (node=node@entry=0x7f029b6690b0) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHash.c:103 #10 0x7f029a9eca74 in MultiExecProcNode (node=node@entry=0x7f029b6690b0) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:536 #11 0x7f029a9fdd94 in ExecHashJoin (node=node@entry=0x7f029b667aa0) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:177 #12 0x7f029a9ec8b8 in ExecProcNode (node=node@entry=0x7f029b667aa0) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:447 #13 0x7f029a9fe319 in ExecHashJoinOuterGetTuple (hashvalue=0x7fff9e369554, hjstate=0x7f029b666c80, outerNode=0x7f029b667aa0) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:656 #14 ExecHashJoin (node=node@entry=0x7f029b666c80) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:209 #15 0x7f029a9ec8b8 in ExecProcNode (node=node@entry=0x7f029b666c80) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:447 #16 0x7f029aa05a59 in ExecSort (node=node@entry=0x7f029b666a10) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSort.c:103 #17 0x7f029a9ec898 in ExecProcNode (node=node@entry=0x7f029b666a10) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:458 #18 0x7f029aa096e0 in begin_partition (winstate=winstate@entry=0x7f029b665530) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeWindowAgg.c:683 #19 
0x7f029aa0b49b in ExecWindowAgg (winstate=winstate@entry=0x7f029b665530) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeWindowAgg.c:1287 #20 0x7f029a9ec868 in ExecProcNode (node=0x7f029b665530) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:470 #21 0x7f029a9f41dc in ExecScanFetch (recheckMtd=0x7f029aa084f0 SubqueryRecheck, accessMtd=0x7f029aa08500 SubqueryNext, node=0x7f029b664e60) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:82 #22 ExecScan (node=node@entry=0x7f029b664e60, accessMtd=accessMtd@entry=0x7f029aa08500 SubqueryNext, recheckMtd=recheckMtd@entry=0x7f029aa084f0 SubqueryRecheck) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:167 #23 0x7f029aa08538 in ExecSubqueryScan (node=node@entry=0x7f029b664e60) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSubqueryscan.c:85 #24 0x7f029a9ec938 in ExecProcNode (node=node@entry=0x7f029b664e60) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:412 #25 0x7f029aa05a59 in ExecSort (node=node@entry=0x7f029b664bf0) at /tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSort.c:103 #26 0x7f029a9ec898 in ExecProcNode
Re: [HACKERS] Enforce creation of destination folders for source files in pg_regress (Was: pg_regress writes into source tree)
On Feb 22, 2015, at 5:41 AM, Michael Paquier michael.paqu...@gmail.com wrote: This is up to the maintainer of each extension to manage their code tree. However I can imagine that some people would be grateful if we allow them to not need sql/ and expected/ containing only one single .gitignore file ignoring everything with *, making the code tree of their extensions cleaner. I don't see this as an improvement. What's wrong with a .gitignore file ignoring everything? ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ERROR: cannot GetMultiXactIdMembers() during recovery
Hi, On 2015-02-23 15:00:35 +0100, Marko Tiikkaja wrote: Andres asked me on IRC to report this here. Since we upgraded our standby servers to 9.1.15 (though the master is still running 9.1.14), we've seen the error in $SUBJECT a number of times. FWIW, I think this is just as borked in 9.1.14 and will likely affect all of 9.0 - 9.2. The problem is that in those releases multixacts aren't maintained on the standby in a way that allows access. index_getnext() itself is actually pretty easy to fix; it already checks whether the scan started while in recovery when using the result of the error-triggering HeapTupleSatisfiesVacuum(), just too late. I don't remember other HTSV callers that can run in recovery, given that DDL is obviously impossible and we don't support serializable while in recovery. Alternatively we could make MultiXactIdIsRunning() return false < 9.3 when in recovery. I think that'd end up fixing things, but it seems awfully fragile to me. I do see a HTSU in pgrowlocks.c - that's not really safe during recovery < 9.3, given it accesses multixacts. I guess it needs to throw an error. I wonder if we shouldn't put an Assert() in HTSV/HTSU to prevent such problems. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
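A hedged sketch of the error suggested for pgrowlocks on pre-9.3 branches - placement and wording are assumptions, not the actual fix:

    /* refuse to inspect row locks on a hot standby rather than touching
     * multixacts, which pre-9.3 standbys cannot look up safely */
    if (RecoveryInProgress())
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("pgrowlocks cannot be used during recovery")));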
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed rahilasye...@gmail.com wrote: Hello, Attached is a patch which has following changes, As suggested above block ID in xlog structs has been replaced by chunk ID. Chunk ID is used to distinguish between different types of xlog record fragments. Like, XLR_CHUNK_ID_DATA_SHORT XLR_CHUNK_ID_DATA_LONG XLR_CHUNK_BKP_COMPRESSED XLR_CHUNK_BKP_WITH_HOLE In block references, block ID follows the chunk ID. Here block ID retains its functionality. This approach increases data by 1 byte for each block reference in an xlog record. This approach separates ID referring different fragments of xlog record from the actual block ID which is used to refer block references in xlog record. I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and the startup process could not replay any WAL records when I set up replication and enabled wal_compression. LOG: record with invalid length at 0/3B0 LOG: record with invalid length at 0/3000518 LOG: Invalid block length in record 0/30005A0 LOG: Invalid block length in record 0/3000D60 ... Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
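To make the description easier to follow, a hedged sketch of the kind of layout being proposed - the identifier names are the ones from Rahila's mail, but the numeric values and comments are purely illustrative and not taken from the patch:

    /* chunk IDs distinguish the different fragments of a WAL record */
    #define XLR_CHUNK_ID_DATA_SHORT   255   /* record's main data, short length */
    #define XLR_CHUNK_ID_DATA_LONG    254   /* record's main data, long length */
    #define XLR_CHUNK_BKP_COMPRESSED  253   /* compressed backup block image */
    #define XLR_CHUNK_BKP_WITH_HOLE   252   /* backup block image with a hole */
    /* for block references, the block ID proper follows the chunk ID in the
     * next byte, which is the extra byte per reference mentioned above */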
Re: [HACKERS] Primary not sending to synchronous standby
Hi, On 2015-02-23 15:25:57 +, Thom Brown wrote: I've noticed that if the primary is started and then a base backup is immediately taken from it and started as as a synchronous standby, it doesn't replicate and the primary hangs indefinitely when trying to run any WAL-generating statements. It only recovers when either the primary is restarted (which has to use a fast shutdown otherwise it also hangs forever), or the standby is restarted. Here's a way of reproducing it: ... Note that if you run the commands one by one, there isn't a problem. If you run it as a script, the standby doesn't connect to the primary. There aren't any errors reported by either the standby or the primary. The primary's wal sender process reports the following: wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158 Anyone know why this would be happening? And if this could be a problem in other scenarios? Given that normally a walsender doesn't wait for syncrep I guess this is the above backend just did authentication. If you gdb into the walsender, what's the backtrace? We previously had discussions about that being rather annoying; I unfortunately don't remember enough of the thread to reference it here. If it really is this, I think we should add some more smarts about only enabling syncrep once a backend is fully up and maybe even remove it from more scenarios during commits generally (e.g. if no xid was assigned and we just pruned something). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
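For anyone wanting to reproduce the diagnostic step being asked for here, a minimal sketch (the PID is that of the process shown as waiting, taken from ps or pg_stat_activity):

    gdb -p "$PID" -batch -ex bt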
[HACKERS] Primary not sending to synchronous standby
Hi, I've noticed that if the primary is started and then a base backup is immediately taken from it and started as as a synchronous standby, it doesn't replicate and the primary hangs indefinitely when trying to run any WAL-generating statements. It only recovers when either the primary is restarted (which has to use a fast shutdown otherwise it also hangs forever), or the standby is restarted. Here's a way of reproducing it: --- mkdir -p -m 0700 primary standby1 initdb -N -k -D primary -E 'UTF8' cat PRIMARYCONFIG primary/postgresql.conf shared_buffers = 8MB logging_collector = on log_line_prefix = '%m - %u - %d' synchronous_standby_names = 'standby1' max_connections = 8 wal_level = 'hot_standby' port = 5530 max_wal_senders = 3 wal_keep_segments = 6 PRIMARYCONFIG cat PRIMARYHBA primary/pg_hba.conf local replication rep_user trust hostreplication rep_user 127.0.0.1/32trust hostreplication rep_user ::1/128 trust PRIMARYHBA pg_ctl start -D primary psql -p 5530 -h localhost -c 'SET SESSION synchronous_commit TO 'off';CREATE USER rep_user REPLICATION;;' -d postgres pg_basebackup -x -D standby1 -h localhost -p 5530 -U rep_user cat STANDBYCONFIG standby1/postgresql.conf port = 5531 hot_standby = on STANDBYCONFIG cat STANDBYRECOVERY standby1/recovery.conf standby_mode = 'on' recovery_target_timeline = 'latest' primary_conninfo = 'host=127.0.0.1 user=rep_user port=5530 application_name=standby1' STANDBYRECOVERY pg_ctl -D standby1 start --- Note that if you run the commands one by one, there isn't a problem. If you run it as a script, the standby doesn't connect to the primary. There aren't any errors reported by either the standby or the primary. The primary's wal sender process reports the following: wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158 Anyone know why this would be happening? And if this could be a problem in other scenarios? Thom
Re: [HACKERS] Primary not sending to synchronous standby
On 2015-02-23 16:38:44 +0100, Andres Freund wrote: I unfortunately don't remember enough of the thread to reference it here. Found the right keywords. The threads below http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau and http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Primary not sending to synchronous standby
On 23 February 2015 at 15:38, Andres Freund and...@2ndquadrant.com wrote: Hi, On 2015-02-23 15:25:57 +, Thom Brown wrote: I've noticed that if the primary is started and then a base backup is immediately taken from it and started as as a synchronous standby, it doesn't replicate and the primary hangs indefinitely when trying to run any WAL-generating statements. It only recovers when either the primary is restarted (which has to use a fast shutdown otherwise it also hangs forever), or the standby is restarted. Here's a way of reproducing it: ... Note that if you run the commands one by one, there isn't a problem. If you run it as a script, the standby doesn't connect to the primary. There aren't any errors reported by either the standby or the primary. The primary's wal sender process reports the following: wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158 Anyone know why this would be happening? And if this could be a problem in other scenarios? Given that normally a walsender doesn't wait for syncrep I guess this is the above backend just did authentication. If you gdb into the walsender, what's the backtrace? #0 0x7f66d1725940 in poll () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00617faa in WaitLatchOrSocket () #2 0x0064741b in SyncRepWaitForLSN () #3 0x004bbf8f in CommitTransaction () #4 0x004be135 in CommitTransactionCommand () #5 0x00757679 in InitPostgres () #6 0x00675032 in PostgresMain () #7 0x004617ef in ServerLoop () #8 0x00627c9c in PostmasterMain () #9 0x0046223d in main () -- Thom
Re: [HACKERS] SSL renegotiation
Florian Weimer wrote: On 02/22/2015 02:05 PM, Andres Freund wrote: On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote: I honestly wonder why postgres uses renegotiation at all. The motivation that cryptanalysis is easier as more data is sent seems quite far-fetched. I don't think so. There's a fair number of algorithms that can/could be attacked much more easily with lots of data available. Especially if you can guess/know/control some of the data. Additionally renegotiating regularly helps to constrain a possible key leakage to a certain amount of time. With backend connections often being alive for weeks at a time that's not a bad thing. Renegotiation will be removed from future TLS versions because it is considered unnecessary with modern ciphers: https://github.com/tlswg/tls13-spec/issues/38 If ciphers require rekeying, that mechanism will be provided at the TLS layer in the future. I think you could remove renegotiation from PostgreSQL as long as you offer something better than RC4 in the TLS handshake. I'd say it is best to wait and see if and how OpenSSL changes its API when it implements TLS 1.3. I'd vote against removing renegotiation. At the very least, if the feature should prove unnecessary and cumbersome with future versions of OpenSSL, we should retain ssl_renegotiation_limit and change the default to 0. It might still be of value with older versions. If changing the encryption key is so useless, why did the TLS workgroup decide to introduce rekeying as a substitute for renegotiation? Yours, Laurenz Albe -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
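For completeness, the setting Laurenz mentions already allows switching the behaviour off per installation; a hedged postgresql.conf sketch (512MB was the default in the releases current at the time of this thread):

    # amount of traffic after which the SSL session keys are renegotiated
    ssl_renegotiation_limit = 0        # 0 disables renegotiation entirely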
Re: [HACKERS] [RFC] Should smgrtruncate() avoid sending sinval message for temp relations
Hi, On 2014-07-26 18:16:01 +0200, Andres Freund wrote: On 2014-07-26 11:32:24 -0400, Tom Lane wrote: MauMau maumau...@gmail.com writes: [ sinval catchup signal - ProcessCatchupEvent - WaitLatch - deadlock ] Ugh. One line of thought is that it's pretty unsafe to be doing anything as complicated as transaction start/commit in a signal handler, even one that is sure it's not interrupting anything else. Yea, that's really not nice. MauMau, we don't do this anymore. Could you verify that the issue is fixed for you? I'd completely forgotten that this thread made me work on moving everything complicated out of signal handlers... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Primary not sending to synchronous standby
On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com wrote: On 2015-02-23 16:38:44 +0100, Andres Freund wrote: I unfortunately don't remember enough of the thread to reference it here. Found the right keywords. The threads below http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau and http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau Yes, this seems to be virtually the same issue reported. The trace looks the same except for RecordTransactionCommit. -- Thom
Re: [HACKERS] multiple backends attempting to wait for pincount 1
On 2015-02-17 13:14:00 -0500, Tom Lane wrote: Hm, good point. On the other hand, should we worry about the possibility of a lost signal? Moving the flag-clearing would guard against that, which the current code does not. But we've not seen field reports of such issues AFAIR, so this might not be an important consideration. I think if there were lost signals there'd be much bigger problems given the same (or in master similar) mechanics are used for a lot of other things including heavyweight and lightweight lock wait queues. ...

    /*
     * Make sure waiter flag is reset - it might not be if
     * ProcWaitForSignal() returned for another reason than UnpinBuffer()
     * signalling us.
     */
    LockBufHdr(bufHdr);
    buf->flags &= ~BM_PIN_COUNT_WAITER;
    Assert(bufHdr->wait_backend_pid == MyProcPid);
    UnlockBufHdr(bufHdr);
    PinCountWaitBuf = NULL;

    /* Loop back and try again */
}

to the bottom of the loop would suffice. No, I disagree. If we maintain the rule that the signaler clears BM_PIN_COUNT_WAITER, then once that happens there is nothing to stop a third party from trying to LockBufferForCleanup on the same buffer (except for table-level locking conventions, which IMO this mechanism shouldn't be dependent on). So this coding would potentially clear the BM_PIN_COUNT_WAITER flag belonging to that third party, and then fail the Assert --- but only in debug builds, not in production, where it would just silently lock up the third-party waiter. So I think having a test to verify that it's still our BM_PIN_COUNT_WAITER flag is essential. Pushed with a test guarding against that. I still think it might be slightly better to error out if somebody else waits, but I guess it's unlikely that we'd mistakenly add code doing that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
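A hedged sketch of the guard being discussed, modelled on the shape of the quoted snippet; the committed code may differ in detail:

    LockBufHdr(bufHdr);
    /* Only clear the flag if it is still *our* wait; a third party may have
     * become the BM_PIN_COUNT_WAITER in the meantime, and clearing its flag
     * would silently lock it up. */
    if ((bufHdr->flags & BM_PIN_COUNT_WAITER) != 0 &&
        bufHdr->wait_backend_pid == MyProcPid)
        bufHdr->flags &= ~BM_PIN_COUNT_WAITER;
    UnlockBufHdr(bufHdr);
    PinCountWaitBuf = NULL;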
Re: [HACKERS] Query Rewrite with Postgres' materialized views
Thank you for your answers. I am very eager to contribute to Postgres, especially in the materialized views area. I have created a thread http://www.postgresql.org/message-id/cak7uwezzxjjsxp9uoxwhpnjjutjajvpkja9skzalsnnrdjs...@mail.gmail.com proposing to work on it as my Google Summer of Code project. It became a consensus that implementing the query rewrite feature would be unfeasible for such a short period of time, but Tomas Vondra suggested I could work on a first part of It, such as making a staleness flag for the MVs. If you have any opinions on the thread, i'd be pleased to read them. Thank you and regards, Eric 2015-02-21 16:09 GMT-02:00 Robert Haas robertmh...@gmail.com: On Sat, Feb 21, 2015 at 1:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: I'm not really sure what Josh was talking about in that answer. In terms of doing this automatically, I doubt that's likely to happen until we have a way to automatically update a materialized view when the underlying data changes --- and Kevin Grittner has done a bunch of work towards that, but more is needed to get us there. Even with auto maintenance, it seems unlikely that matviews would ever be so transparent that it would be okay for the planner to automatically substitute them into a query. The data in a matview is going to be at least a little bit stale, else You're Doing It Wrong. Perhaps, but this is a feature we've gotten MANY customer requests for. IDK why, but it's true. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Redesigning checkpoint_segments
On 02/23/2015 01:01 PM, Andres Freund wrote: On 2015-02-22 21:24:56 -0500, Robert Haas wrote: On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek p...@2ndquadrant.com wrote: I am wondering a bit about interaction with wal_keep_segments. One thing is that wal_keep_segments is still specified in number of segments and not size units, maybe it would be worth to change it also? And the other thing is that, if set, the wal_keep_segments is the real max_wal_size from the user perspective (not from perspective of the algorithm in this patch, but user does not really care about that) which is somewhat weird given the naming. It seems like wal_keep_segments is more closely related to wal_*min*_size. The idea of both settings is that each is a minimum amount of WAL we want to keep around for some purpose. But they're not quite the same, I guess, because wal_min_size just forces us to keep that many files around - they can be overwritten whenever. wal_keep_segments is an amount of actual WAL data we want to keep around. Would it make sense to require that wal_keep_segments = wal_min_size? I don't think so. Right now checkpoint_segments is a useful tool to relatively effectively control the amount of WAL that needs to be replayed in the event of a crash. wal_keep_segments in contrast doesn't have much to do with the normal working of the system, except that it delays recycling of WAL segments a bit. With a condition like above, how would you set up things that you have 50k segments around for replication (say a good days worth), but that your will never have to replay more than ~800 segments (i.e. something like checkpoint_segments = 800)? Right. While wal_keep_segments and wal_min_size both set a kind of a minimum on the amount of WAL that's kept in pg_xlog, they are different things, and a rule that one must be less than or greater than the other doesn't make sense. Everyone seems to be happy with the names and behaviour of the GUCs, so committed. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mogrify and indent features for jsonb
Hi, Petr, thanks for the review. I think it would be better if the indent printing didn't put the start of array ([) and start of dictionary ({) on a separate line Did you mean this?

    [
        {
            "a": 1,
            "b": 2
        }
    ]

I tried to verify this in several ways (http://jsonprettyprint.com/, JSON.stringify, json.tool from python), and I always get this result (new line after [ and {). I don't really understand the point of h_atoi() function. Initially, this function was there to combine the conversion logic and specific verifications. But I agree, strtol is the more correct way; I should improve this. The code looks ok as well except maybe the replacePath could use a couple of comments I already added several comments (and it looks like I should add even more in the near future) for this function in the jsonbx extension, and I think we can update this patch one more time with that improvement. About the Assert(ARR_ELEMTYPE(path) == TEXTOID); I based my work on the hstore extension, which contains this kind of assertion. But I suppose it's not required anymore, so I removed it from the extension. And, I think, we can also remove it from the patch. On 18 February 2015 at 08:32, Petr Jelinek p...@2ndquadrant.com wrote: Hi, I looked at the patch and have several comments. First let me say that modifying the individual paths of the json is the feature I miss the most in the current implementation so I am happy that this patch was submitted. I would slightly prefer the patch split into two parts, one for the indent printing and one with the manipulation functions, but it's not too big a patch so it's not too bad as it is. There is one compiler warning that I see:

    jsonfuncs.c:3533:1: warning: no previous prototype for ‘jsonb_delete_path’ [-Wmissing-prototypes]
     jsonb_delete_path(PG_FUNCTION_ARGS)

I think it would be better if the indent printing didn't put the start of array ([) and start of dictionary ({) on a separate line since most pretty printing implementations outside of Postgres (like JSON.stringify or python's json module) don't do that. This is purely about consistency with the rest of the world. The json_indent might be better named json_pretty, perhaps? I don't really understand the point of the h_atoi() function. What's wrong with using strtol like pg_atoi does? Also there is no integer overflow check anywhere in that function. There is tons of end-of-line whitespace mess in the jsonb_indent docs. Otherwise everything I tried so far works as expected. The code looks ok as well, except maybe replacePath could use a couple of comments (for example the line which uses h_atoi) to make it easier to follow. About the: + /* XXX : why do we need this assertion? The functions is declared to take text[] */ + Assert(ARR_ELEMTYPE(path) == TEXTOID); I wonder about this also, some functions do that, some don't, I am not really sure what is the rule there myself. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
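For readers who have not seen the patch, a hedged sketch of what the two pieces under discussion are expected to do - the function names are the ones visible in this thread, but the exact signatures and behaviour shown here are assumptions, not verified against the patch, and the names may well change before commit:

    -- pretty printing with the indentation style discussed above
    SELECT jsonb_indent('[{"a": 1, "b": 2}]');
    -- path-based manipulation: delete the value at b => c
    SELECT jsonb_delete_path('{"a": 1, "b": {"c": 2}}', '{b,c}');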
Re: [HACKERS] Bug in pg_dump
Le 17/02/2015 14:44, Michael Paquier a écrit : On Tue, Jan 20, 2015 at 8:06 PM, Gilles Darold gilles.dar...@dalibo.com wrote: Le 19/01/2015 14:41, Robert Haas a écrit : On Thu, Jan 15, 2015 at 6:26 AM, Gilles Darold gilles.dar...@dalibo.com wrote: I attach a patch that solves the issue in pg_dump, let me know if it might be included in Commit Fest or if the three other solutions are a better choice. I think a fix in pg_dump is appropriate, so I'd encourage you to add this to the next CommitFest. Ok, thanks Robert. The patch have been added to next CommitFest under the Bug Fixes section. I've also made some review of the patch and more test. I've rewritten some comments and added a test when TableInfo is NULL to avoid potential pg_dump crash. New patch attached: pg_dump.c-extension-FK-v2.patch So, I looked at your patch and I have some comments... The approach taken by the patch looks correct to me as we cannot create FK constraints after loading the data in the case of an extension, something that can be done with a data-only dump. Now, I think that the routine hasExtensionMember may impact performance unnecessarily in the case of databases with many tables, say thousands or more. And we can easily avoid this performance problem by checking the content of each object dumped by doing the legwork directly in getTableData. Also, most of the NULL pointer checks are not needed as most of those code paths would crash if tbinfo is NULL, and actually only keeping the one in dumpConstraint is fine. Yes that's exactly what we discuss at Moscow, thanks for removing the hasExtensionMember() routine. About NULL pointer I'm was not sure that it could not crash on some other parts so I decided to check it everywhere. If that's ok to just check in dumpConstraint() that's fine. On top of those things, I have written a small extension able to reproduce the problem that I have included as a test module in src/test/modules. The dump/restore check is done using the TAP tests, and I have hacked prove_check to install as well the contents of the current folder to be able to use the TAP routines with an extension easily. IMO, as we have no coverage of pg_dump with extensions I think that it would be useful to ensure that this does not break again in the future. Great ! I did not had time to do that, thank you so much. All the hacking I have done during the review results in the set of patch attached: - 0001 is your patch, Gilles, with some comment fixes as well as the performance issue with hasExtensionMember fix - 0002 is the modification of prove_check that makes TAP tests usable with extensions - 0003 is the test module covering the tests needed for pg_dump, at least for the problem of this thread. Gilles, how does that look to you? Looks great to me, I have tested with the postgis_topology extension everything works fine. (Btw, be sure to generate your patches directly with git next time. :) ) Sorry, some old reminiscence :-) Thanks for the review. Best regards, -- Gilles Darold Consultant PostgreSQL http://dalibo.com - http://dalibo.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] json_populate_record issue - TupleDesc reference leak
Hi When I tested json_populate_function in test http://stackoverflow.com/questions/7711432/how-to-set-value-of-composite-variable-field-using-dynamic-sql/28673097#28673097 I found a small issue create type pt as (a int, b int); postgres=# select json_populate_record('(10,20)'::pt, '{}'); WARNING: TupleDesc reference leak: TupleDesc 0x7f10fcf41400 (567018,-1) still referenced json_populate_record -- (10,20) (1 row) jsonb is ok postgres=# select jsonb_populate_record('(10,20)'::pt, '{}'); jsonb_populate_record --- (10,20) (1 row) Regards Pavel
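The warning almost always means a tuple descriptor obtained from the typcache was not released on some code path; a hedged sketch of the pairing involved (variable names are placeholders, and this is not the actual json_populate_record fix):

    /* every lookup must be paired with a release on every way out */
    TupleDesc   tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);

    /* ... use tupdesc to build the result tuple ... */

    ReleaseTupleDesc(tupdesc);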
Re: [HACKERS] Primary not sending to synchronous standby
On 2015-02-23 15:48:25 +, Thom Brown wrote: On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com wrote: On 2015-02-23 16:38:44 +0100, Andres Freund wrote: I unfortunately don't remember enough of the thread to reference it here. Found the right keywords. The threads below http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau and http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau Yes, this seems to be virtually the same issue reported. The trace looks the same except for RecordTransactionCommit. So, I proposed in http://www.postgresql.org/message-id/20140707155113.gb1...@alap3.anarazel.de that we make sequences assign a xid and only wait for syncrep when a xid is assigned. The biggest blocker was that somebody would have to do some code reviewing to find other locations that might need similar treatment. I did a, quick, grep for XLogInsert() and I think we're otherwise fine. There's some debatable cases: * XLOG_STANDBY_LOCK doesn't force a xid to be assigned. I think it's harmless though, as we really only need to wait for that to be replicated if the transaction did something relevant (i.e. catalog changes). And those will force xid assignment. * 2pc records don't assign a xid. But twophase.c does it's own waiting, so that's fine. * Plain vacuums will not trigger waits. But I think that's good. There's really no need to wait if all that's been done is some cleanup without visible consequences. * Fujii brought up that we might want to wait for XLOG_SWITCH - I don't really see why. * XLOG_RESTORE_POINT is a similar candidate - I don't see really valid arguments for making 2pc wait. The attached, untested, patch changes things so that we a) only wait for syncrep if we both wrote WAL and had a xid assigned b) use an async commit if we just had a xid assigned, without having written WAL, even if synchronous_commit = off c) acquire a xid when WAL logging sequence changes (arguable at least one of the xid assignments is redundant, but it doesn't cost anything, so ...) I think it makes sense to change a) and b) that way because there's no need to wait for WAL flushes/syncrep waits when all that happened is manipulations of temporary/unlogged tables or HOT pruning. It's slightly wierd that the on-disk flush and the syncrep wait essentially used two different mechanisms for deciding when to flush. Comments? This is obviously just a POC, but I think something like this does make a great deal of sense. Thom, does that help? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 16828b249b4d2ebfc04678f2570d6c439981e472 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Mon, 23 Feb 2015 17:38:40 +0100 Subject: [PATCH] Reconsider when to flush WAL and wait for syncrep while committing. --- src/backend/access/transam/xact.c | 30 ++ src/backend/commands/sequence.c | 23 +++ 2 files changed, 41 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 97000ef..2818a03 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1031,10 +1031,9 @@ RecordTransactionCommit(void) /* * If we didn't create XLOG entries, we're done here; otherwise we - * should flush those entries the same as a commit record. 
(An - * example of a possible record that wouldn't cause an XID to be - * assigned is a sequence advance record due to nextval() --- we want - * to flush that to disk before reporting commit.) + * should trigger flushing those entries the same as a commit record + * would. This will primarily happen for HOT pruning and the like; we + * want these to be flushed to disk in due time. */ if (!wrote_xlog) goto cleanup; @@ -1153,11 +1152,13 @@ RecordTransactionCommit(void) /* * Check if we want to commit asynchronously. We can allow the XLOG flush * to happen asynchronously if synchronous_commit=off, or if the current - * transaction has not performed any WAL-logged operation. The latter - * case can arise if the current transaction wrote only to temporary - * and/or unlogged tables. In case of a crash, the loss of such a - * transaction will be irrelevant since temp tables will be lost anyway, - * and unlogged tables will be truncated. (Given the foregoing, you might + * transaction has not performed any WAL-logged operation or didn't assign + * a xid. The transaction can end up not writing any WAL, even if it has + * a xid, if it only wrote to temporary and/or unlogged tables. It can + * end up having written WAL without and xid if it did HOT pruning. In + * case of a crash, the loss of such a transaction will be irrelevant; + * temp tables will be lost anyway,
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
On 2/18/15 10:25 AM, David Steele wrote: On 2/18/15 6:11 AM, Fujii Masao wrote: The pg_audit doesn't log BIND parameter values when prepared statement is used. Seems this is an oversight of the patch. Or is this intentional? It's actually intentional - following the model I talked about in my earlier emails, the idea is to log statements only. This also follows on 2ndQuadrant's implementation. Unfortunately, I think it's beyond the scope of this module to log bind variables. I'm following not only 2ndQuadrant's implementation, but Oracle's as well. Logging values is interesting, but I'm sure the user would want to specify which columns to log, which I felt was beyond the scope of the patch. The pg_audit cannot log the statement like SELECT 1 which doesn't access to the database object. Is this intentional? I think that there are many users who want to audit even such statement. I think I see how to make this work. I'll work on it for the next version of the patch. This has been fixed in the v2 patch. Imagine the case where you call the user-defined function which executes many nested statements. In this case, pg_audit logs only top-level statement (i.e., issued directly by client) every time nested statement is executed. In fact, one call of such UDF can cause lots of *same* log messages. I think this is problematic. I agree - not sure how to go about addressing it, though. I've tried to cut down on the verbosity of the logging in general, but of course it can still be a problem. Using security definer and a different logging GUC for the defining role might work. I'll add that to my unit tests and see what happens. That didn't work - but I didn't really expect it to. Here are two options I thought of: 1) Follow Oracle's as session option and only log each statement type against an object the first time it happens in a session. This would greatly reduce logging, but might be too little detail. It would increase the memory footprint of the module to add the tracking. 2) Only log once per call to the backend. Essentially, we would only log the statement you see in pg_stat_activity. This could be a good option because it logs what the user accesses directly, rather than functions, views, etc. which hopefully are already going through a review process and can be audited that way. Would either of those address your concerns? -- - David Steele da...@pgmasters.net signature.asc Description: OpenPGP digital signature
Re: [HACKERS] Abbreviated keys for text cost model fix
Hi, On 22.2.2015 22:30, Peter Geoghegan wrote: On Sun, Feb 22, 2015 at 1:19 PM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: In short, this fixes all the cases except for the ASC sorted data. I haven't done any code review, but I think we want this. I'll use data from the i5-2500k, but it applies to the Xeon too, except that the Xeon results are more noisy and the speedups are not that significant. For the 'text' data type, and 'random' dataset, the results are these:

    scale    datum    cost-model
    ----------------------------
       10     328%          323%
      100     392%          391%
      200      96%          565%
      300      97%          572%
      400      97%          571%
      500      98%          570%

The numbers are speedup vs. master, so 100% means exactly the same speed, 200% means twice as fast. So while the 'datum' patch actually gave a very nice speedup for small datasets - about 3-4x speedup up to 1M rows - for larger datasets we've seen a small regression (~3% slower). With the cost model fix, we actually see a significant speedup (about 5.7x) for these cases. Cool. I haven't verified whether this produces the same results, but if it does this is very nice. For the 'DESC' dataset (i.e. data sorted in reverse order), we get even better numbers, with up to 6.5x speedup on large datasets. But for the 'ASC' dataset (i.e. already sorted data), we get this:

    scale    datum    cost-model
    ----------------------------
       10      85%           84%
      100      87%           87%
      200      76%           96%
      300      82%           90%
      400      91%           83%
      500      93%           81%

Ummm, not that great, I guess :-( You should try it with the data fully sorted like this, but with one tiny difference: The very last tuple is out of order. How does that look? So here are the results for the ASC-ordered dataset, with one 'unsorted' row added to the end of the dataset. As before the complete scripts are attached, and the raw results are available in a spreadsheet: http://bit.ly/18g1nTU The durations are much higher than without the single unsorted row added at the end. Queries often take 20x longer to finish (on the same code), depending on the scale. The speedup results (compared to master) look like this:

    scale    query#    datum    numeric    cost model
    --------------------------------------------------
       10         1     859%       861%          856%
       10         2     811%       814%          805%
       10         3     100%       100%           97%
      100         1     805%       804%          807%
      100         2     769%       773%          770%
      100         3     100%       100%           98%
      200         1      97%        97%          673%
      200         2      96%        97%          646%
      200         3      99%       101%          678%
      300         1      98%        98%          578%
      300         2      96%        97%          557%
      300         3      99%       101%          579%
      400         1      99%        99%          513%
      400         2      97%        98%          497%
      400         3      99%       101%          510%
      500         1      99%        99%          469%
      500         2      97%        98%          456%
      500         3      99%       101%          466%

What's interesting here is that some queries are much faster, but query #3 is slow until we hit 2M rows: select * from (select * from stuff_int_desc order by randint offset 1000) foo Looking at the previous tests, I see this is exactly what's happening to this query with the 'random' dataset - it's slightly slower than master up until 2M rows, when it suddenly jumps to the same speedup as the other queries. Can we do something about that? Anyway, I'm wondering what conclusion we can draw from this? I believe the vast majority of datasets in production won't be perfectly sorted, because when the table is CLUSTERed by an index we tend to use an index scan to do the sort (so no problem), or the data are not actually perfectly sorted (and here we get a significant speedup). -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services scripts.tgz Description: application/tgz -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
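For context, a hedged sketch of how the 'almost sorted' input for a run like this can be built - the table and column names are taken from the quoted query, the row count is illustrative, and the real scripts are in the attached scripts.tgz:

    -- ascending data with a single out-of-order row appended at the end
    CREATE TABLE stuff_int_desc AS
        SELECT i AS randint FROM generate_series(1, 1000000) i ORDER BY i;
    INSERT INTO stuff_int_desc VALUES (0);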
Re: [HACKERS] Abbreviated keys for Numeric
Hi, On 23.2.2015 11:59, Andrew Gierth wrote: Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes: Tomas Interesting, but I think Gavin was asking about how much Tomas variation was there for each tested case (e.g. query executed on Tomas the same code / dataset). And in those cases the padding / Tomas alignment won't change, and thus the effects you describe won't Tomas influence the results, no? My point is exactly the fact that since the result is not affected, this variation between runs of the same code is of no real relevance to the question of whether a given change in performance can properly be attributed to a patch. Put it this way: suppose I have a test that when run repeatedly with no code changes takes 6.10s (s=0.025s), and I apply a patch that changes that to 6.26s (s=0.025s). Did the patch have an impact on performance? Now suppose that instead of applying the patch I insert random amounts of padding in an unused function and find that my same test now takes a mean of 6.20s (s=0.058s) when I take the best timing for each padding size and calculate stats across sizes. Now it looks obvious that the actual code of the patch probably wasn't responsible for any change... The numbers used here aren't theoretical; they are obtained by testing a single query - select * from d_flt order by v offset 1000 where d_flt contains 5 million float8 values - over 990 times with 33 different random padding sizes (uniform in 0-32767). Here's a scatter plot, with 3 runs of each padding size so you can see the repeatability: http://tinyurl.com/op9qg8a I think we're talking about slightly different things, then. I believe Gavin was asking about variability for executions with a particular code (i.e. with fixed amount of padding), to decide whether it even makes sense to compare results for different patches or whether the differences are just random noise. Interpreting those differences - whether they are due to changes in the algorithm or a result of some padding somewhere else in the code, that's of course important too. I believe the small regressions (1-10%) for small data sets, might be caused by this 'random padding' effect, because that's probably where L1/L2 cache is most important. For large datasets the caches are probably not as efficient anyway, so the random padding makes no difference, and the speedup is just as good as for the other queries. See for example this: http://www.postgresql.org/message-id/54eb580c.2000...@2ndquadrant.com But I'm speculating here ... time for a profiler, I guess. -- Tomas Vondrahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Primary not sending to synchronous standby
On 23 February 2015 at 16:53, Andres Freund and...@2ndquadrant.com wrote: On 2015-02-23 15:48:25 +, Thom Brown wrote: On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com wrote: On 2015-02-23 16:38:44 +0100, Andres Freund wrote: I unfortunately don't remember enough of the thread to reference it here. Found the right keywords. The threads below http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau and http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau Yes, this seems to be virtually the same issue reported. The trace looks the same except for RecordTransactionCommit. So, I proposed in http://www.postgresql.org/message-id/20140707155113.gb1...@alap3.anarazel.de that we make sequences assign a xid and only wait for syncrep when a xid is assigned. The biggest blocker was that somebody would have to do some code reviewing to find other locations that might need similar treatment. I did a, quick, grep for XLogInsert() and I think we're otherwise fine. There's some debatable cases: * XLOG_STANDBY_LOCK doesn't force a xid to be assigned. I think it's harmless though, as we really only need to wait for that to be replicated if the transaction did something relevant (i.e. catalog changes). And those will force xid assignment. * 2pc records don't assign a xid. But twophase.c does it's own waiting, so that's fine. * Plain vacuums will not trigger waits. But I think that's good. There's really no need to wait if all that's been done is some cleanup without visible consequences. * Fujii brought up that we might want to wait for XLOG_SWITCH - I don't really see why. * XLOG_RESTORE_POINT is a similar candidate - I don't see really valid arguments for making 2pc wait. The attached, untested, patch changes things so that we a) only wait for syncrep if we both wrote WAL and had a xid assigned b) use an async commit if we just had a xid assigned, without having written WAL, even if synchronous_commit = off c) acquire a xid when WAL logging sequence changes (arguable at least one of the xid assignments is redundant, but it doesn't cost anything, so ...) I think it makes sense to change a) and b) that way because there's no need to wait for WAL flushes/syncrep waits when all that happened is manipulations of temporary/unlogged tables or HOT pruning. It's slightly wierd that the on-disk flush and the syncrep wait essentially used two different mechanisms for deciding when to flush. Comments? This is obviously just a POC, but I think something like this does make a great deal of sense. Thom, does that help? Yeah, this appears to eliminate the problem, at least in the case I reported. Thanks -- Thom
Re: [HACKERS] SSL renegotiation
On 2015-02-23 15:15:31 +0100, Florian Weimer wrote: On 02/22/2015 02:05 PM, Andres Freund wrote: On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote: I honestly wonder why postgres uses renegotiation at all. The motivation that cryptoanalysis is easier as more data is sent seems quite far-fetched. I don't think so. There's a fair number of algorithms that can/could be much easier be attached with lots of data available. Especially if you can guess/know/control some of the data. Additionally renegotiating regularly helps to constrain a possible key leagage to a certain amount of time. With backend connections often being alive for weeks at a time that's not a bad thing. Renegotiation will be removed from future TLS versions because it is considered unnecessary with modern ciphers: https://github.com/tlswg/tls13-spec/issues/38 If ciphers require rekeying, that mechanism will be provided at the TLS layer in the future. I think you could remove renegotiation from PostgreSQL as long as you offer something better than RC4 in the TLS handshake. As far as I understand it, this argument misses an important point. Without protocol level rekeying, handled by the library in the background, never changing the session key pretty much breaks PFS. In http the sessions almost always are short enough that such a consideration doesn't play a role, but it's far from uncommon to have database connections that live weeks and transport hundreds of gigabytes in their lifetime. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSL renegotiation
Renegotiation should be a best practice. The trouble is that it's been broken (at the protocol level) three times in the last few years, so it's a massive hole in practice. Ideally we should leave renegotiation in, and only remove it if configure detects a broken version of TLS. Personal email. hbh...@oxy.edu On Feb 23, 2015, at 7:01 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote: I'd say it is best to wait and see if and how OpenSSL changes their API when they implement TLS 1.3. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
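For reference while this is being debated: the amount of traffic allowed between renegotiations is already tunable in the releases current at the time of this thread. A rough sketch (ssl_renegotiation_limit is the existing GUC; setting it to 0 disables renegotiation, which is effectively what removing the feature would hard-code):

    SHOW ssl_renegotiation_limit;        -- 512MB by default in these releases
    SET ssl_renegotiation_limit = 0;     -- disable renegotiation for this session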
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
Hi Stephen, Thanks for your review. All fixed except for comments below: On 2/17/15 10:34 AM, Stephen Frost wrote: +/* + * Check privileges granted indirectly via role memberships. We do this in + * a separate pass to minimize expensive indirect membership tests. In + * particular, it's worth testing whether a given ACL entry grants any + * privileges still of interest before we perform the has_privs_of_role + * test. + */ I'm a bit on the fence about this.. Can you provide a use-case where doing this makes sense..? Does this mean I could grant admin_role1 to audit and then get auditing on everything that user1 has access to? That seems like it might be useful for environments where such roles already exist though it might end up covering more than is intended.. The idea is that if there are already ready-made roles to be audited then they don't need to be reconstituted for the audit role. You could just do: grant admin_role to audit; grant user_role to audit; Of course, we could list multiple roles in the pg_audit.role GUC, but I thought this would be easier to use and maintain since there was some worry about GUCs being fragile when they refer to database objects. + +# test.pl - pgAudit Unit Tests + This is definitiely nice.. diff --git a/doc/src/sgml/pgaudit.sgml b/doc/src/sgml/pgaudit.sgml new file mode 100644 index 000..f3f4ab9 --- /dev/null +++ b/doc/src/sgml/pgaudit.sgml @@ -0,0 +1,316 @@ +!-- doc/src/sgml/pgaudit.sgml -- + +sect1 id=pgaudit xreflabel=pgaudit + titlepg_audit/title There seems to be a number of places which are 'pgaudit' and a bunch that are 'pg_audit'. I'm guessing you were thinking 'pg_audit', but it'd be good to clean up and make them all consistent. Fixed, though I still left the file name as pgaudit.sgml since all but one module follow that convention. +para + Session auditing allows the logging of all commands that are executed by + a user in the backend. Each command is logged with a single entry and + includes the audit type (e.g. literalSESSION/literal), command type + (e.g. literalCREATE TABLE/literal, literalSELECT/literal) and + statement (e.g. literalselect * from test/literal). + + Fully-qualified names and object types will be logged for + literalCREATE/literal, literalUPDATE/literal, and + literalDROP/literal commands on literalTABLE/literal, + literalMATVIEW/literal, literalVIEW/literal, + literalINDEX/literal, literalFOREIGN TABLE/literal, + literalCOMPOSITE TYPE/literal, literalINDEX/literal, and + literalSEQUENCE/literal objects as well as function calls. +/para Ah, you do have a list of what objects you get fully qualified names for, nice. Are there obvious omissions from that list..? If so, we might be able to change what happens with objectAccess to include them.. It seems like these are the objects where having a name really matters. I'm more interested in using the deparse code to handle fully-qualified names for additional objects rather than adding hooks. +sect3 + titleExamples/title + + para +Set literalpgaudit.log = 'read, ddl'/literal in +literalpostgresql.conf/literal. + /para Perhaps I missed it, but it'd be good to point out that GUCs can be set at various levels. I know we probably say that somewhere else, but it's particularly relevant for this. Yes, it's very relevant for this patch. Do you think it's enough to call out the functionality, or should I provide examples? Maybe a separate section just for this concept? Patch v2 is attached for all changes except the doc change above. 
-- - David Steele da...@pgmasters.net diff --git a/contrib/Makefile b/contrib/Makefile index 195d447..d8e75f4 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -29,6 +29,7 @@ SUBDIRS = \ pageinspect \ passwordcheck \ pg_archivecleanup \ + pg_audit\ pg_buffercache \ pg_freespacemap \ pg_prewarm \ diff --git a/contrib/pg_audit/Makefile b/contrib/pg_audit/Makefile new file mode 100644 index 000..32bc6d9 --- /dev/null +++ b/contrib/pg_audit/Makefile @@ -0,0 +1,20 @@ +# pg_audit/Makefile + +MODULE = pg_audit +MODULE_big = pg_audit +OBJS = pg_audit.o + +EXTENSION = pg_audit + +DATA = pg_audit--1.0.0.sql + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_audit +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_audit/pg_audit--1.0.0.sql b/contrib/pg_audit/pg_audit--1.0.0.sql new file
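To illustrate the point above that GUCs can be set at various levels, here is a sketch of how the patch's settings could be applied at different scopes; the setting name follows the documentation excerpt quoted above, while the class values other than 'read, ddl' and the database/role names are placeholders:

    -- cluster-wide, in postgresql.conf (or via ALTER SYSTEM)
    --   pgaudit.log = 'read, ddl'
    -- for one database
    ALTER DATABASE sales SET pgaudit.log = 'write, ddl';
    -- for one role
    ALTER ROLE app_auditor SET pgaudit.log = 'all';
    -- for the current session
    SET pgaudit.log = 'read';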
Re: [HACKERS] Allow snapshot too old error, to prevent bloat
On Sun, Feb 22, 2015 at 10:18 PM, Kevin Grittner kgri...@ymail.com wrote: Amit Kapila amit.kapil...@gmail.com wrote: It seems to me that SQL Server also uses similar mechanism to avoid the bloat in version store (place to store previous versions or record). I think if other leading databases provide a way to control the bloat, it indicates that most of the customers having write-intesive workload would like to see such an option. Yeah, I think many users are surprised by the damage that can be done to a database by an idle connection or long-running read-only query, especially when coming from other databases which have protections against that. The bad news is that changing to specifying the limit in time rather than transaction IDs is far more complicated than I thought. Basically, we need to associate times in the past with the value of latestCompletedXid (or derived values like snapshot xmin) at those times. I was initially thinking that by adding a timestamp to the snapshot we could derive this from active snapshots. For any one connection, it's not that hard for it to scan the pairingheap of snapshots and pick out the right values; but the problem is that it is process-local memory and this needs to work while the connection is sitting idle for long periods of time. Could you please explain in slightly more detail why can't it work if we use timestamp instead of snapshot-xmin in your patch in function TestForOldSnapshot()? I've got a couple ideas on how to approach it, but neither seems entirely satisfying: (1) Use a new timer ID to interrupt the process whenever its oldest snapshot which hasn't yet crossed the old snapshot threshold does so. The problem is that this seems as though it would need to do an awful lot more work (scanning the pairing heap for the best match) than anything else is currently doing in a timeout handler. One alternative would be to keep a second pairing heap to track snapshots which have not crossed the threshold, removing items as they get old -- that would make the work needed in the timeout handler closer to other handlers. (2) Use a course enough granularity on time and a short enough maximum for the GUC to just keep a circular buffer of the mappings in memory. We might be able to make this dense enough that one minute resolution for up to 60 days could fit in 338kB. Obviously that could be reduced with courser granularity or a shorter maximum. How exactly will this allow to return snapshot old error, do we need a check for page (page lsn) as in your current patch? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] POLA violation with \c service=
David Fetter wrote: My thinking behind this was that the patch is a bug fix and intended to be back-patched, so I wanted to mess with as little infrastructure as possible. A new version of libpq seems like a very big ask for such a case. You'll recall that the original problem was that \c service=foo only worked accidentally for some pretty narrow use cases and broke without much of a clue for the rest. It turned out that the general problem was that options given to psql on the command line were not even remotely equivalent to \c, even though they were documented to be. So, in view of these arguments and those put forward by Pavel downthread, I think the attached is an acceptable patch for the master branch. It doesn't apply to back branches though; 9.4 and 9.3 have a conflict in tab-complete.c, 9.2 has additional conflicts in command.c, and 9.1 and 9.0 are problematic all over because they don't have src/common. Could you please submit patches adapted for each group of branches? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Abbreviated keys for text cost model fix
On Mon, Feb 23, 2015 at 11:17 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: So while it's true that for the 3rd query we get much worse results compared to the other queries (i.e. we don't get 400% speedup but ~3% slowdown compared to master), it's true that master performs exceptionally well for this query with small datasets. Once we get to 2M rows, the master performance drops significantly but cost-model keeps the performance characteristics and the speedup jumps back to ~700% which is nice. These numbers are for the 'ASC + unsorted row' test, but I do get exactly the same pattern for the 'random' tests done previously. Yeah. Looks like you're comparing a case where the old cost model did the right thing anyway (i.e. used abbreviation). The difference would then be entirely explainable as noise. Right? It would be nice if we could address the 3% regression for the last query, but I guess it's not a big deal. The numbers in general are absolutely impressive. Kudos. Thanks. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] POLA violation with \c service=
Here's the real attachment. Previous one was a misguided shell redirection. Meh. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services From 830d41b9d23716af29491cbab59808c35e25ec12 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera alvhe...@alvh.no-ip.org Date: Fri, 20 Feb 2015 14:36:41 -0300 Subject: [PATCH] Fix psql's \c to work with URIs and conninfo strings MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This makes it more consistent with what can be given in the command line. Authors: Andrew Dunstan, David Fetter, Andrew Gierth, Pavel Stěhule --- doc/src/sgml/ref/psql-ref.sgml| 30 ++--- src/bin/psql/command.c| 69 +++ src/bin/psql/tab-complete.c | 20 ++-- src/common/Makefile | 2 +- src/common/connstrings.c | 52 + src/include/common/connstrings.h | 16 + src/interfaces/libpq/fe-connect.c | 36 +--- 7 files changed, 162 insertions(+), 63 deletions(-) create mode 100644 src/common/connstrings.c create mode 100644 src/include/common/connstrings.h diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index a637001..a218af8 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -796,18 +796,23 @@ testdb=gt; /varlistentry varlistentry -termliteral\c/literal or literal\connect/literal literal[ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ]/literal/term +termliteral\c [ { [ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ] | replaceable class=parameterconninfo/replaceable string } ] /literal/term +termliteral\connect [ { [ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ] | replaceable class=parameterconninfo/replaceable string } ] /literal/term listitem para Establishes a new connection to a productnamePostgreSQL/ -server. If the new connection is successfully made, the -previous connection is closed. If any of replaceable +server using positional parameters as described below, a +parameterconninfo/parameter string, or a acronymURI/acronym. +If the new connection is successfully made, the +previous connection is closed. +When using positional parameters, if any of replaceable class=parameterdbname/replaceable, replaceable class=parameterusername/replaceable, replaceable class=parameterhost/replaceable or replaceable class=parameterport/replaceable are omitted or specified as literal-/literal, the value of that parameter from the -previous connection is used. If there is no previous +previous connection is used. +If using positional parameters and there is no previous connection, the applicationlibpq/application default for the parameter's value is used. /para @@ -822,6 +827,23 @@ testdb=gt; mechanism that scripts are not accidentally acting on the wrong database on the other hand. 
/para + +para +Positional syntax: +programlisting +=gt; \c mydb myuser host.dom 6432 +/programlisting +/para + +para +The conninfo form detailed in xref linkend=LIBPQ-CONNSTRING +takes two forms: keyword-value pairs, and URIs: +/para +programlisting +=gt; \c service=foo +=gt; \c host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable +=gt; \c postgresql://tom@localhost/mydb?application_name=myapp +/programlisting /listitem /varlistentry diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 7c9f28d..7d8d5bb 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -1607,6 +1607,8 @@ do_connect(char *dbname, char *user, char *host, char *port) PGconn *o_conn = pset.db, *n_conn; char *password = NULL; + bool keep_password = true; + bool has_connection_string = false; if (!o_conn (!dbname || !user || !host || !port)) { @@ -1620,15 +1622,31 @@ do_connect(char *dbname, char *user, char *host, char *port) return false; } - if (!dbname) - dbname = PQdb(o_conn); if (!user) user = PQuser(o_conn); + if (!host) host = PQhost(o_conn); + if (!port) port = PQport(o_conn); + if (dbname) + has_connection_string = recognized_connection_string(dbname); + + keep_password = +
Re: [HACKERS] [RFC] LSN Map
On 01/13/2015 01:22 PM, Marco Nenciarini wrote: Il 08/01/15 20:18, Jim Nasby ha scritto: On 1/7/15, 3:50 AM, Marco Nenciarini wrote: The current implementation tracks only heap LSN. It currently does not track any kind of indexes, but this can be easily added later. Would it make sense to do this at a buffer level, instead of at the heap level? That means it would handle both heap and indexes. I don't know if LSN is visible that far down though. Where exactly you are thinking to handle it? Dunno, but Jim's got a point. This is a maintenance burden to all indexams, if they all have to remember to update the LSN map separately. It needs to be done in some common code, like in PageSetLSN or XLogInsert or something. Aside from that, isn't this horrible from a performance point of view? The patch doubles the buffer manager traffic, because any update to any page will also need to modify the LSN map. This code is copied from the visibility map code, but we got away with it there because the VM only needs to be updated the first time a page is modified. Subsequent updates will know the visibility bit is already cleared, and don't need to access the visibility map. Ans scalability: Whether you store one value for every N pages, or the LSN of every page, this is going to have a huge effect of focusing contention to the LSN pages. Currently, if ten backends operate on ten different heap pages, for example, they can run in parallel. There will be some contention on the WAL insertions (much less in 9.4 than before). But with this patch, they will all fight for the exclusive lock on the single LSN map page. You'll need to find a way to not update the LSN map on every update. For example, only update the LSN page on the first update after a checkpoint (although that would still have a big contention focusing effect right after a checkpoint). - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json_populate_record issue - TupleDesc reference leak
By the way - this feature is undocumented - I thought the first argument was used only as a type holder and that its value was ignored. It should be documented better - if I understand correctly, it is the base stone for implementing an hstore-like #= operator.

Some nice example:

postgres=# select json_populate_record('(10,20)'::pt, '{"a":30}');
 json_populate_record
-----------------------
 (30,20)
(1 row)

The mentioned bug is a corner case - the bugfix is simple:

diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
new file mode 100644
index a8cdeaa..6e83f78
*** a/src/backend/utils/adt/jsonfuncs.c
--- b/src/backend/utils/adt/jsonfuncs.c
*** populate_record_worker(FunctionCallInfo
*** 2114,2119 ****
--- 2114,2122 ----
      */
     if (hash_get_num_entries(json_hash) == 0 && rec)
     {
+        if (have_record_arg)
+            ReleaseTupleDesc(tupdesc);
+
        hash_destroy(json_hash);
        PG_RETURN_POINTER(rec);
     }

2015-02-23 18:27 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:

Hi

When I tested json_populate_record for an answer to http://stackoverflow.com/questions/7711432/how-to-set-value-of-composite-variable-field-using-dynamic-sql/28673097#28673097 I found a small issue:

create type pt as (a int, b int);

postgres=# select json_populate_record('(10,20)'::pt, '{}');
WARNING:  TupleDesc reference leak: TupleDesc 0x7f10fcf41400 (567018,-1) still referenced
 json_populate_record
-----------------------
 (10,20)
(1 row)

jsonb is ok:

postgres=# select jsonb_populate_record('(10,20)'::pt, '{}');
 jsonb_populate_record
------------------------
 (10,20)
(1 row)

Regards

Pavel
Re: [HACKERS] Abbreviated keys for text cost model fix
On Mon, Feb 23, 2015 at 8:40 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: The durations are much higher than without the single unsorted row added at the end. Queries often take 20x longer to finish (on the same code), depending on the scale. I knew this would happen. :-) What's interesting here is that some queries are much faster, but query #3 is slow until we hit 2M rows: select * from (select * from stuff_int_desc order by randint offset 1000) foo Not sure why that would be. Can you see what a #define DEBUG_ABBREV_KEYS (within varlena.c) build shows happens with the cost model? Enable debug1 output for that. Looking at the previous tests, I see this is exactly what's happening to this query with 'random' dataset - it's slightly slower than master up until 2M rows, when it suddenly jumps to the same speedup as the other queries. Can we do something about that? Possibly. Anyway, I'm wondering what conclusion we can do from this? I believe vast majority of datasets in production won't be perfectly sorted, because when the table is CLUSTERed by index we tend to use index scan to do the sort (so no problem), or the data are not actually perfectly sorted (and here we get significant speedup). One conclusion might be that commit a3f0b3d6 should not have added the pre-check for perfectly sorted input, which is not in the Bentley Mcilroy original - exploiting the fact that the set is already partially sorted is a fine thing, but we're not doing the necessary bookkeeping. But that's not really why I suggested that test case. Rather, I wanted to put the possible regression in the event of perfectly sorted input in perspective. Maybe that regression isn't worth worrying about, if in the event of a single tuple being out of order at the end of the set the outcome dramatically shifts to strongly favor abbreviation. Another way of looking at it would be that the pre-sorted case is an argument against strxfrm() sorting in general more than abbreviated keys in particular, an argument that seems new and strange to me. The analysis of algorithms focuses on the worst case and average case for a reason, and (say) the glibc documentation never considers this when discussing strxfrm(). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
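For anyone wanting to look at the same trace: after rebuilding with DEBUG_ABBREV_KEYS defined in varlena.c, the cost-model output can be made visible from psql with something like the following sketch; the table and query are the ones used later in this thread, and the OFFSET is only there to discard the result rows.

    SET client_min_messages = debug1;   -- surface the DEBUG output in the psql session
    SELECT * FROM (SELECT * FROM stuff_text_asc ORDER BY randtxt OFFSET 1000000) foo;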
Re: [HACKERS] mogrify and indent features for jsonb
On 15 February 2015 at 03:06, Andrew Dunstan and...@dunslane.net wrote: Attached is a patch to provide a number of very useful facilities to jsonb that people have asked for. These are based on work by Dmitry Dolgov in his jsonbx extension, but I take responsibility for any bugs. The facilities are: new operations: concatenation:jsonb || jsonb - jsonb deletion: jsonb - text - jsonb deletion: jsonb - int - text new functions: produce indented text: jsonb_indent(jsonb) - text change an element at a path: jsonb_replace(jsonb, text[], jsonb) - jsonb. It would be relatively trivial to add: delete an element at a path: jsonb_delete(jsonb, text[]) - json Would this support deleting type and the value 'dd' from the following?: {a: 1, b: 2, c: {type: json, stuff: test}, d: [aa,bb,cc,dd]} and I think we should do that for the sake of completeness. The docs might need a little extra work, and the indent code definitely needs work, which I hope to complete in the next day or two, but I wanted to put a stake in the ground. This is high on my wanted list, so thanks for working on this. Seems to work well for me with a few tests. Is there a way to take the json: '{a: 1, b: 2, c: {type: json, stuff: test}, d: [aa,bb,cc,dd]}' and add ee to d without replacing it? I can think of ways of currently doing it, but it's very convoluted just for pushing a value to an array. Also, are there any plans to support the following?: jsonb - text[] # Provide list of keys to delete in array jsonb - jsonb # Deduplicate key:value pairs jsonb jsonb # Return overlapping jsonb (opposite of jsonb - jsonb) Thanks Thom
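For readers following along, a few usage sketches of the operations discussed above; the syntax is as proposed in the patch and in the questions here, none of it is committed behaviour yet, and the commented results are what the descriptions above imply rather than tested output:

    SELECT '{"a": 1}'::jsonb || '{"b": 2}'::jsonb;                       -- concatenation: {"a": 1, "b": 2}
    SELECT '{"a": 1, "b": 2}'::jsonb - 'a';                              -- delete a key: {"b": 2}
    SELECT '["aa", "bb", "cc", "dd"]'::jsonb - 1;                        -- delete one array element by index
    SELECT jsonb_replace('{"a": 1, "b": 2}'::jsonb, '{b}', '3'::jsonb);  -- replace at a path: {"a": 1, "b": 3}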
Re: [HACKERS] Abbreviated keys for text cost model fix
On 23.2.2015 19:22, Peter Geoghegan wrote: On Mon, Feb 23, 2015 at 8:40 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: The durations are much higher than without the single unsorted row added at the end. Queries often take 20x longer to finish (on the same code), depending on the scale. I knew this would happen. :-) What's interesting here is that some queries are much faster, but query #3 is slow until we hit 2M rows: select * from (select * from stuff_int_desc order by randint offset 1000) foo Not sure why that would be. Can you see what a #define DEBUG_ABBREV_KEYS (within varlena.c) build shows happens with the cost model? Enable debug1 output for that. test=# select * from (select * from stuff_text_asc order by randtxt offset 1000) foo; DEBUG: abbrev_distinct after 160: 152.862464 (key_distinct: 155.187096, norm_abbrev_card: 0.955390, prop_card: 0.20) DEBUG: abbrev_distinct after 320: 314.784336 (key_distinct: 313.425344, norm_abbrev_card: 0.983701, prop_card: 0.20) DEBUG: abbrev_distinct after 640: 629.050490 (key_distinct: 643.945298, norm_abbrev_card: 0.982891, prop_card: 0.20) DEBUG: abbrev_distinct after 1280: 1194.271440 (key_distinct: 1257.153875, norm_abbrev_card: 0.933025, prop_card: 0.20) DEBUG: abbrev_distinct after 2560: 2402.820431 (key_distinct: 2631.772790, norm_abbrev_card: 0.938602, prop_card: 0.20) DEBUG: abbrev_distinct after 5120: 4490.142750 (key_distinct: 5261.865309, norm_abbrev_card: 0.876981, prop_card: 0.20) DEBUG: abbrev_distinct after 10240: 8000.884171 (key_distinct: 10123.941172, norm_abbrev_card: 0.781336, prop_card: 0.20) DEBUG: abbrev_distinct after 20480: 13596.546899 (key_distinct: 20563.364696, norm_abbrev_card: 0.663894, prop_card: 0.13) DEBUG: abbrev_distinct after 40960: 23369.899021 (key_distinct: 40500.600680, norm_abbrev_card: 0.570554, prop_card: 0.084500) DEBUG: abbrev_distinct after 81920: 30884.544030 (key_distinct: 81466.848436, norm_abbrev_card: 0.377009, prop_card: 0.054925) randtxt - (0 rows) Looking at the previous tests, I see this is exactly what's happening to this query with 'random' dataset - it's slightly slower than master up until 2M rows, when it suddenly jumps to the same speedup as the other queries. Can we do something about that? Possibly. Anyway, I'm wondering what conclusion we can do from this? I believe vast majority of datasets in production won't be perfectly sorted, because when the table is CLUSTERed by index we tend to use index scan to do the sort (so no problem), or the data are not actually perfectly sorted (and here we get significant speedup). One conclusion might be that commit a3f0b3d6 should not have added the pre-check for perfectly sorted input, which is not in the Bentley Mcilroy original - exploiting the fact that the set is already partially sorted is a fine thing, but we're not doing the necessary Would that pre-check hurt even the random test case? I can imagine it might behave strangely with almost perfectly sorted data (when only the very last row is unsorted), but not for random data (but see below). bookkeeping. But that's not really why I suggested that test case. Rather, I wanted to put the possible regression in the event of perfectly sorted input in perspective. Maybe that regression isn't worth worrying about, if in the event of a single tuple being out of order at the end of the set the outcome dramatically shifts to strongly favor abbreviation. 
Another way of looking at it would be that the pre-sorted case is an argument against strxfrm() sorting in general more than abbreviated keys in particular, an argument that seems new and strange to me. The analysis of algorithms focuses on the worst case and average case for a reason, and (say) the glibc documentation never considers this when discussing strxfrm(). Yeah, every algorithm has a worst case - either you get a very fast execution on average with a few rare cases when it's much slower, or you get very bad average performance. But after looking at the data a bit more, I actually think this is a red herring. After looking at the actual runtimes (and not just speedups), I get this: scalequery# master cost-modelspeedup 10 1884 103 856% 100 1 12153 1506 807% 200 1 26187 3894 673% 300 1 38147 6601 578% 400 1 5633210976 513% 500 1 7700916409 469% 10 2883 110 805% 100 2 12114 1574 770% 200 2 26104 4038 646% 300 2 38028 6828 557% 400 2 5613811294 497% 500 2 7672016829
[HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)
I found that the OBJECT_ATTRIBUTE symbol is useless. I can just remove it and replace it with OBJECT_COLUMN, and everything continues to work; no test fails that I can find. I thought we had a prohibition against ALTER TABLE when used on composites, but it's not as severe as I thought. The following commands fail in master: ALTER TABLE comptype RENAME TO comptype2; -- HINT: Use ALTER TYPE ALTER TABLE comptype SET SCHEMA sch;-- HINT: Use ALTER TYPE However, the following command works in master: ALTER TABLE comptype RENAME COLUMN a TO b; and has the same effect as this: ALTER TYPE comptype RENAME ATTRIBUTE a TO b; The RENAME ATTTRIBUTE case in RenameStmt is the only thing currently using OBJECT_ATTRIBUTE; therefore, since in precisely that case we do not prohibit using ALTER TABLE, we can just remove OBJECT_ATTRIBUTE completely. That leads to the attached patch, which changes no test result at all. This symbol was added in commit e440e12c562432a2a695b8054964fb34e3bd823e Author: Peter Eisentraut pete...@gmx.net Date: Sun Sep 26 14:41:03 2010 +0300 Add ALTER TYPE ... ADD/DROP/ALTER/RENAME ATTRIBUTE Like with tables, this also requires allowing the existence of composite types with zero attributes. reviewed by KaiGai Kohei Thoughts? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c index d899dd7..1da38c0 100644 --- a/src/backend/catalog/objectaddress.c +++ b/src/backend/catalog/objectaddress.c @@ -1591,7 +1591,6 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address, break; case OBJECT_TYPE: case OBJECT_DOMAIN: - case OBJECT_ATTRIBUTE: case OBJECT_DOMCONSTRAINT: if (!pg_type_ownercheck(address.objectId, roleid)) aclcheck_error_type(ACLCHECK_NOT_OWNER, address.objectId); diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c index 78b54b4..7f85fa4 100644 --- a/src/backend/commands/alter.c +++ b/src/backend/commands/alter.c @@ -330,7 +330,6 @@ ExecRenameStmt(RenameStmt *stmt) return RenameRelation(stmt); case OBJECT_COLUMN: - case OBJECT_ATTRIBUTE: return renameatt(stmt); case OBJECT_RULE: diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c index dcf5b98..d14346f 100644 --- a/src/backend/commands/event_trigger.c +++ b/src/backend/commands/event_trigger.c @@ -1056,7 +1056,6 @@ EventTriggerSupportsObjectType(ObjectType obtype) /* no support for event triggers on event triggers */ return false; case OBJECT_AGGREGATE: - case OBJECT_ATTRIBUTE: case OBJECT_CAST: case OBJECT_COLUMN: case OBJECT_COLLATION: diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 6c21002..8ad933d 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -7848,7 +7848,7 @@ RenameStmt: ALTER AGGREGATE func_name aggr_args RENAME TO name | ALTER TYPE_P any_name RENAME ATTRIBUTE name TO name opt_drop_behavior { RenameStmt *n = makeNode(RenameStmt); - n-renameType = OBJECT_ATTRIBUTE; + n-renameType = OBJECT_COLUMN; n-relationType = OBJECT_TYPE; n-relation = makeRangeVarFromAnyName($3, @3, yyscanner); n-subname = $6; diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 6d26986..2b6fc3e 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -1615,9 +1615,6 @@ AlterObjectTypeCommandTag(ObjectType objtype) case OBJECT_AGGREGATE: tag = ALTER AGGREGATE; break; - case OBJECT_ATTRIBUTE: - tag = ALTER TYPE; - break; case OBJECT_CAST: 
tag = ALTER CAST; break; diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index ac13302..09607eb 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -1212,7 +1212,6 @@ typedef struct SetOperationStmt typedef enum ObjectType { OBJECT_AGGREGATE, - OBJECT_ATTRIBUTE, /* type's attribute, when distinct from column */ OBJECT_CAST, OBJECT_COLUMN, OBJECT_COLLATION, -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Precedence of NOT LIKE, NOT BETWEEN, etc
While fooling around with testing operator precedence warnings, I discovered that there's some existing precedence behavior that's not at all what I expected. Consider these examples (all done in 9.4; this is longstanding behavior): regression=# SELECT 1 '(3,4)'::point LIKE 42.0; ERROR: operator does not exist: point ~~ numeric LINE 1: SELECT 1 '(3,4)'::point LIKE 42.0; ^ regression=# SELECT 1 '(3,4)'::point NOT LIKE 42.0; ERROR: operator does not exist: integer point LINE 1: SELECT 1 '(3,4)'::point NOT LIKE 42.0; ^ regression=# SELECT 1 LIKE '(3,4)'::point 42.0; ERROR: operator does not exist: integer ~~ point LINE 1: SELECT 1 LIKE '(3,4)'::point 42.0; ^ regression=# SELECT 1 NOT LIKE '(3,4)'::point 42.0; ERROR: operator does not exist: integer !~~ point LINE 1: SELECT 1 NOT LIKE '(3,4)'::point 42.0; ^ These queries are silly of course, the point is just to expose which operator the parser thinks binds more tightly. And what we have above is: LIKE binds more tightly than , in either combination. NOT LIKE binds less tightly than if is on the left, but more tightly than if is on the right. This seems like a bug; it's certainly not what you'd call intuitive. The cause, if I'm understanding the Bison manual correctly, is that the NOT token has lower precedence than '', so at the critical point where it has scanned 1 '(3,4)'::point and has to either reduce that (thus binding more tightly) or shift, it looks at the lookahead token which is either LIKE or NOT in the first two examples above. LIKE has higher priority than '' so in the first example it shifts; but NOT has lower priority than '' so in the second example it reduces. The other productions starting with a_expr NOT ... also have the same problem; that includes NOT ILIKE, NOT SIMILAR TO, NOT BETWEEN, and NOT IN. I'm not seeing any terribly pleasing ways to fix this. Aside from the option of doing nothing, it seems like these are the choices: 1. We could hack base_yylex() to reduce NOT LIKE to a single token which could be given the same precedence as LIKE. Ditto for the other four cases. This would result in nice behavior for these cases, but as Robert has correctly pointed out, hacks in base_yylex() risk creating other poorly-understood behaviors. 2. We could change the precedence levels so that LIKE/ILIKE/SIMILAR, BETWEEN, and IN all have precedence just above NOT, which would pretty much mask the problem since there would be no other tokens with in-between precedence levels. In the context of the operator precedence changes I proposed earlier, this would mean inserting the IS tests and comparison operators between IN_P and POSTFIXOP rather than where the single-character comparison ops live now. We would likely also have to flatten LIKE/BETWEEN/IN into a single %nonassoc precedence level to avoid having weird interactions among them. This isn't terribly attractive because it would risk larger behavioral changes than the previous proposal. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reduce pinning in btree indexes
On 02/15/2015 02:19 AM, Kevin Grittner wrote: Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The first question is: Do we really need that interlock for the non-MVCC snapshots either? If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does. I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me. Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Or you could use GetFakeLSNForUnloggedRel(). Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? Finally, there was an optimization for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the optimization adds overhead to every move to the next page in order set some variables in a structure when a mark is requested instead of running two fairly small memcpy() statements. The two-day benchmark of the customer showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious. Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors. Understood. Still, I'd like to see if we can easily get rid of the pinning altogether. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Precedence of NOT LIKE, NOT BETWEEN, etc
I wrote: I'm not seeing any terribly pleasing ways to fix this. Aside from the option of doing nothing, it seems like these are the choices: 1. We could hack base_yylex() to reduce NOT LIKE to a single token which could be given the same precedence as LIKE. Ditto for the other four cases. This would result in nice behavior for these cases, but as Robert has correctly pointed out, hacks in base_yylex() risk creating other poorly-understood behaviors. 2. We could change the precedence levels so that LIKE/ILIKE/SIMILAR, BETWEEN, and IN all have precedence just above NOT, which would pretty much mask the problem since there would be no other tokens with in-between precedence levels. In the context of the operator precedence changes I proposed earlier, this would mean inserting the IS tests and comparison operators between IN_P and POSTFIXOP rather than where the single-character comparison ops live now. We would likely also have to flatten LIKE/BETWEEN/IN into a single %nonassoc precedence level to avoid having weird interactions among them. This isn't terribly attractive because it would risk larger behavioral changes than the previous proposal. I thought of another possibility: 3. Leave everything as-is but mark the NOT-operator productions as having the precedence of NOT rather than of LIKE etc. This would change the behavior only for the NOT-LIKE-followed-by- example, and would make the two cases for NOT LIKE consistent though they'd remain inconsistent with LIKE. This behavior seems at least somewhat explainable/documentable (NOT-foo operators have the precedence of NOT), whereas what we have seems about impossible to justify. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello, Attached is a patch which has following changes, As suggested above block ID in xlog structs has been replaced by chunk ID. Chunk ID is used to distinguish between different types of xlog record fragments. Like, XLR_CHUNK_ID_DATA_SHORT XLR_CHUNK_ID_DATA_LONG XLR_CHUNK_BKP_COMPRESSED XLR_CHUNK_BKP_WITH_HOLE In block references, block ID follows the chunk ID. Here block ID retains its functionality. This approach increases data by 1 byte for each block reference in an xlog record. This approach separates ID referring different fragments of xlog record from the actual block ID which is used to refer block references in xlog record. Following are WAL numbers for each scenario, WAL FPW compression on 121.652 MB FPW compression off 148.998 MB HEAD 148.764 MB Compression remains nearly same as before. There is some difference in WAL between HEAD and HEAD+patch+compression OFF. This difference corresponds to 1 byte increase with each block reference of xlog record. Thank you, Rahila Syed On Wed, Feb 18, 2015 at 7:53 PM, Syed, Rahila rahila.s...@nttdata.com wrote: Hello, I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the block id but something like XLR_CHUNK_ID. Which is used as is for XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED, XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the block id following the chunk id. Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we will be happy about the added extensibility pretty soon. To clarify my understanding of the above change, Instead of a block id to reference different fragments of an xlog record , a single byte field chunk_id should be used. chunk_id will be same as XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments. But for block references, it will take store following values in order to store information about the backup blocks. #define XLR_CHUNK_BKP_COMPRESSED 0x01 #define XLR_CHUNK_BKP_WITH_HOLE 0x02 ... The new xlog format should look like follows, Fixed-size header (XLogRecord struct) Chunk_id(add a field before id field in XLogRecordBlockHeader struct) XLogRecordBlockHeader Chunk_id XLogRecordBlockHeader ... ... Chunk_id ( rename id field of the XLogRecordDataHeader struct) XLogRecordDataHeader[Short|Long] block data block data ... main data I will post a patch based on this. Thank you, Rahila Syed -Original Message- From: Andres Freund [mailto:and...@2ndquadrant.com] Sent: Monday, February 16, 2015 5:26 PM To: Syed, Rahila Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes On 2015-02-16 11:30:20 +, Syed, Rahila wrote: - * As a trivial form of data compression, the XLOG code is aware that - * PG data pages usually contain an unused hole in the middle, which - * contains only zero bytes. If hole_length 0 then we have removed - * such a hole from the stored data (and it's not counted in the - * XLOG record's CRC, either). Hence, the amount of block data actually - * present is BLCKSZ - hole_length bytes. + * Block images are able to do several types of compression: + * - When wal_compression is off, as a trivial form of compression, + the + * XLOG code is aware that PG data pages usually contain an unused hole + * in the middle, which contains only zero bytes. 
If length BLCKSZ + * then we have removed such a hole from the stored data (and it is + * not counted in the XLOG record's CRC, either). Hence, the amount + * of block data actually present is length bytes. The hole offset + * on page is defined using hole_offset. + * - When wal_compression is on, block images are compressed using a + * compression algorithm without their hole to improve compression + * process of the page. length corresponds in this case to the + length + * of the compressed block. hole_offset is the hole offset of the + page, + * and the length of the uncompressed block is defined by + raw_length, + * whose data is included in the record only when compression is + enabled + * and with_hole is set to true, see below. + * + * is_compressed is used to identify if a given block image is + compressed + * or not. Maximum page size allowed on the system being 32k, the + hole + * offset cannot be more than 15-bit long so the last free bit is + used to + * store the compression state of block image. If the maximum page + size + * allowed is increased to a value higher than that, we should + consider + * increasing this structure size as well,
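For context, the three WAL totals reported earlier in this message correspond roughly to toggling the GUC this patch series introduces; a sketch, assuming the wal_compression name used in the quoted comments and that the setting is reachable with SET (both of which may change before commit):

    SET wal_compression = on;    -- FPW compression on:  ~121.7 MB of WAL in this test
    SET wal_compression = off;   -- FPW compression off: ~149.0 MB of WAL in this test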
Re: [HACKERS] pg_dump gets attributes from tables in extensions
Thanks to the easy, handy testcase, I was able to replicate the test scenario in my local environment. And yes, adding the tbinfo->dobj.ext_member check to getTableAttrs() does fix the issue. Looking further into the pg_dump code, what I found is that PG generally doesn't use a tbinfo->dobj.ext_member check to ignore an object. Mostly we do this kind of check using tbinfo->dobj.dump (look at dumpTable() for reference). Do you have any particular reason for choosing dobj.ext_member over dobj.dump?

On Fri, Feb 20, 2015 at 12:20 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 20, 2015 at 5:33 AM, Peter Eisentraut pete...@gmx.net wrote: On 2/16/15 2:45 AM, Michael Paquier wrote: While looking at the patch to fix pg_dump with extensions containing tables referencing each other, I got surprised by the fact that getTableAttrs tries to dump table attributes even for tables that are part of an extension. Is that normal? Attached is a patch that I think makes things right, by not dumping any tables that are part of ext_member. Can you provide an example/test case? (e.g., which publicly available extension contains tables with attributes?) Sure. Attached is a simplified version of the extension I used for the other patch on pg_dump.

$ psql -c 'create extension dump_test'
CREATE EXTENSION
$ psql -At -c '\dx+ dump_test'
table aa_tab_fkey
table bb_tab_fkey
$ pg_dump -v 2>&1 | grep 'columns and types'
pg_dump: finding the columns and types of table public.bb_tab_fkey
pg_dump: finding the columns and types of table public.aa_tab_fkey

-- Michael

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers

-- Rushabh Lathia
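For anyone trying to reproduce this without the attached tarball, here is a rough sketch of what the dump_test extension script presumably contains, based on the table names shown above; the FK direction and the pg_extension_config_dump() calls are assumptions:

    -- dump_test--1.0.sql (sketch)
    CREATE TABLE aa_tab_fkey (id int PRIMARY KEY);
    CREATE TABLE bb_tab_fkey (id int REFERENCES aa_tab_fkey (id));
    -- mark both tables so pg_dump includes their data
    SELECT pg_catalog.pg_extension_config_dump('aa_tab_fkey', '');
    SELECT pg_catalog.pg_extension_config_dump('bb_tab_fkey', '');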
Re: [HACKERS] __attribute__ for non-gcc compilers
23.02.2015, 04:31, Robert Haas kirjoitti: On Tue, Feb 17, 2015 at 8:41 AM, Oskari Saarenmaa o...@ohmu.fi wrote: 15.01.2015, 21:58, Robert Haas kirjoitti: On Wed, Jan 14, 2015 at 5:54 PM, Andres Freund and...@2ndquadrant.com wrote: I think I'd for now simply not define pg_attribute_aligned() on platforms where it's not supported, instead of defining it empty. If we need a softer variant we can name it pg_attribute_aligned_if_possible or something. Sounds sane? Yes, that sounds like a much better plan. Attached an updated patch rebased on today's git master that never defines aligned or packed empty. This is also included in the current commitfest, https://commitfest.postgresql.org/4/115/ Is this going to play nicely with pgindent? I ran pgindent on the tree with this patch applied (with a few changes, pgindent modified atomics headers a bit) and the changes looked ok to me, mostly pgindent just rewrapped lines like this: -extern void quickdie(SIGNAL_ARGS) pg_attribute_noreturn; +extern void +quickdie(SIGNAL_ARGS) pg_attribute_noreturn; but there were two cases where it produced a bit weird indentation: #ifdef __arm__ -pg_attribute_packed/* Appropriate whack upside the head for ARM */ + pg_attribute_packed /* Appropriate whack upside the head for ARM */ #endif ItemPointerData; and void -pg_attribute_noreturn + pg_attribute_noreturn plpgsql_yyerror(const char *message) { / Oskari -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] deparsing utility commands
Andres, * Andres Freund (and...@2ndquadrant.com) wrote: On 2015-02-21 14:51:32 -0500, Stephen Frost wrote: It'd be *really* nice to be able to pass an object identifier to some function and get back the CREATE (in particular, though perhaps DROP, or whatever) command for it. This gets us *awful* close to that without actually giving it to us and that's bugging me. The issue is the parsetree, which I understand may be required in some cases (ALTER, primairly, it seems?), but isn't always. The USING and WITH CHECK quals can be extracted from the polForm also, of course, it's what psql and pg_dump are doing, after all. So, why depend on the parsetree? What about have another argument which a user could use without the parsetree, for things that it's absolutely required for today (eg: ALTER sub-command differentiation)? I'm really not wild about pretty massively expanding the scope at the eleventh hour. There's a fair number of commands where this the deparsing command will give you a bunch of commands - look at CREATE TABLE and CREATE SCHEMA ... for the most extreme examples. For those there's no chance to do that without the parse tree available. For my 2c, I don't agree with either of your assumptions above. I don't see this as a massive expansion of the scope, nor that we're in the 11th hour at this point. Agreed, it's the last commitfest, but we're near the beginning of it and Alvaro has already made modifications to the patch set, as is generally expected to happen for any patch in a commitfest, without much trouble. The changes which I'm suggesting are nearly trivial for the larger body of work and only slightly more than trivial for the more complicated pieces. Yes, it might be possible to use the same code for a bunch of minor commands, but not for the interesting/complex stuff. We can clearly rebuild at least CREATE commands for all objects without access to the parse tree, obviously pg_dump manages somehow. I didn't specify that a single command had to be used. Further, the actual deparse_CreateStmt doesn't look like it'd have a terribly hard time producing something without access to the parsetree. The whole premise of this approach is that we're using the results of the catalog changes as the basis for what's needed to recreate the command. The places where we're referring to the parsetree look more like a crutch of convenience than for any particular good reason to. Consider this part of 0011 which includes deparse_CreateStmt: /* * Process table elements: column definitions and constraints. Only * the column definitions are obtained from the parse node itself. To * get constraints we rely on pg_constraint, because the parse node * might be missing some things such as the name of the constraints. */ and then looking down through to deparse_ColumnDef, we see that the ColumnDef passed in is used almost exclusivly for the *column name* (which is used to look up the entry in pg_attribute..). Looping over the pg_attribute entries for the table in the attnum order would certainly return the same result, or something has gone wrong. Beyond the inheiritance consideration, the only other reference actually leads to what you argued (correctly, in my view) was a mistake in the code- where a NOT NULL is omitted for the primary key case. Weren't you also complaining about the 'OF type' form being a potential issue? That'd also go away with the approach I'm advocating. 
In short, I suspect that if this approach had been taken originally, at least some of the concerns and issues levied against the current implementation wouldn't exist. Thankfully, the way the code has been developed, the majority of the code is general infrastructure and the changes I'm suggesting are all in code which is simpler, thanks to that infrastructure already being there. All I'm suggesting is that we focus on collecting the information from the catalog and avoid using the parsetree. For at least the CREATE and DROP cases, that should be entirely possible. This does end up being a bit different from the original goal, which was closer to reproducing exactly the command that was specified, but as shown above, that probably wasn't what we ever really wanted. The original command is ambiguous on a number of levels and even where it isn't we can get the canonical information we need straight from the catalog. Maybe that's possible and maybe it isn't, but at least for the CREATE cases we should be able to avoid forcing a user to provide a built parsetree, and that'd be *really* nice and open up this feature to quite a few other use-cases that I can think of. I'd further suggest that we provide these commands at the SQL level, and then have a wrapper which will take the name of an object, resolve it to an Oid, and then pass back the CREATE command for it. I think this is a different feature that might end up sharing some of the infrastructure, but not more.
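As a purely illustrative sketch of the SQL-level wrapper suggested in the preceding paragraph; the function name and signature below are hypothetical, invented only to show the shape of the interface, and nothing like it exists in the tree today:

    -- hypothetical interface: resolve an object by type and name, return its CREATE command
    SELECT pg_get_object_create_command('table', 'public.accounts');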
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Mon, Feb 23, 2015 at 8:57 PM, Fujii Masao masao.fu...@gmail.com wrote: So I pushed the patch. Thank you. -- Michael
Re: [HACKERS] Bug in pg_dump
On Tue, Feb 24, 2015 at 2:17 AM, Gilles Darold gilles.dar...@dalibo.com wrote: Looks great to me. I have tested with the postgis_topology extension and everything works fine. Actually, after looking more in depth at the internals of pg_dump, I think that both patches are wrong (did that yesterday night for another patch). First, this patch marks a table in an extension as always dumpable:

+ /* Mark member table as dumpable */
+ configtbl->dobj.dump = true;

And then many checks on ext_member are added in many code paths to ensure that objects in extensions never have their definition dumped. But actually this assumption is not true all the time, per this code in getExtensionMembership:

if (!dopt->binary_upgrade)
    dobj->dump = false;
else
    dobj->dump = refdobj->dump;

So this patch would break binary upgrade, where some extension objects should be dumped (one reason why I haven't noticed that before is that pg_upgrade tests do not include extensions). Hence, one idea coming to my mind to fix the problem would be to add some extra processing directly in getExtensionMembership(), after building the DO_TABLE_DATA objects with makeTableDataInfo(), by checking the FK dependencies and adding a dependency link with addObjectDependency(). The good part with that is that even user tables that reference extension tables with a FK can be restored correctly, because their constraint is added *after* loading the data. I noticed as well that with this patch the --data-only mode was dumping tables in the correct order. Speaking of which, patches implementing this idea are attached. The module test case has been improved as well, with a check using a table not in an extension linked with a FK to a table in an extension. -- Michael

From 7fb84d68df023d913c448f2498987ca4f0a70595 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 17 Feb 2015 07:39:23 +
Subject: [PATCH 1/3] Fix ordering of tables part of extensions linked with constraints

Additional checks are done on FK constraints that may link extension objects
together, and data dump ordering is ensured. Note that this does not take
into account foreign keys of tables that are not part of an extension linking
to an extension table.
---
 src/bin/pg_dump/pg_dump.c | 56 ++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2b53c72..5b1b240 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -15236,7 +15236,8 @@ dumpRule(Archive *fout, DumpOptions *dopt, RuleInfo *rinfo)
 }
 
 /*
- * getExtensionMembership --- obtain extension membership data
+ * getExtensionMembership --- obtain extension membership data and check FK
+ * dependencies among extension tables.
  */
 void
 getExtensionMembership(Archive *fout, DumpOptions *dopt, ExtensionInfo extinfo[],
@@ -15423,6 +15424,59 @@ getExtensionMembership(Archive *fout, DumpOptions *dopt, ExtensionInfo extinfo[]
 			}
 		}
 	}
+
+	/*
+	 * Now that all the TableInfoData objects have been created for
+	 * all the extensions, check their FK dependencies and register
+	 * them to ensure correct data ordering. Note that this is not
+	 * a problem for user tables not included in an extension
+	 * referencing with a FK tables in extensions, as their constraint
+	 * is declared after dumping their data. In --data-only mode the
+	 * table ordering is ensured as well thanks to
+	 * getTableDataFKConstraints().
+	 */
+	for (j = 0; j < nconfigitems; j++)
+	{
+		int			i_confrelid, k;
+		PQExpBuffer	query2 = createPQExpBuffer();
+		TableInfo  *configtbl;
+		Oid			configtbloid = atooid(extconfigarray[j]);
+
+		configtbl = findTableByOid(configtbloid);
+		if (configtbl == NULL || configtbl->dataObj == NULL)
+			continue;
+
+		appendPQExpBuffer(query2,
+						  "SELECT confrelid "
+						  "FROM pg_catalog.pg_constraint "
+						  "WHERE conrelid = '%u' "
+						  "AND contype = 'f'",
+						  configtbloid);
+
+		res = ExecuteSqlQuery(fout, query2->data, PGRES_TUPLES_OK);
+		ntups = PQntuples(res);
+		i_confrelid = PQfnumber(res, "confrelid");
+
+		for (k = 0; k < ntups; k++)
+		{
+			Oid			confrelid;
+			TableInfo  *reftable;
+
+			confrelid = atooid(PQgetvalue(res, k, i_confrelid));
+			reftable = findTableByOid(confrelid);
+
+			if (reftable == NULL || reftable->dataObj == NULL)
+				continue;
+
+			/*
+			 * Make referencing TABLE_DATA object depend on the
+			 * referenced table's TABLE_DATA object.
+			 */
+			addObjectDependency(&configtbl->dataObj->dobj,
+								reftable->dataObj->dobj.dumpId);
+		}
+		resetPQExpBuffer(query2);
+	}
 	}
 
 	if (extconfigarray)
 		free(extconfigarray);
-- 
2.3.0

From ad5cd360243b44e735195c5e94df3c21e8f18e07 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 17 Feb 2015 22:29:28 +0900
Subject: [PATCH 2/3] Make
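To make the scenario concrete, here is a minimal SQL sketch (table names are placeholders) of the situation the patch addresses, together with the catalog lookup it relies on:

    -- two tables in the same extension, linked by a FK; their data must be
    -- dumped parent-first or restoring the data will violate the constraint
    CREATE TABLE bb_tab_fkey (id int PRIMARY KEY);
    CREATE TABLE aa_tab_fkey (id int REFERENCES bb_tab_fkey(id));

    -- the lookup the patch performs for each extension config table:
    -- which other tables does it reference via foreign keys?
    SELECT confrelid::regclass
      FROM pg_catalog.pg_constraint
     WHERE conrelid = 'aa_tab_fkey'::regclass
       AND contype = 'f';

Each referenced table found this way has its TABLE_DATA object registered as a dependency, so the dump emits the referenced table's data first.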
Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)
Please see check_object_ownership(). It checks relation's ownership if OBJECT_COLUMN, however, type's ownership is the correct check if OBJECT_ATTRIBUTE. -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Alvaro Herrera Sent: Tuesday, February 24, 2015 4:02 AM To: Pg Hackers; Peter Eisentraut Subject: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites) I found that the OBJECT_ATTRIBUTE symbol is useless. I can just remove it and replace it with OBJECT_COLUMN, and everything continues to work; no test fails that I can find. I thought we had a prohibition against ALTER TABLE when used on composites, but it's not as severe as I thought. The following commands fail in master: ALTER TABLE comptype RENAME TO comptype2; -- HINT: Use ALTER TYPE ALTER TABLE comptype SET SCHEMA sch; -- HINT: Use ALTER TYPE However, the following command works in master: ALTER TABLE comptype RENAME COLUMN a TO b; and has the same effect as this: ALTER TYPE comptype RENAME ATTRIBUTE a TO b; The RENAME ATTTRIBUTE case in RenameStmt is the only thing currently using OBJECT_ATTRIBUTE; therefore, since in precisely that case we do not prohibit using ALTER TABLE, we can just remove OBJECT_ATTRIBUTE completely. That leads to the attached patch, which changes no test result at all. This symbol was added in commit e440e12c562432a2a695b8054964fb34e3bd823e Author: Peter Eisentraut pete...@gmx.net Date: Sun Sep 26 14:41:03 2010 +0300 Add ALTER TYPE ... ADD/DROP/ALTER/RENAME ATTRIBUTE Like with tables, this also requires allowing the existence of composite types with zero attributes. reviewed by KaiGai Kohei Thoughts? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_check_dir comments and implementation mismatch
On Sun, Feb 22, 2015 at 07:57:41PM -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Feb 20, 2015 at 12:59 AM, Noah Misch n...@leadboat.com wrote: On Mon, Feb 02, 2015 at 03:48:33PM -0500, Robert Haas wrote: If readir() fails and closedir() succeeds, the return will be -1 but errno will be 0. Out of curiosity, have you seen a closedir() implementation behave that way? It would violate C99 (The value of errno is zero at program startup, but is never set to zero by any library function.) and POSIX. No. Good point, I didn't think about that. I think this way is safer, though. While the spec forbids library functions from setting errno to zero, there is no restriction on them changing errno in other ways despite returning success; their exit-time value of errno is only well-defined if they fail. So we do need to preserve errno explicitly across closedir(), or we may report the wrong failure from readdir(). Yes. I'm happy with the commit itself. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] deparsing utility commands
Hi, On 2015-02-23 19:48:43 -0500, Stephen Frost wrote: Yes, it might be possible to use the same code for a bunch of minor commands, but not for the interesting/complex stuff. We can clearly rebuild at least CREATE commands for all objects without access to the parse tree, obviously pg_dump manages somehow. That's not the same. pg_dump recreates a static state, not a running log of changes that can be replayed. I didn't specify that a single command had to be used. Well, if you want to get 'the create table statement' you'll need to decide what you want. With ddl event triggers it's clear - something that, when replayed, results in the same catalog contents. Such an easy definition doesn't, as far as I know, exist for the set of functions you seem to imagine. Further, the actual deparse_CreateStmt doesn't look like it'd have a terribly hard time producing something without access to the parsetree. The interesting bit isn't the deparse_CreateStmt itself, but all the stuff that's also triggered when you do a CREATE TABLE nontrivial-stuff;. utility.c will first transform the statement and then run and stash every created subcommand. And the subcommands will *also* get deparsed. Just think about what to do about CREATE TABLE foo(id serial primary key, data text, bar_id REFERENCES foo.bar); - there's no way you can get which other objects to dump from the catalog alone. What about a schema, considering CREATE SCHEMA ... (schema_elements)+;? Sure, individual subcommands could refer to the catalog instead of the parse tree. But to get the whole thing you can't easily just refer to it. All I'm suggesting is that we focus on collecting the information from the catalog and avoid using the parsetree. For at least the CREATE and DROP cases, that should be entirely possible. DROP's already in 9.4; the additions in 9.5 were more or less usability things. The problem generating DROPs right now is more about identifying which one you want to drop and checking the dependencies - the latter I'm not sure how to do without actually executing the DROP. Maybe that's possible and maybe it isn't, but at least for the CREATE cases we should be able to avoid forcing a user to provide a built parsetree, and that'd be *really* nice and open up this feature to quite a few other use-cases that I can think of. I'd further suggest that we provide these commands at the SQL level, and then have a wrapper which will take the name of an object, resolve it to an Oid, and then pass back the CREATE command for it. I think this is a different feature that might end up sharing some of the infrastructure, but not more. I know you've looked through this code also and I really don't get the comment that only some of this infrastructure would be shared. As I tried to point out, for the 'CREATE POLICY' case, a few minor modifications would have it provide exactly what I'm suggesting, and I'm sure that most of the cases would be similar. Simply looking through the code with an eye towards avoiding the parsetree in favor of pulling information from the catalog would illustrate the point I'm making, I believe. I've no problem at all changing CREATE POLICY (and some other) deparse functions to look more at the catalog than the command. What I don't see is changing all of the create deparse functions, guaranteeing that they are usable for getting the DDL of catalog objects, and providing SQL-accessible infrastructure for that. That's a *massive* undertaking. 
What I mean with 'sharing some of the infrastructure' is that I can see a good portion of the deparse_* functions being reused for what you desire. But the decision about which of those to call will be an entirely separate project. So is the whole new infrastructure it'll need to handle locking and visibility (all the deparse stuff uses continually evolving catalog snapshots!), as that's a problem the event trigger stuff has much less need to care about, because the objects are new rows and thus can't be updated by other backends. I don't think it's all that related, so I'm not surprised. If you execute a CREATE TABLE foo(id serial primary key); you'll get a bunch of commands with this facility - it'd be much less clear what exactly shall be dumped in the case of some hypothetical GUI when you want to dump the table. I really don't think it's all that strange or complicated to consider and we've got a rather nice example of what a good approach would be. Right. We got a *massive* program that solves dependencies and doesn't do all that many useful/correct things if you only let it dump individual objects. And that dumping of individual objects is what you want... ;) Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
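To spell out the subcommand problem with a concrete example: a single utility statement can imply several other objects, and the catalog alone does not say which of them were created by that particular statement. Roughly, and only as a sketch of the usual serial expansion:

    CREATE TABLE foo (id serial PRIMARY KEY, data text);

    -- behaves approximately like this sequence of commands:
    CREATE SEQUENCE foo_id_seq;
    CREATE TABLE foo (
        id   integer NOT NULL DEFAULT nextval('foo_id_seq'),
        data text
    );
    ALTER SEQUENCE foo_id_seq OWNED BY foo.id;
    ALTER TABLE foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);  -- also creates the foo_pkey index

An event trigger based deparser sees each of these subcommands as it is executed; a catalog-only deparser has no comparable way to know that the sequence and the index belong to this one CREATE TABLE.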
Re: [HACKERS] Abbreviated keys for text cost model fix
On 24.2.2015 01:44, Peter Geoghegan wrote: On Mon, Feb 23, 2015 at 4:41 PM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Are you going to add this into the CF? Would be nice to get this into 9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider it part of one of the existing sortsupport patches. It's a bugfix, IMV. I guess I should add it to the current CF, but I think I might be forbidden from doing so as a non-CF-manager... Yes, I consider that a bugfix too, fixing a performance regression. regards -- Tomas Vondrahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] How about to have relnamespace and relrole?
Hello, At Thu, 19 Feb 2015 15:30:53 -0500, Peter Eisentraut pete...@gmx.net wrote in 54e647fd.5000...@gmx.net On 2/18/15 3:44 AM, Kyotaro HORIGUCHI wrote: Hello, this is the patchset v2 of this feature. 0001-Add-regrole.patch 0002-Add-regnamespace.patch 0003-Check-new-reg-types-are-not-used-as-default-values.patch 0004-Documentation-for-new-OID-alias-types.patch Somehow, these patches ended up in the commit fest without an author listed. That should probably not be possible. Mmm.. I saw the author for this is listed here, https://commitfest.postgresql.org/4/ Did you fix that manually for me? -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
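For readers who have not followed the thread, a short sketch of what the proposed OID alias types would allow once committed, by analogy with the existing regclass/regtype:

    -- instead of joining against pg_namespace / pg_authid by hand:
    SELECT relname
      FROM pg_class
     WHERE relnamespace = 'pg_catalog'::regnamespace;

    SELECT datname
      FROM pg_database
     WHERE datdba = 'postgres'::regrole;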
Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)
Kouhei Kaigai wrote: Please see check_object_ownership(). It checks relation's ownership if OBJECT_COLUMN, however, type's ownership is the correct check if OBJECT_ATTRIBUTE. Hmm. Is there any case where the two are different? AlterObjectTypeCommandTag()? OBJECT_ATTRIBUTE makes ALTER TYPE tag, but ALTER COLUMN tag is made with OBJECT_COLUMN. Above two cases are all I could found. Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Abbreviated keys for text cost model fix
On Mon, Feb 23, 2015 at 4:41 PM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Are you going to add this into the CF? Would be nice to get this into 9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider it part of one of the existing sortsupport patches. It's a bugfix, IMV. I guess I should add it to the current CF, but I think I might be forbidden from doing so as a non-CF-manager... -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] How about to have relnamespace and relrole?
On Tue, Feb 24, 2015 at 11:35 AM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hello, At Thu, 19 Feb 2015 15:30:53 -0500, Peter Eisentraut pete...@gmx.net wrote in 54e647fd.5000...@gmx.net On 2/18/15 3:44 AM, Kyotaro HORIGUCHI wrote: Hello, this is the patchset v2 of this feature. 0001-Add-regrole.patch 0002-Add-regnamespace.patch 0003-Check-new-reg-types-are-not-used-as-default-values.patch 0004-Documentation-for-new-OID-alias-types.patch Somehow, these patches ended up in the commit fest without an author listed. That should probably not be possible. Mmm.. I saw the author for this is listed here, https://commitfest.postgresql.org/4/ Did you fix that manually for me? Looking at the log entry: 2015-02-19 21:11:08 Michael Paquier (michael-kun) Changed authors to Kyotaro Horiguchi (horiguti) I do a pass from time to on the patches... -- Michael
Re: [HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code
On 2/23/15 6:55 PM, Michael Paquier wrote: Hi all, This is a purely informational email... I have put my hands on a Raspberry PI 2, and I have found that it takes 6 minutes to compile the backend code using the 4 cores of the ARMv7 processor, and close to 20 minutes on a single core (without ccache). The test has been done using ArchLinux ARM with a micro SD card of class 10 (said to be able to manage 50MB/s in write, 60MB/s in read). In comparison, the buildfarm machine hamster, which is a Rasp PI 1, takes close to 2 hours to do the same operation, using the same OS and a similar class 10 card (30MB/s but the I/O is not the bottleneck). Nice! Looks like Moore's Law works even on tight budgets. -- - David Steele da...@pgmasters.net signature.asc Description: OpenPGP digital signature
Re: [HACKERS] pg_dump gets attributes from tables in extensions
On Mon, Feb 23, 2015 at 7:45 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Feb 23, 2015 at 5:57 PM, Rushabh Lathia rushabh.lat...@gmail.com wrote: Thanks to the easy, handy testcase, I was able to replicate the test scenario on my local environment. And yes, the tbinfo->dobj.ext_member check in getTableAttrs() does fix the issue. Looking more into the pg_dump code, what I found is that PG generally doesn't use a tbinfo->dobj.ext_member check to ignore an object. Mostly we do this kind of check using tbinfo->dobj.dump (look at dumpTable() for reference). Do you have any particular reason for choosing dobj.ext_member over dobj.dump? Hm. I have been pondering about this code more and I am dropping the patch, as adding a check based on either the flag to track dumpable objects or ext_member visibly breaks the logic that binary upgrade and tables in extensions rely on. Particularly this portion in getExtensionMembership() makes me think so:

/*
 * Normally, mark the member object as not to be dumped. But in
 * binary upgrades, we still dump the members individually, since the
 * idea is to exactly reproduce the database contents rather than
 * replace the extension contents with something different.
 */
if (!dopt->binary_upgrade)
    dobj->dump = false;
else
    dobj->dump = refdobj->dump;

OK. Looking at the above code in getExtensionMembership(), it seems like the fix you suggested is not correct. I added a new table with a DEFAULT attribute into the dump_test extension, and pg_dump with --binary-upgrade is failing with: pg_dump: invalid column number 1 for table bb_tab_fkey.

rushabh@rushabh-centos-vm:dump_test$ cat dump_test--1.0.sql
/* dump_test/dump_test--1.0.sql */
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION dump_test" to load this file. \quit
CREATE TABLE IF NOT EXISTS bb_tab_fkey (
  id int PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS aa_tab_fkey (
  id int REFERENCES bb_tab_fkey(id)
);
CREATE TABLE IF NOT EXISTS foo (
  a int default 100
);

This gives another strong reason to add an if (!tbinfo->dobj.dump) check rather than an ext_member check. What do you say? Regards,
[HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code
Hi all, This is a purely informational email... I have put my hands on a Raspberry PI 2, and I have found that it takes 6 minutes to compile the backend code using the 4 cores of the ARMv7 processor, and close to 20 minutes on a single core (without ccache). The test has been done using ArchLinux ARM with a micro SD card of class 10 (said to be able to manage 50MB/s in write, 60MB/s in read). In comparison, the buildfarm machine hamster, which is a Rasp PI 1, takes close to 2 hours to do the same operation, using the same OS and a similar class 10 card (30MB/s but the I/O is not the bottleneck). Regards, -- Michael
Re: [HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code
Michael, This is a purely informational email... I have put my hands on a Raspberry PI 2, and I have found that it takes 6 minutes to compile the backend code using the 4 cores of the ARMv7 processor, and close to 20 minutes on a single core (without ccache). The test has been done using ArchLinux ARM with a micro SD card of class 10 (said to be able to manage 50MB/s in write, 60MB/s in read). In comparison, the buildfarm machine hamster, which is a Rasp PI 1, takes close to 2 hours to do the same operation, using the same OS and a similar class 10 card (30MB/s but the I/O is not the bottleneck). Thanks for sharing this info. I'm looking forward to getting my hands on an rpi2 soon for some of my other projects. Having an idea of the increased performance, especially related to compilation, was certainly something I was curious about. -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com
Re: [HACKERS] Abbreviated keys for text cost model fix
On 23.2.2015 21:52, Peter Geoghegan wrote: On Mon, Feb 23, 2015 at 11:17 AM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: So while it's true that for the 3rd query we get much worse results compared to the other queries (i.e. we don't get 400% speedup but ~3% slowdown compared to master), it's also true that master performs exceptionally well for this query with small datasets. Once we get to 2M rows, the master performance drops significantly but cost-model keeps the performance characteristics and the speedup jumps back to ~700% which is nice. These numbers are for the 'ASC + unsorted row' test, but I do get exactly the same pattern for the 'random' tests done previously. Yeah. Looks like you're comparing a case where the old cost model did the right thing anyway (i.e. used abbreviation). The difference would then be entirely explainable as noise. Right? I don't think it's just noise - on the i5-2500 CPU, the results are very consistent. For 1M random rows with this query (#3) select * from (select * from stuff_text order by randtxt offset 1000) foo I do get this (with 20 runs of the query):

                 min       max       median    mean      stddev
    master       932.449   932.902   932.637   932.639   0.116
    cost-model   957.16    958.47    957.497   957.586   0.411

That is super-consistent, IMHO, and the ~2.6% slowdown is clearly visible. But it might be due to the padding described by Andrew. Similarly on the Xeon E5450 CPU, although the results there are much more noisy. Maybe it's more It would be nice if we could address the 3% regression for the last query, but I guess it's not a big deal. The numbers in general are absolutely impressive. Kudos. Thanks. Are you going to add this into the CF? Would be nice to get this into 9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider it part of one of the existing sortsupport patches. -- Tomas Vondra   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT UPDATE and logical decoding
On Fri, Feb 20, 2015 at 3:44 PM, Peter Geoghegan p...@heroku.com wrote: they'd only see a REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that was the definitive outcome. No need for output plugins to consider REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all. Yes. It'd be easiest if the only the final insert/update were actually WAL logged as full actions. Well, that implies that we'd actually know that we'd succeed when WAL logging the speculative heap tuple's insertion. We literally have no way of knowing if that's the case at that point, though - that's just the nature of value locking scheme #2's optimistic approach. Attached patch does this. This is just a sketch, to provoke a quick discussion (which is hopefully all this needs). I did things this way because it seemed more efficient than rolling this into a whole new revision of the ON CONFLICT patch series. Ostensibly, this does the right thing: The test_decoding output plugin only ever reports either an INSERT or an UPDATE within a transaction (that contains an INSERT ... ON CONFLICT UPDATE). With jjanes_upsert running, I have not found any problematic cases where this fails to be true (although I have certainly been less than thorough so far - please take that into consideration). I have two basic concerns: * Do you think that what I've sketched here is roughly the right approach? Consolidating super deletions and insertions at transaction reassembly seemed like the right thing to do, but obviously I'm not best qualified to judge that. Clearly, it would be possible to optimize this so REORDER_BUFFER_CHANGE_INSERT peek ahead happens much more selectively, which I haven't bothered with yet. * What are the hazards implied by my abuse of the ReorderBufferIterTXNNext() interface? It looks like it should be okay to peek ahead like this, because it has a slightly odd convention around memory management that accidentally works. But honestly, I'm feeling too lazy to make myself grok the code within ReorderBufferIterTXNNext() at the moment, and so have not put concerns about the code being totally broken to rest yet. Could the current ReorderBufferChange be freed before it is sent to a logical decoding plugin by way of a call to the rb-apply_change() callback (when the xact is spooled to disk, for example)? 
Thanks -- Peter Geoghegan From 53784c818b39d39c6a1f3588f7823a9010074869 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan peter.geoghega...@gmail.com Date: Sun, 22 Feb 2015 17:02:14 -0800 Subject: [PATCH 7/7] Logical decoding support for ON CONFLICT UPDATE --- src/backend/replication/logical/decode.c| 5 +++- src/backend/replication/logical/reorderbuffer.c | 35 +++-- src/include/replication/reorderbuffer.h | 3 ++- 3 files changed, 39 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index e7614bd..a9cd0df 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -682,7 +682,10 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) return; change = ReorderBufferGetChange(ctx-reorder); - change-action = REORDER_BUFFER_CHANGE_DELETE; + if (!(xlrec-flags XLOG_HEAP_KILLED_SPECULATIVE_TUPLE)) + change-action = REORDER_BUFFER_CHANGE_DELETE; + else + change-action = REORDER_BUFFER_CHANGE_INTERNAL_DELETE; memcpy(change-data.tp.relnode, target_node, sizeof(RelFileNode)); diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 20bb3b7..8f9101e 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -401,6 +401,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change) case REORDER_BUFFER_CHANGE_INSERT: case REORDER_BUFFER_CHANGE_UPDATE: case REORDER_BUFFER_CHANGE_DELETE: + case REORDER_BUFFER_CHANGE_INTERNAL_DELETE: if (change-data.tp.newtuple) { ReorderBufferReturnTupleBuf(rb, change-data.tp.newtuple); @@ -1313,7 +1314,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, PG_TRY(); { + /* + * To merge super delete and insert records, peek-ahead is occasionally + * required + */ ReorderBufferChange *change; + ReorderBufferChange *nextchange = NULL; if (using_subtxn) BeginInternalSubTransaction(replay); @@ -1323,16 +1329,21 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, rb-begin(rb, txn); iterstate = ReorderBufferIterTXNInit(rb, txn); - while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL) + while ((change = nextchange ? nextchange : +ReorderBufferIterTXNNext(rb, iterstate)) != NULL) { Relation relation = NULL; Oid reloid; + /* Forget about previous peek ahead */ + nextchange = NULL; + switch (change-action)
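As a usage-level sketch of the behaviour under discussion (written with the ON CONFLICT syntax as it was eventually committed; the exact test_decoding output format is abbreviated here):

    -- requires wal_level = logical and a free replication slot
    SELECT pg_create_logical_replication_slot('s1', 'test_decoding');
    CREATE TABLE upsert_t (id int PRIMARY KEY, val text);

    INSERT INTO upsert_t VALUES (1, 'first')
        ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;
    -- no conflict: the plugin should report a single INSERT for this transaction

    INSERT INTO upsert_t VALUES (1, 'second')
        ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;
    -- conflict taken: the plugin should report a single UPDATE; the speculative
    -- insertion and its super-deletion are folded away during reassembly

    SELECT data FROM pg_logical_slot_get_changes('s1', NULL, NULL);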
Re: [HACKERS] Reduce pinning in btree indexes
On Mon, Feb 23, 2015 at 2:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? Suppose there's a dead tuple in the heap someplace, and an index pointer pointing to that dead tuple. An index scan reaches the relevant page and copies all of the index pointers. VACUUM removes the index pointer and marks the heap page all-visible. The scan now uses the index pointers copied to backend-local memory and notes that the heap-page is all-visible, so the scan sees the tuple even though that tuple is totally gone from the heap by that point. Holding the pin prevents this, because we can't reach the second heap pass until the scan is done with the page, and therefore the dead line pointer in the heap page can't be marked unused, and therefore the page can't be marked all-visible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
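For context, a minimal setup in which the index-only scan path (and hence the pin question) comes into play; the race described above happens inside the executor and VACUUM, so this only demonstrates the scan type itself:

    CREATE TABLE t (id int PRIMARY KEY, val text);
    INSERT INTO t SELECT g, 'x' FROM generate_series(1, 100000) g;
    VACUUM t;   -- sets all-visible bits, which is what makes index-only scans worthwhile
    EXPLAIN SELECT id FROM t WHERE id BETWEEN 100 AND 200;
    -- expected plan: Index Only Scan using t_pkey on t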
Re: [HACKERS] Abbreviated keys for Numeric
Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes: Tomas I believe the small regressions (1-10%) for small data sets, Tomas might be caused by this 'random padding' effect, because that's Tomas probably where L1/L2 cache is most important. For large datasets Tomas the caches are probably not as efficient anyway, so the random Tomas padding makes no difference, Except that _your own results_ show that this is not the case. In your first set of results you claimed a reduction in performance to 91% for a query which is _in no way whatsoever_ affected by any of the code changes. How is this not noise? I refer specifically to this case from your spreadsheet:

    query:    select * from (select * from stuff_text order by randtxt offset 1000) foo
    type:     text
    scale:    5
    master:   26.949
    datum:    28.051
    numeric:  29.734
    datum:    96%
    numeric:  91%

This query does not invoke any code path touched by either the datum or numeric patches! The observed slowdown is therefore just noise (assuming here that your timings are correct). Whether that case can be improved by tweaking the _text_ abbreviation code is another question, one which is not relevant to either of the patches currently in play. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] LSN Map
On Mon, Feb 23, 2015 at 12:52 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Dunno, but Jim's got a point. This is a maintenance burden to all indexams, if they all have to remember to update the LSN map separately. It needs to be done in some common code, like in PageSetLSN or XLogInsert or something. Aside from that, isn't this horrible from a performance point of view? The patch doubles the buffer manager traffic, because any update to any page will also need to modify the LSN map. This code is copied from the visibility map code, but we got away with it there because the VM only needs to be updated the first time a page is modified. Subsequent updates will know the visibility bit is already cleared, and don't need to access the visibility map. Ans scalability: Whether you store one value for every N pages, or the LSN of every page, this is going to have a huge effect of focusing contention to the LSN pages. Currently, if ten backends operate on ten different heap pages, for example, they can run in parallel. There will be some contention on the WAL insertions (much less in 9.4 than before). But with this patch, they will all fight for the exclusive lock on the single LSN map page. You'll need to find a way to not update the LSN map on every update. For example, only update the LSN page on the first update after a checkpoint (although that would still have a big contention focusing effect right after a checkpoint). I think it would make more sense to do this in the background. Suppose there's a background process that reads the WAL and figures out which buffers it touched, and then updates the LSN map accordingly. Then the contention-focusing effect disappears, because all of the updates to the LSN map are being made by the same process. You need some way to make sure the WAL sticks around until you've scanned it for changed blocks - but that is mighty close to what a physical replication slot does, so it should be manageable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
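A small sketch of the WAL-retention mechanism alluded to above: a physical replication slot keeps WAL around until its restart position advances, which is what a background LSN-map updater would need in order to scan WAL at its own pace:

    -- create a physical slot; once it has a restart position, WAL from that
    -- point onwards is retained until the position is advanced
    SELECT pg_create_physical_replication_slot('lsnmap_reader');

    -- the retained position can be inspected; the hypothetical background
    -- process would advance it as it finishes scanning each WAL range
    SELECT slot_name, restart_lsn FROM pg_replication_slots;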
Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)
Kouhei Kaigai wrote: Please see check_object_ownership(). It checks relation's ownership if OBJECT_COLUMN, however, type's ownership is the correct check if OBJECT_ATTRIBUTE. Hmm. Is there any case where the two are different? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
Hi, attached is the result of my first attempt to make the logical column ordering patch work. This touches a lot of code in the executor that is mostly new to me, so if you see something that looks like an obvious bug, it probably is (so let me know). improvements The main improvements of this version are that: * initdb actually works (while before it was crashing) * regression tests work, with two exceptions (a) 'subselect' fails because EXPLAIN prints columns in physical order (but we expect logical) (b) col_order crashes works because of tuple descriptor mismatch in a function call (this actually causes a segfault) The main change is this patch is that tlist_matches_tupdesc() now checks target list vs. physical attribute order, which may result in doing a projection (in cases when that would not be done previously). I don not claim this is the best approach - maybe it would be better to keep the physical tuple and reorder it lazily. That's why I kept a few pieces of code (fix_physno_mutator) and a few unused fields in Var. Over the time I've heard various use cases for this patch, but in most cases it was quite speculative. If you have an idea where this might be useful, can you explain it here, or maybe point me to a place where it's described? There's also a few FIXMEs, mostly from Alvaro's version of the patch. Some of them are probably obsolete, but I wasn't 100% sure by that so I've left them in place until I understand the code sufficiently. randomized testing -- I've also attached a python script for simple randomized testing. Just execute it like this: $ python randomize-attlognum.py -t test_1 test_2 \ --init-script attlognum-init.sql \ --test-script attlognum-test.sql and it will do this over and over $ dropdb test $ createdb test $ run init script $ randomly set attlognums for the tables (test_1 and test_2) $ run test script It does not actually check the result, but my experience is that when there's a bug in handling the descriptor, it results in segfault pretty fast (just put some varlena columns into the table). plans / future -- After discussing this with Alvaro, we've both agreed that this is far too high-risk change to commit in the very last CF (even if it was in a better shape). So while it's added to 2015-02 CF, we're aiming for 9.6 if things go well. regards -- Tomas Vondrahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index c5892d3..7528724 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -2368,6 +2368,9 @@ get_attnum_pk_pos(int *pkattnums, int pknumatts, int key) return -1; } +/* + * FIXME this probably needs to be tweaked. + */ static HeapTuple get_tuple_of_interest(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals) { diff --git a/contrib/spi/timetravel.c b/contrib/spi/timetravel.c index 0699438..30e496c 100644 --- a/contrib/spi/timetravel.c +++ b/contrib/spi/timetravel.c @@ -314,6 +314,7 @@ timetravel(PG_FUNCTION_ARGS) Oid *ctypes; char sql[8192]; char separ = ' '; + Form_pg_attribute *attrs; /* allocate ctypes for preparation */ ctypes = (Oid *) palloc(natts * sizeof(Oid)); @@ -322,10 +323,11 @@ timetravel(PG_FUNCTION_ARGS) * Construct query: INSERT INTO _relation_ VALUES ($1, ...) 
*/ snprintf(sql, sizeof(sql), INSERT INTO %s VALUES (, relname); + attrs = TupleDescGetLogSortedAttrs(tupdesc); for (i = 1; i = natts; i++) { - ctypes[i - 1] = SPI_gettypeid(tupdesc, i); - if (!(tupdesc-attrs[i - 1]-attisdropped)) /* skip dropped columns */ + ctypes[i - 1] = SPI_gettypeid(tupdesc, attrs[i - 1]-attnum); + if (!(attrs[i - 1]-attisdropped)) /* skip dropped columns */ { snprintf(sql + strlen(sql), sizeof(sql) - strlen(sql), %c$%d, separ, i); separ = ','; diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c index 6cd4e8e..14787ce 100644 --- a/src/backend/access/common/heaptuple.c +++ b/src/backend/access/common/heaptuple.c @@ -79,6 +79,8 @@ /* * heap_compute_data_size * Determine size of the data area of a tuple to be constructed + * + * Note: input arrays must be in attnum order. */ Size heap_compute_data_size(TupleDesc tupleDesc, @@ -88,16 +90,23 @@ heap_compute_data_size(TupleDesc tupleDesc, Size data_length = 0; int i; int numberOfAttributes = tupleDesc-natts; - Form_pg_attribute *att = tupleDesc-attrs; + Form_pg_attribute *att = TupleDescGetPhysSortedAttrs(tupleDesc); + /* + * We need to consider the attributes in physical order for storage, yet + * our input arrays are in attnum order. In this loop, i is an index + * into the attphysnum-sorted attribute array, and idx is an index into the + * input arrays. + */ for (i = 0; i numberOfAttributes; i++)