Re: [HACKERS] [v9.4] row level security
On 09/04/2013 11:22 PM, Tom Lane wrote:
> AFAICT, to deal with update/delete the RLS patch needs to constrain order of qual application without the crutch of having a separate level of subquery; and it's that behavior that I have zero confidence in, either as to whether it works as submitted or as to our odds of not breaking it in the future.

Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also protect against malicious RLS functions? For any subclause in a WHERE clause

    (a) OR (b)

you can instead write a short-circuit OR version as:

    CASE WHEN (a) THEN true ELSE (b) END

no? So, given a row security condition like (rowowner = current_user AND rls_malicious_leak()) on table test, this:

    SELECT * FROM test WHERE client_leak_fn();

could be rewritten by row security as:

    SELECT * FROM test
    WHERE CASE
            WHEN CASE WHEN is_superuser() THEN true
                      ELSE (rowowner = current_user AND rls_malicious_leak())
                 END
            THEN client_leak_fn()
          END;

In other words: if the user is the superuser, don't evaluate the RLS predicate, just assume they have the right to see the row. Otherwise, evaluate the RLS predicate to determine whether they can see the row. In either case, if they have the right to see the row, run the original WHERE clause.

My main concern is that it'd be relying on a guarantee that CASE evaluation is always completely ordered, and that might not be ideal. It's also hideously ugly, and future optimiser smarts could create unexpected issues.

Unless you propose the creation of a new short-circuit, left-to-right, order-guaranteed OR operator, and think the planner/optimizer should be taught special-case rules to prevent it from doing pull-up / push-down or pre-evaluating subclauses for it, I suspect the notion of using security barrier views or subqueries is still going to be the sanest way to do it.
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Something fishy happening on frogmouth
On 01.11.2013 18:22, Noah Misch wrote:
> On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:
>> On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>>> On 31.10.2013 16:43, Robert Haas wrote:
>>>> There should be no cases where the main shared memory segment gets cleaned up and the dynamic shared memory segments do not.
>
> 1. initdb -D data1
> 2. initdb -D data2
> 3. postgres -D data1
> 4. killall -9 postgres
> 5. postgres -D data2
>
> The system V shmem segment orphaned at step 4 will be cleaned up at step 5. The DSM segment will not. Note that dynamic_shared_memory_type='mmap' will give the desired behavior.

Hmm, here's another idea: Postmaster creates the POSIX shared memory object at startup, by calling shm_open(), and immediately calls shm_unlink() on it. That way, once all the processes have exited, the object will be removed automatically. Child processes inherit the file descriptor at fork(), and don't need to call shm_open(), just mmap().

I'm not sure how dynamic these segments need to be, but if 1-2 such file descriptors is not enough, you could mmap() different offsets from the same shmem object for different segments.

- Heikki
[HACKERS] What stopped SECURITY BARRIER views from being auto-updatable?
The current code just reads:

    /*
     * For now, we also don't support security-barrier views, because of the
     * difficulty of keeping upper-level qual expressions away from
     * lower-level data. This might get relaxed in future.
     */
    if (RelationIsSecurityView(view))
        return gettext_noop("Security-barrier views are not automatically updatable.");

which doesn't tell me tons. Is it the same problem RLS faces, where you can't just append the view predicate to the update predicate because you may leak information if the update predicate gets evaluated before the view predicate?

RLS attempts to solve that by wrapping the range table entry for the relation in a subquery over the original table qualified with the RLS predicate, forcing its evaluation before the outer predicate of the user-supplied query. That approach seems to be unpopular, but how much of that is about the implementation doing rewriting in the planner itself, and how much is objection to the principle?

If updatable views supported security barriers it'd really help reduce the amount of complex planner messing-around in the RLS patch, so this is something I'm quite interested in tackling if anyone has a few pointers on where to start.

It wouldn't be a total solution to RLS, as we'd still have the issue of the rewriter not getting re-run on plan invalidation and re-planning, which causes issues if the RLS clause was omitted the first time and shouldn't be this time, or vice versa. We'd need to be able to store the pre-rewrite parse tree and re-run the rewriter for plans where one or more relations were flagged as having RLS.

From what Kohei KaiGai was saying, it sounds like we'd also need to re-define how RLS and inheritance interact, so a parent's RLS predicate applies to all its children, as one of the reasons he had to do it in the optimizer/planner was the difficulty of handling totally independent RLS expressions for child and parent tables at the rewriter level.
Might this be a more palatable way forward than the fiddling with Vars and doing query rewriting inside the optimizer that the current RLS patch does?

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [PATCH] pg_receivexlog: fixed to work with logical segno 0
On 01.11.2013 11:42, Mika Eloranta wrote:
> pg_receivexlog calculated the xlog segment number incorrectly when started after the previous instance was interrupted. Resuming streaming only worked when the physical wal segment counter was zero, i.e. for the first 256 segments or so.

Oops. Fixed, thanks for the report!

It's a bit scary that this bug went unnoticed for this long; it was introduced quite early in the 9.3 development cycle. It seems that I did all the testing of streaming timeline changes with pg_receivexlog later in the 9.3 cycle with segment numbers < 256, and no-one else has done long-running tests with pg_receivexlog either.

- Heikki
Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?
On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:
> On 29.10.2013 03:16, Andres Freund wrote:
>> Hi,
>>
>> I've started a valgrind run earlier when trying to run the regression tests with valgrind --error-exitcode=122 (to cause the regression tests to fail visibly) but it crashed frequently... One of them was: ... Which seems to indicate a dangling ->rd_smgr pointer, confusing the heck out of me because I couldn't see how that'd happen.
>>
>> Looking a bit closer, it seems to me that the fake relcache infrastructure neglects the chance that something used the fake entry to read something, which will have done a RelationOpenSmgr(). That in turn will have set rel->rd_smgr->smgr_owner to rel. When we then pfree() the fake relation in FreeFakeRelcacheEntry we have a dangling pointer.
>
> Yeah, that's clearly a bug.
>
>> diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
>> index 5429d5e..ee7c892 100644
>> --- a/src/backend/access/transam/xlogutils.c
>> +++ b/src/backend/access/transam/xlogutils.c
>> @@ -433,6 +433,7 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
>>  void
>>  FreeFakeRelcacheEntry(Relation fakerel)
>>  {
>> +	RelationCloseSmgr(fakerel);
>>  	pfree(fakerel);
>>  }
>
> Hmm, I don't think that's a very good solution. Firstly, it will force the underlying files to be closed, hurting performance. Fake relcache entries are used in heapam when clearing visibility map bits, which might happen frequently enough for that to matter.

Well, it will only close them when they actually were smgropen()ed, which will not always be the case, although it probably is in the visibility map case. Feels like premature optimization to me.

> Secondly, it will fail if you create two fake relcache entries for the same relfilenode. Freeing the first will close the smgr entry, and freeing the second will try to close the same smgr entry again. We never do that in the current code, but it seems like a possible source of bugs in the future.
I think if we go there we'll need refcounted FakeRelcacheEntrys anyway. Otherwise we'll end up with a relation smgropen()ed multiple times in the same backend and such, which doesn't seem like a promising course to me since the smgr itself isn't refcounted.

> How about the attached instead?
>
> diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
> index 5429d5e..f732e71 100644
> --- a/src/backend/access/transam/xlogutils.c
> +++ b/src/backend/access/transam/xlogutils.c
> @@ -422,7 +422,11 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
>  	rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
>  	rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
>
> -	rel->rd_smgr = NULL;
> +	/*
> +	 * Open the underlying storage, but don't set rd_owner. We want the
> +	 * SmgrRelation to persist after the fake relcache entry is freed.
> +	 */
> +	rel->rd_smgr = smgropen(rnode, InvalidBackendId);
>
>  	return rel;
> }

I don't really like that - that will mean we'll leak open smgr handles for every relation touched via smgr during recovery, since they will never be closed unless the relation is dropped or such, and in some databases there can be huge numbers of relations. Since recovery is long-lived that doesn't seem like a good idea; besides the memory usage it will also bloat smgr's internal hash, which actually seems just as likely to hurt performance.

I think intentionally not using the owner mechanism is also dangerous - what if we have longer-lived fake relcache entries and we've just processed sinval messages? Then we'll have a ->rd_smgr pointing into nowhere.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] missing locking in at least INSERT INTO view WITH CHECK
On 2013-11-02 17:05:24 -0700, Kevin Grittner wrote:
> Andres Freund and...@2ndquadrant.com wrote:
>> the matview patch (0002)
>
> This is definitely needed as a bug fix. Will adjust comments and commit, back-patched to 9.3.

Thanks.

>> Also attached is 0004 which just adds a heap_lock() around a newly created temporary table in the matview code, which shouldn't be required for correctness but gives warm and fuzzy feelings as well as less debugging noise.
>
> Will think about this. I agree it is probably worth doing something to reduce the noise when looking for cases that actually matter.

It's pretty much free, so I don't think there really is any reason to deviate from other parts of the code. Note how e.g. copy_heap_data(), DefineRelation() and ATRewriteTable() all lock the new relation, even if it has just been created and is (and in the latter two cases always will be) invisible.

>> Wouldn't it be a good idea to tack such WARNINGs (in a proper and clean form) onto index_open (checking the underlying relation is locked), relation_open(..., NoLock) (checking the relation has previously been locked) and maybe RelationIdGetRelation() when cassert is enabled? ISTM we frequently had bugs around this. It would be nice to have such omissions pointed out during early testing.
>
> Will try to come up with something.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] buffile.c resource owner breakage on segment extension
On 2013-11-01 15:28:54 -0400, Tom Lane wrote:
> Andres Freund and...@2ndquadrant.com writes:
>> While not particularly nice, given the API, it seems best for buffile.c to remember the resource owner used for the original segment and temporarily set that during the extension.
>
> Hm, yeah, that seems right. It's just like repalloc keeping the memory chunk in its original context. The comments here are a bit inadequate...

Thanks for committing, and sorry for the need to freshen up the comments. I don't think I had ever opened buffile.c before, and thus wasn't sure there wasn't a better fix, so I didn't want to spend too much time on it before somebody agreed it was the right fix.

Also, I was actually just trying to recover some data from a corrupted database and this stopped me from it ;)

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Something fishy happening on frogmouth
On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
> On 01.11.2013 18:22, Noah Misch wrote:
>> On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:
>>> On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>>>> On 31.10.2013 16:43, Robert Haas wrote:
>>>>> There should be no cases where the main shared memory segment gets cleaned up and the dynamic shared memory segments do not.
>>
>> 1. initdb -D data1
>> 2. initdb -D data2
>> 3. postgres -D data1
>> 4. killall -9 postgres
>> 5. postgres -D data2
>>
>> The system V shmem segment orphaned at step 4 will be cleaned up at step 5. The DSM segment will not. Note that dynamic_shared_memory_type='mmap' will give the desired behavior.

Well, with the significant price of causing file-io.

> Hmm, here's another idea: Postmaster creates the POSIX shared memory object at startup, by calling shm_open(), and immediately calls shm_unlink() on it. That way, once all the processes have exited, the object will be removed automatically. Child processes inherit the file descriptor at fork(), and don't need to call shm_open(), just mmap().

Uh. Won't that completely and utterly remove the point of dsm, which is that you can create segments *after* startup? We surely don't want to start overallocating enough shmem so we don't ever dynamically need to allocate segments.

Also, I don't think it's portable across platforms to access segments that have already been unlinked.

I think this is looking for a solution, without an actually relevant problem, disregarding the actual problem space.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption
On 2013-11-02 15:29:36 -0400, Tom Lane wrote:
> Attached is a proposed patch for this. It fixes most of the functions in printtup.c to use a per-row memory context. (I did not bother to fix debugtup(), which is used only in standalone mode. If you're doing queries large enough for mem leaks to be problematic in standalone mode, you're nuts.)

FWIW, by that definition I have been nuts several times in the past - without much choice, since I was recovering data from a corrupted cluster and the database couldn't be started normally. This now has gotten worse (even in the back branches) since debugtup won't clean up detoasted data anymore. But I guess the solution for that is to use COPY, which shouldn't have that problem and should work in those situations. Also easier to parse ;)

> My original thought had been to get rid of all pfree's of output function results, so as to make the world safe for returning constant strings when such functions find it appropriate. However, I ran into a showstopper problem: SPI_getvalue(), which is little more than a wrapper around the appropriate type output function, is documented as returning a pfree'able string. Some though not all of the callers in core and contrib take the hint and pfree the result, and I'm sure there are more in third-party extensions. So if we want to allow output functions to return non-palloc'd strings, it seems we have to either change SPI_getvalue()'s API contract or insert a pstrdup() call in it. Neither of these is attractive, mainly because the vast majority of output function results would still be palloc'd even if we made aggressive use of the option not to. That means that if we did the former, light testing wouldn't necessarily show a problem if someone forgot to remove a pfree() after a SPI_getvalue() call, and it also means that if we did the latter, the pstrdup() would usually be a waste of cycles and space.
I guess one option for the future is to pstrdup() in SPI_getvalue() and add a new function that tells whether the result can be freed or not. Then callers that care can move over. Although I'd guess that many of those that care will already use SPI_getbinval() manually.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Removal of archive in wal_level
Please find attached a patch doing what is written in the $subject. With the documentation updated, this is even better...

Regards,
-- Michael

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index ccb76d8..0f20253 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -582,7 +582,7 @@ tar -cf backup.tar /usr/local/pgsql/data
  <para>
   To enable WAL archiving, set the <xref linkend="guc-wal-level">
-  configuration parameter to <literal>archive</> (or <literal>hot_standby</>),
+  configuration parameter to <literal>hot_standby</>,
   <xref linkend="guc-archive-mode"> to <literal>on</>,
   and specify the shell command to use in the <xref
   linkend="guc-archive-command"> configuration parameter. In practice
@@ -1246,7 +1246,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
   If more flexibility in copying the backup files is needed, a lower
   level process can be used for standalone hot backups as well.
   To prepare for low level standalone hot backups, set <varname>wal_level</> to
-  <literal>archive</> (or <literal>hot_standby</>), <varname>archive_mode</> to
+  <literal>hot_standby</>, <varname>archive_mode</> to
   <literal>on</>, and set up an <varname>archive_command</> that performs
   archiving only when a <emphasis>switch file</> exists. For example:
  <programlisting>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77a9303..2c6a022 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1606,10 +1606,9 @@ include 'filename'
         <varname>wal_level</> determines how much information is written
         to the WAL. The default value is <literal>minimal</>, which writes
         only the information needed to recover from a crash or immediate
-        shutdown. <literal>archive</> adds logging required for WAL archiving,
-        and <literal>hot_standby</> further adds information required to run
-        read-only queries on a standby server.
-        This parameter can only be set at server start.
+        shutdown. <literal>hot_standby</> adds logging required for WAL
+        archiving and information required to run read-only queries on a
+        standby server. This parameter can only be set at server start.
        </para>
        <para>
         In <literal>minimal</> level, WAL-logging of some bulk
@@ -1625,22 +1624,17 @@ include 'filename'
         </simplelist>
         But minimal WAL does not contain enough information to reconstruct
         the data from a base backup and the
-        WAL logs, so either <literal>archive</> or <literal>hot_standby</>
-        level must be used to enable
+        WAL logs, so <literal>hot_standby</> level must be used to enable
         WAL archiving (<xref linkend="guc-archive-mode">) and streaming
         replication.
        </para>
        <para>
-        In <literal>hot_standby</> level, the same information is logged as
-        with <literal>archive</>, plus information needed to reconstruct
-        the status of running transactions from the WAL. To enable read-only
-        queries on a standby server, <varname>wal_level</> must be set to
-        <literal>hot_standby</> on the primary, and
-        <xref linkend="guc-hot-standby"> must be enabled in the standby. It is
-        thought that there is
-        little measurable difference in performance between using
-        <literal>hot_standby</> and <literal>archive</> levels, so feedback
-        is welcome if any production impacts are noticeable.
+        In <literal>hot_standby</> level, necessary information is logged
+        to reconstruct the status of running transactions from the WAL and
+        to enable read-only queries on a standby server. Note that
+        <varname>wal_level</> must be set to <literal>hot_standby</> on
+        the primary, and <xref linkend="guc-hot-standby"> must be enabled
+        in the standby.
        </para>
       </listitem>
      </varlistentry>
@@ -2198,8 +2192,8 @@ include 'filename'
         of connections, so the parameter cannot be set higher than
         <xref linkend="guc-max-connections">. This parameter can only
         be set at server start. <varname>wal_level</> must be set
-        to <literal>archive</> or <literal>hot_standby</> to allow
-        connections from standby servers.
+        to <literal>hot_standby</> to allow connections from standby
+        servers.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f407753..af4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -630,7 +630,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 			 * WAL record that will allow us to conflict with queries
 			 * running on standby.
 			 */
-			if (XLogStandbyInfoActive())
+			if (XLogIsNeeded())
 			{
 				BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 073190f..81f7c75 100644
---
Re: [HACKERS] Something fishy happening on frogmouth
On 04.11.2013 11:55, Andres Freund wrote:
> On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
>> Hmm, here's another idea: Postmaster creates the POSIX shared memory object at startup, by calling shm_open(), and immediately calls shm_unlink() on it. That way, once all the processes have exited, the object will be removed automatically. Child processes inherit the file descriptor at fork(), and don't need to call shm_open(), just mmap().
>
> Uh. Won't that completely and utterly remove the point of dsm, which is that you can create segments *after* startup? We surely don't want to start overallocating enough shmem so we don't ever dynamically need to allocate segments.

You don't need to allocate the shared memory beforehand, only create the file descriptor. Note that the size of the segment is not specified in the shm_open() call, but in the mmap() that comes later. If we need a large number of small segments, so that it's not feasible to shm_open() them all at postmaster startup, you could shm_open() just one segment and carve out smaller segments from it by mmap()ing at different offsets.

> Also, I don't think it's portable across platforms to access segments that have already been unlinked.

See http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:

    If one or more references to the shared memory object exist when the object is unlinked, the name shall be removed before shm_unlink() returns, but the removal of the memory object contents shall be postponed until all open and map references to the shared memory object have been removed.

That doesn't explicitly say that a new shm_open() on the file descriptor must still work after shm_unlink(), but I would be surprised if there is a platform where it doesn't.

> I think this is looking for a solution without an actually relevant problem, disregarding the actual problem space.

I agree. What are these dynamic shared memory segments supposed to be used for?
- Heikki
Re: [HACKERS] Something fishy happening on frogmouth
On 2013-11-04 13:13:27 +0200, Heikki Linnakangas wrote:
> On 04.11.2013 11:55, Andres Freund wrote:
>> On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
>>> Postmaster creates the POSIX shared memory object at startup, by calling shm_open(), and immediately calls shm_unlink() on it. That way, once all the processes have exited, the object will be removed automatically. Child processes inherit the file descriptor at fork(), and don't need to call shm_open(), just mmap().
>>
>> Uh. Won't that completely and utterly remove the point of dsm, which is that you can create segments *after* startup? We surely don't want to start overallocating enough shmem so we don't ever dynamically need to allocate segments.
>
> You don't need to allocate the shared memory beforehand, only create the file descriptor. Note that the size of the segment is not specified in the shm_open() call, but in the mmap() that comes later. If we need a large number of small segments, so that it's not feasible to shm_open() them all at postmaster startup, you could shm_open() just one segment and carve out smaller segments from it by mmap()ing at different offsets.

That will quickly result in fragmentation, which we don't have the tools to handle.

>> Also, I don't think it's portable across platforms to access segments that have already been unlinked.
>
> See http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
>
>     If one or more references to the shared memory object exist when the object is unlinked, the name shall be removed before shm_unlink() returns, but the removal of the memory object contents shall be postponed until all open and map references to the shared memory object have been removed.

We also support sysv shmem and have the same cleanup problem there.

> That doesn't explicitly say that a new shm_open() on the file descriptor must still work after shm_unlink(), but I would be surprised if there is a platform where it doesn't.

Probably true.
>> I think this is looking for a solution without an actually relevant problem, disregarding the actual problem space.

To make that clearer: I think the discussions about making it impossible to leak segments after rm -rf are the irrelevant problem.

> I agree. What are these dynamic shared memory segments supposed to be used for?

Parallel sort and stuff like that.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [PATCH] pg_receivexlog: fixed to work with logical segno 0
On Nov 4, 2013, at 11:06, Heikki Linnakangas wrote:
> On 01.11.2013 11:42, Mika Eloranta wrote:
>> pg_receivexlog calculated the xlog segment number incorrectly when started after the previous instance was interrupted. Resuming streaming only worked when the physical wal segment counter was zero, i.e. for the first 256 segments or so.
>
> Oops. Fixed, thanks for the report! It's a bit scary that this bug went unnoticed for this long; it was introduced quite early in the 9.3 development cycle. Seems that I did all the testing of streaming timeline changes with pg_receivexlog later in the 9.3 cycle with segment numbers < 256, and no-one else has done long-running tests with pg_receivexlog either.

Thanks for the fix, Heikki!

It sounds like either PostgreSQL 9.3.x and/or pg_receivexlog is not yet used in a lot of places. Otherwise this probably would have been found earlier. Affected versions:

    $ git tag --contains dfda6eba
    REL9_3_0
    REL9_3_1
    REL9_3_BETA1
    REL9_3_BETA2
    REL9_3_RC1

What makes this a really sneaky and severe problem is the way it stays dormant for a period of time after a fresh db init or pg_upgrade. Here's how I bumped into it:

1. Old postgresql 9.2 db running, pg_receivexlog streaming extra backups to a remote box.
2. pg_upgrade to 9.3.1.
3. pg_receivexlog from the upgraded DB still works ok and handles restarts fine, because the xlog indexes were reset back to zero at pg_upgrade.
4. xlog history eventually grows over 256 * 16MB.
5. pg_receivexlog gets interrupted for whatever reason (gets stopped, killed, crashes, host is restarted).
6. A new pg_receivexlog instance fails to resume streaming, and there is no easy workaround that would maintain an uninterrupted, gapless xlog history.

Initially, before I had analysed the problem any further, I had to stash the xlogs, restart pg_receivexlog and after that trigger new pg_basebackups.

Regardless of this bug, I find that pg_receivexlog (and pg_basebackup) are excellent tools and people should use them more!

PS.
Something like pg_receivexlog --start-pos=2D/1500 might be nice for overriding the streaming start position.

-- Mika Eloranta Ohmu Ltd. http://www.ohmu.fi/
Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?
On 04.11.2013 11:35, Andres Freund wrote:
> On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:
>> Secondly, it will fail if you create two fake relcache entries for the same relfilenode. Freeing the first will close the smgr entry, and freeing the second will try to close the same smgr entry again. We never do that in the current code, but it seems like a possible source of bugs in the future.
>
> I think if we go there we'll need refcounted FakeRelcacheEntrys anyway. Otherwise we'll end up with a relation smgropen()ed multiple times in the same backend and such, which doesn't seem like a promising course to me since the smgr itself isn't refcounted.

>> How about the attached instead?
>>
>> diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
>> index 5429d5e..f732e71 100644
>> --- a/src/backend/access/transam/xlogutils.c
>> +++ b/src/backend/access/transam/xlogutils.c
>> @@ -422,7 +422,11 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
>>  	rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
>>  	rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
>>
>> -	rel->rd_smgr = NULL;
>> +	/*
>> +	 * Open the underlying storage, but don't set rd_owner. We want the
>> +	 * SmgrRelation to persist after the fake relcache entry is freed.
>> +	 */
>> +	rel->rd_smgr = smgropen(rnode, InvalidBackendId);
>>
>>  	return rel;
>> }
>
> I don't really like that - that will mean we'll leak open smgr handles for every relation touched via smgr during recovery, since they will never be closed unless the relation is dropped or such. And in some databases there can be huge numbers of relations. Since recovery is long-lived that doesn't seem like a good idea; besides the memory usage it will also bloat smgr's internal hash, which actually seems just as likely to hurt performance.

That's the way SmgrRelations are supposed to be used. Opened on first use, and kept open forever. We never smgrclose() during normal operation either, unless the relation is dropped or something like that.
And see how XLogReadBuffer() also calls smgropen() with no corresponding smgrclose().

> I think intentionally not using the owner mechanism is also dangerous - what if we have longer-lived fake relcache entries and we've just processed sinval messages? Then we'll have a ->rd_smgr pointing into nowhere.

Hmm, the startup process doesn't participate in sinval messaging at all, does it?

- Heikki
Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?
On 2013-11-04 14:37:53 +0200, Heikki Linnakangas wrote:
> On 04.11.2013 11:35, Andres Freund wrote:
>> On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:
>>> [patch making CreateFakeRelcacheEntry() do
>>> rel->rd_smgr = smgropen(rnode, InvalidBackendId), quoted upthread]
>>
>> I don't really like that - that will mean we'll leak open smgr handles
>> for every relation touched via smgr during recovery since they will
>> never be closed unless the relation is dropped or such. And in some
>> databases there can be huge amounts of relations. Since recovery is
>> long lived that doesn't seem like a good idea; besides the memory usage
>> it will also bloat smgr's internal hash, which actually seems just as
>> likely to hurt performance.
>
> That's the way SmgrRelations are supposed to be used. Opened on first
> use, and kept open forever. We never smgrclose() during normal
> operation either, unless the relation is dropped or something like
> that. And see how XLogReadBuffer() also calls smgropen() with no
> corresponding smgrclose().

Ok, that's quite the fair argument, although I'd say normal backends won't open many relations in comparison to the startup process when doing replication. But that certainly isn't sufficient argument for changing logic in the back branches or even HEAD.

>> I think intentionally not using the owner mechanism also is dangerous -
>> what if we have longer lived fake relcache entries and we've just
>> processed sinval messages? Then we'll have a ->rd_smgr pointing into
>> nowhere.
> Hmm, the startup process doesn't participate in sinval messaging at
> all, does it?

Well, not sinval but inval, in hot standby via commit messages.

What about just unowning the smgr entry with

    if (rel->rd_smgr != NULL)
        smgrsetowner(NULL, rel->rd_smgr);

when closing the fake relcache entry? That shouldn't require any further changes than changing the comment in smgrsetowner() that this isn't something expected to frequently happen.

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
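Andres's suggestion amounts to disowning the handle instead of closing it. As a toy illustration (the struct and function names here are hypothetical stand-ins, not PostgreSQL's actual smgr code in src/backend/storage/smgr/smgr.c), the owner mechanism can be modeled as a back-pointer to the owning pointer, so that disowning leaves the shared handle alive while guaranteeing no pointer into the freed entry survives:

```c
#include <stdlib.h>

/*
 * Toy model of the smgr owner mechanism discussed above.  An SmgrHandle
 * records the address of the pointer that "owns" it, so the smgr layer
 * can reset that pointer instead of leaving it dangling.
 */
typedef struct SmgrHandle
{
    /* Address of the pointer that owns this handle, or NULL if unowned. */
    struct SmgrHandle **owner;
} SmgrHandle;

typedef struct FakeRelcacheEntry
{
    SmgrHandle *smgr;
} FakeRelcacheEntry;

/* Analogue of smgrsetowner(); passing owner == NULL disowns the handle. */
static void
smgr_set_owner(SmgrHandle **owner, SmgrHandle *h)
{
    if (h->owner != NULL && h->owner != owner)
        *h->owner = NULL;       /* unhook the previous owner */
    h->owner = owner;
}

/*
 * What freeing the fake relcache entry would do under the proposal
 * above: disown rather than close, so the handle outlives the fake
 * entry but no back-pointer into freed memory remains.
 */
static void
free_fake_entry(FakeRelcacheEntry *rel)
{
    if (rel->smgr != NULL)
        smgr_set_owner(NULL, rel->smgr);
    free(rel);
}
```

The point of the model: after `free_fake_entry()`, the handle's `owner` field is NULL, so nothing ever dereferences the freed entry.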
Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?
On 2013-11-04 13:48:32 +0100, Andres Freund wrote:
>> Hmm, the startup process doesn't participate in sinval messaging at
>> all, does it?
>
> Well, not sinval but inval, in hot standby via commit messages.

Err, that's bullshit, sorry for that. We send the messages via sinval, but never (probably at least?) process sinval messages.

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Extension Templates S03E11
Dimitri,

* Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote:
> So please find v15 of the patch attached to this email, that passes all
> previously done checks and this one too now.

Looks like there's been a bit of unfortunate bitrot due to Tom's change to disable fancy output:

    patching file src/test/regress/expected/sanity_check.out
    Hunk #1 FAILED at 104.
    Hunk #2 FAILED at 166.
    2 out of 2 hunks FAILED -- saving rejects to file src/test/regress/expected/sanity_check.out.rej

Are there any other changes you have pending for this..? Would be nice to see the latest version which you've tested and which patches cleanly against master... ;)

I'll still go ahead and start looking through this, per our discussion.

Thanks,

Stephen
Re: [HACKERS] Row-security writer-side checks proposal
On Fri, Nov 1, 2013 at 3:52 AM, Craig Ringer cr...@2ndquadrant.com wrote:
> I've been looking some more into write-side checks in row-security and
> have a suggestion. Even though write-side checks are actually fairly
> separate to read checks, and can be done as another step, I'd like to
> think about them before the catalog format and syntax are settled. I
> think we need fields for write operations in pg_rowsecurity and the
> syntax to set them, so that the catalog information can be used by
> triggers to enforce write checks. Even if, for the first cut, they're
> not supported by built-in auto-created triggers. Here's my proposal,
> let me know what you think:
>
>     SET ROW SECURITY FOR { ALL COMMANDS | { SELECT, INSERT, UPDATE, DELETE }+ }
>
> in other words, you specify either:
>
>     SET ROW SECURITY FOR ALL COMMANDS

I continue to think that this syntax is misguided. For SELECT and DELETE there is only read-side security, and for INSERT there is only write-side security, so that's OK as far as it goes, but for UPDATE both read-side security and write-side security are possible, and there ought to be a way to get one without the other. This syntax won't support that cleanly.

I wonder whether it's worth thinking about the relationship between the write-side security contemplated for this feature and the WITH CHECK OPTION syntax that we have for auto-updateable views, which serves more or less the same purpose. I'm not sure that syntax is any great shakes, but it's existing precedent of some form and could perhaps at least be looked at as a source of inspiration.

I would generally expect that most people would want either read-side security for all commands, or read- and write-side security for all commands. I think whatever syntax we come up with for this feature ought to make each of those things straightforward to get.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Removal of archive in wal_level
On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier michael.paqu...@gmail.com wrote:
> Please find attached a patch doing what is written in the $subject.
> With the documentation updated, this is even better...

I'm unconvinced that there's any value in this.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [v9.4] row level security
Craig Ringer cr...@2ndquadrant.com writes:
> On 09/04/2013 11:22 PM, Tom Lane wrote:
>> AFAICT, to deal with update/delete the RLS patch needs to constrain
>> order of qual application without the crutch of having a separate
>> level of subquery; and it's that behavior that I have zero confidence
>> in, either as to whether it works as submitted or as to our odds of
>> not breaking it in the future.
>
> Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also
> protect against malicious RLS functions?

You mean wrap all the query quals in a CASE? Sure, if you didn't mind totally destroying any optimization possibilities. If you did that, every table scan would become a seqscan and every join a nestloop.

regards, tom lane
Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption
Andres Freund and...@2ndquadrant.com writes:
> On 2013-11-02 15:29:36 -0400, Tom Lane wrote:
>> Attached is a proposed patch for this. It fixes most of the functions
>> in printtup.c to use a per-row memory context. (I did not bother to
>> fix debugtup(), which is used only in standalone mode. If you're doing
>> queries large enough for mem leaks to be problematic in standalone
>> mode, you're nuts.)
>
> FWIW, by that definition I have been nuts several times in the past -
> without much choice, since I was recovering data from a corrupted
> cluster and the database couldn't be started normally. This now has
> gotten worse (even in the backbranches) since debugtup won't clean up
> detoasted data anymore. But I guess the solution for that is to use
> COPY in those situations, which shouldn't have that problem and should
> work. Also, easier to parse ;)

Considering the output from debugtup goes to stdout, where it will be intermixed with prompts etc, I'd have to think that COPY is a way better solution for dealing with bulk data.

Really I'd like to see standalone mode, in its current form, go away completely. I had a prototype patch for allowing psql and other clients to interface to a standalone backend. I think getting that finished would be a way more productive use of time than improving debugtup.

regards, tom lane
Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption
On 2013-11-04 09:45:22 -0500, Tom Lane wrote:
> Really I'd like to see standalone mode, in its current form, go away
> completely. I had a prototype patch for allowing psql and other clients
> to interface to a standalone backend. I think getting that finished
> would be a way more productive use of time than improving debugtup.

+many

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Something fishy happening on frogmouth
Andres Freund and...@2ndquadrant.com writes:
> On 2013-11-04 13:13:27 +0200, Heikki Linnakangas wrote:
>> On 04.11.2013 11:55, Andres Freund wrote:
>>> Also, I don't think it's portable across platforms to access segments
>>> that already have been unlinked.
>>
>> See http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
>> "If one or more references to the shared memory object exist when the
>> object is unlinked, the name shall be removed before shm_unlink()
>> returns, but the removal of the memory object contents shall be
>> postponed until all open and map references to the shared memory
>> object have been removed."
>
> We also support sysv shmem and have the same cleanup problem there.

And what about Windows?

regards, tom lane
Re: [HACKERS] [v9.4] row level security
On Mon, Nov 4, 2013 at 9:37 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Craig Ringer cr...@2ndquadrant.com writes:
>> On 09/04/2013 11:22 PM, Tom Lane wrote:
>>> AFAICT, to deal with update/delete the RLS patch needs to constrain
>>> order of qual application without the crutch of having a separate
>>> level of subquery; and it's that behavior that I have zero confidence
>>> in, either as to whether it works as submitted or as to our odds of
>>> not breaking it in the future.
>>
>> Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also
>> protect against malicious RLS functions?
>
> You mean wrap all the query quals in a CASE? Sure, if you didn't mind
> totally destroying any optimization possibilities. If you did that,
> every table scan would become a seqscan and every join a nestloop.

I'd still like to hear what's wrong with what I said here:

http://www.postgresql.org/message-id/ca+tgmoyr1phw3x9vnvuwdcfxkzk2p_jhtwc0fv2q58negcx...@mail.gmail.com

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption
On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> [discussion of debugtup and COPY snipped]
>
> Really I'd like to see standalone mode, in its current form, go away
> completely. I had a prototype patch for allowing psql and other clients
> to interface to a standalone backend. I think getting that finished
> would be a way more productive use of time than improving debugtup.

The last patch, including a Windows implementation, was posted at the below link some time back:

http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs

If this is the right thing, I can rebase the patch and make it ready.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] dsm use of uint64
On Fri, Nov 1, 2013 at 11:45 PM, Peter Eisentraut pete...@gmx.net wrote:
> On Mon, 2013-10-28 at 12:17 -0400, Robert Haas wrote:
>> On Sun, Oct 27, 2013 at 11:34 PM, Noah Misch n...@leadboat.com wrote:
>>> On Fri, Oct 25, 2013 at 10:11:41PM -0400, Robert Haas wrote:
>>>> When I wrote the dynamic shared memory patch, I used uint64
>>>> everywhere to measure sizes - rather than, as we do for the main
>>>> shared memory segment, Size. This now seems to me to have been the
>>>> wrong decision;
>
> This change is now causing compiler warnings on 32-bit platforms. You
> can see them here, for example:
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwing&dt=2013-11-01%2020%3A45%3A01&stg=make

Ah. This is because I didn't change the format code used to print the arguments; it's still using UINT64_FORMAT, but the argument is now a Size. What's the right way to print out a Size, anyway? Should I just try %lu? It seems that sysv_shmem.c uses %lu, but also casts the value to (unsigned long); I could follow that precedent here, if there's no consistency to what format is needed to print a size_t.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Removal of archive in wal_level
On 11/4/13, 8:58 AM, Robert Haas wrote:
> On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier michael.paqu...@gmail.com wrote:
>> Please find attached a patch doing what is written in the $subject.
>> With the documentation updated, this is even better...
>
> I'm unconvinced that there's any value in this.

Yeah, the only thing this will accomplish is to annoy people who are actually using that level. It would be more interesting if we could get rid of the wal_level setting altogether, but of course there are valid reasons against that.
Re: [HACKERS] dsm use of uint64
On 2013-11-04 10:46:06 -0500, Robert Haas wrote:
> [earlier quoting snipped]
>
> Ah. This is because I didn't change the format code used to print the
> arguments; it's still using UINT64_FORMAT, but the argument is now a
> Size. What's the right way to print out a Size, anyway?

There isn't a nice one currently. glibc/gcc have %z, which automatically has the right width, but that's not supported by Windows. I've been wondering if we shouldn't add support for that just like we have added support for %m.

> Should I just try %lu? It seems that sysv_shmem.c uses %lu, but also
> casts the value to (unsigned long); I could follow that precedent here,
> if there's no consistency to what format is needed to print a size_t.

Yes, you need a cast like that.

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
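For reference, the cast idiom endorsed here looks like the following (a minimal sketch with a hypothetical helper name, not the actual sysv_shmem.c code). Note the caveat: on LLP64 platforms such as 64-bit Windows, unsigned long is only 32 bits, so sizes above 4GB would be truncated by this spelling, whereas C99's %zu would not have that problem where it is available:

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/*
 * Portable formatting of a size_t: cast the argument to unsigned long
 * so it matches %lu exactly, which works on pre-C99 compilers and on
 * Windows where %zu is unavailable.  Caveat: on LLP64 platforms
 * unsigned long is 32 bits, so values above 4GB truncate here.
 */
static void
format_size(char *buf, size_t buflen, size_t val)
{
    snprintf(buf, buflen, "%lu", (unsigned long) val);
}
```

This mirrors the sysv_shmem.c precedent mentioned in the thread: cast plus %lu rather than relying on an implementation-defined width for size_t.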
Re: [HACKERS] Fast insertion indexes: why no developments
On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs si...@2ndquadrant.com wrote:
> On 29 October 2013 16:10, Peter Geoghegan p...@heroku.com wrote:
>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
>>> I don't see much interest in insert-efficient indexes.
>>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.

What is pretty cool about this sort of thing is that there's no intrinsic reason the insertion buffer needs to be block-structured or disk-backed. In theory, you can structure the in-memory portion of the tree any way you like, using pointers and arbitrary-size memory allocations and all that fun stuff. You need to log that there's a deferred insert (or commit to flushing the insertion buffer before every commit, which would seem to miss the point) so that recovery can reconstruct the in-memory data structure and flush it, but that's it: the WAL format need not know any other details of the in-memory portion of the tree. I think that, plus the ability to use pointers and so forth, might lead to significant performance gains.

In practice, the topology of our shared memory segment makes this a bit tricky. The problem isn't so much that it's fixed size as that it lacks a real allocator, and that all the space used for shared_buffers is nailed down and can't be borrowed for other purposes. I'm very interested in solving the problem of getting a real allocator for shared memory because I think I need it for parallel sort, and even if there's a way to avoid needing it there, I have a strong feeling that we'll want it for other applications of dynamic shared memory.
But it should be possible to write the allocator in such a way that you just give it a chunk of memory to manage, and it does, remaining agnostic about whether the memory is from the main shared memory segment or a dynamic one.

Of course, it's possible that even if we do get a shared memory allocator, a hypothetical person working on this project might prefer to make the data block-structured anyway and steal storage from shared_buffers. So my aspirations in this area may not even be relevant. But I wanted to mention them, just in case anyone else is thinking about similar things, so that we can potentially coordinate.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] dsm use of uint64
On Mon, Nov 4, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote:
>> Ah. This is because I didn't change the format code used to print the
>> arguments; it's still using UINT64_FORMAT, but the argument is now a
>> Size. What's the right way to print out a Size, anyway?
>
> There isn't a nice one currently. glibc/gcc have %z that automatically
> has the right width, but that's not supported by windows. I've been
> wondering if we shouldn't add support for that just like we have added
> support for %m.
>
>> Should I just try %lu? It seems that sysv_shmem.c uses %lu, but also
>> casts the value to (unsigned long); I could follow that precedent
>> here, if there's no consistency to what format is needed to print a
>> size_t.
>
> Yes, you need a cast like that.

Whee, portability is fun. OK, I changed it to work that way. Hopefully that'll do the trick.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 1:09 PM, Robert Haas robertmh...@gmail.com wrote:
> [nested quoting snipped]
>
> What is pretty cool about this sort of thing is that there's no
> intrinsic reason the insertion buffer needs to be block-structured or
> disk-backed. In theory, you can structure the in-memory portion of the
> tree any way you like, using pointers and arbitrary-size memory
> allocations and all that fun stuff. You need to log that there's a
> deferred insert (or commit to flushing the insertion buffer before
> every commit, which would seem to miss the point)

Such a thing would help COPY, so maybe it's worth a look.
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire klaussfre...@gmail.com wrote:
> Such a thing would help COPY, so maybe it's worth a look

I have little doubt that a deferred insertion buffer of some kind could help performance on some workloads, though I suspect the buffer would have to be pretty big to make it worthwhile on a big COPY that generates mostly-random insertions. I think the question is not so much whether it's worth doing but where anyone's going to find the time to do it.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas robertmh...@gmail.com wrote:
> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire klaussfre...@gmail.com wrote:
>> Such a thing would help COPY, so maybe it's worth a look
>
> I have little doubt that a deferred insertion buffer of some kind could
> help performance on some workloads, though I suspect the buffer would
> have to be pretty big to make it worthwhile on a big COPY that
> generates mostly-random insertions. I think the question is not so much
> whether it's worth doing but where anyone's going to find the time to
> do it.

However, since an admin can increase work_mem for that COPY, using work_mem for this would be reasonable, don't you agree?
Re: [HACKERS] pg_ctl status with nonexistent data directory
On Sat, Nov 2, 2013 at 3:32 PM, Peter Eisentraut pete...@gmx.net wrote:
> This doesn't seem right:
>
>     $ pg_ctl -D /nowhere status
>     pg_ctl: no server running
>
> It does exit with status 3, so it's not all that broken, but I think
> the error message could be more accurate.

I doubt anyone will object if you feel the urge to fix it. I won't, anyway.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 11:31 AM, Claudio Freire klaussfre...@gmail.com wrote:
> [nested quoting snipped]
>
> However, since an admin can increase work_mem for that COPY, using
> work_mem for this would be reasonable, don't you agree?

Without implementing it and benchmarking the result, it's pretty hard to know. But if I were a betting man, I'd bet that's not nearly big enough.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Fast insertion indexes: why no developments
On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire klaussfre...@gmail.com wrote:
>> Such a thing would help COPY, so maybe it's worth a look
>
> I have little doubt that a deferred insertion buffer of some kind could
> help performance on some workloads, though I suspect the buffer would
> have to be pretty big to make it worthwhile on a big COPY that
> generates mostly-random insertions.

Even for random data, presorting the to-be-inserted data appropriately could result in much better io patterns.

> I think the question is not so much whether it's worth doing but where
> anyone's going to find the time to do it.

Yea :(

I think doing this outside of s_b will make stuff rather hard for physical replication and crash recovery, since we either will need to flush the whole buffer at checkpoints - which is hard since the checkpointer doesn't work inside individual databases - or we need to persist the in-memory buffer across restart, which also sucks.

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
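The presorting idea can be sketched in a few lines (all names here are hypothetical illustrations, nothing like the real executor or index-am code): buffer incoming keys in memory, sort the run before flushing, and only then hand it to the insert path, so index page accesses happen in key order rather than randomly:

```c
#include <stdlib.h>
#include <string.h>

#define BUF_CAP 8192            /* in reality this might be sized from work_mem */

/* A trivial deferred-insertion buffer over int keys. */
typedef struct InsertBuffer
{
    int keys[BUF_CAP];
    int n;
} InsertBuffer;

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Returns 1 when the buffer is full and buffer_flush() should be called. */
static int
buffer_add(InsertBuffer *buf, int key)
{
    buf->keys[buf->n++] = key;
    return buf->n == BUF_CAP;
}

/*
 * Sort the buffered keys and copy the ordered run into dest (which
 * stands in for the real index-insertion path).  Returns the number of
 * keys flushed and resets the buffer.
 */
static int
buffer_flush(InsertBuffer *buf, int *dest)
{
    int n = buf->n;

    qsort(buf->keys, n, sizeof(int), cmp_int);
    memcpy(dest, buf->keys, n * sizeof(int));
    buf->n = 0;
    return n;
}
```

The point is only the io-pattern argument: a sorted run turns scattered btree descents into mostly sequential page touches. It says nothing about the WAL-logging and checkpoint problems raised above.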
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund and...@2ndquadrant.com wrote:
> I think doing this outside of s_b will make stuff rather hard for
> physical replication and crash recovery since we either will need to
> flush the whole buffer at checkpoints - which is hard since the
> checkpointer doesn't work inside individual databases - or we need to
> persist the in-memory buffer across restart which also sucks.

You might be right, but I think part of the value of LSM-trees is that the in-memory portion of the data structure is supposed to be able to be optimized for in-memory storage rather than on-disk storage. It may be that block-structuring that data bleeds away much of the performance benefit. Of course, I'm talking out of my rear end here: I don't really have a clue how these algorithms are supposed to work.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Removal of archive in wal_level
* Peter Eisentraut (pete...@gmx.net) wrote:
> On 11/4/13, 8:58 AM, Robert Haas wrote:
>> I'm unconvinced that there's any value in this.
>
> Yeah, the only thing this will accomplish is to annoy people who are
> actually using that level. It would be more interesting if we could get
> rid of the wal_level setting altogether, but of course there are valid
> reasons against that.

It would actually be valuable to 'upgrade' those people to hot_standby, which is what I had kind of been hoping would happen eventually. I agree that there's no use for 'archive' today, but rather than break existing configs that use it, just make 'archive' and 'hot_standby' mean the same thing. In the end, I'd probably vote to make 'hot_standby' the legacy/deprecated term anyway.

Thanks,

Stephen
Re: [HACKERS] Removal of archive in wal_level
On Mon, Nov 4, 2013 at 11:45 AM, Stephen Frost sfr...@snowman.net wrote:
> [nested quoting snipped]
>
> It would actually be valuable to 'upgrade' those people to hot_standby,
> which is what I had kind of been hoping would happen eventually. I
> agree that there's no use for 'archive' today, but rather than break
> existing configs that use it, just make 'archive' and 'hot_standby'
> mean the same thing. In the end, I'd probably vote to make
> 'hot_standby' the legacy/deprecated term anyway.

That strikes me as a better idea than what the patch actually does, but I still think it's nanny-ism. I don't believe we have the right to second-guess the choices our users make in this area. We can make recommendations in the documentation, but at the end of the day if users choose to use archive rather than hot_standby, we should respect that choice, not break it because we think we know better.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] How can I build OSSP UUID support on Windows to avoid duplicate UUIDs?
On Thu, Oct 31, 2013 at 3:42 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Oct 31, 2013 at 2:44 PM, Garick Hamlin gham...@isc.upenn.edu wrote: I think using /dev/urandom directly would be surprising. At least it would have probably have taken me a while to figure out what was depleting the entropy pool here. Perhaps so; a bigger problem IMHO is that it's not portable. I think the only way to solve this problem is to import (or have an option to link with) a strong, sophisticated PRNG with much larger internal state than pg_lrand48, which uses precisely 48 bits of internal state. For this kind of thing, I'm fairly sure that we need something with at least 128 bits of internal state (as wide as the random value we want to generate) and I suspect it might be advantageous to have something a whole lot wider, maybe a few kB. I mentioned the notion of building an entropy pool, into which one might add various sorts of random inputs, under separate cover... The last time I had need of a rather non-repeating RNG, I went with a Fibonacci-based one, namely Mersenne Twister... http://en.wikipedia.org/wiki/Mersenne_twister The sample has 624 integers (presumably that means 624x32 bits) as its internal state. Apparently not terribly suitable for cryptographic purposes, but definitely highly non-repetitive, which is what we're notably worried about for UUIDs. -- When confronted by a difficult problem, solve it by reducing it to the question, How would the Lone Ranger handle this?
Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.
On Fri, Nov 01, 2013 at 04:51:34PM +0000, Tom Lane wrote: Remove internal uses of CTimeZone/HasCTZSet. The only remaining places where we actually look at CTimeZone/HasCTZSet are abstime2tm() and timestamp2tm(). Now that session_timezone is always valid, we can remove these special cases. The caller-visible impact of this is that these functions now always return a valid zone abbreviation if requested, whereas before they'd return a NULL pointer if a brute-force timezone was in use. In the existing code, the only place I can find that changes behavior is to_char(), whose TZ format code will now print something useful rather than nothing for such zones. (In the places where the returned zone abbreviation is passed to EncodeDateTime, the lack of visible change is because we've chosen the abbreviation used for these zones to match what EncodeTimezone would have printed.) This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES styles, because it inserts a space before tzn but does not insert a space before EncodeTimezone() output. Example:

  set datestyle = sql,mdy;
  select '2013-01-01'::timestamptz;

old output:
        timestamptz
 ------------------------
  01/01/2013 00:00:00+00

new output:
        timestamptz
 -------------------------
  01/01/2013 00:00:00 +00

I assume this was unintended. This could be back-patched if we decide we'd like to fix to_char()'s behavior in the back branches, but there doesn't seem to be much enthusiasm for that at present. Note that for the corresponding scenario in Oracle, the TZD format code yields blank and the TZR format code yields -01:30. (Oracle does not have a plain TZ escape.) So there's precedent for each choice of behavior, and I don't see this as a back-patch candidate. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption
Amit Kapila amit.kapil...@gmail.com writes: On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Really I'd like to see standalone mode, in its current form, go away completely. I had a prototype patch for allowing psql and other clients to interface to a standalone backend. I think getting that finished would be a way more productive use of time than improving debugtup. The last patch including Windows implementation was posted at below link sometime back. http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs If this is the right thing, I can rebase the patch and make it ready. IIRC, the discussion stalled out because people had security concerns and/or there wasn't consensus about how it should look at the user level. There's probably not much point in polishing a patch until we have more agreement about the high-level design. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.
Noah Misch n...@leadboat.com writes: On Fri, Nov 01, 2013 at 04:51:34PM +0000, Tom Lane wrote: Remove internal uses of CTimeZone/HasCTZSet. This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES styles, because it inserts a space before tzn but does not insert a space before EncodeTimezone() output. Example:

  set datestyle = sql,mdy;
  select '2013-01-01'::timestamptz;

old output:
        timestamptz
 ------------------------
  01/01/2013 00:00:00+00

new output:
        timestamptz
 -------------------------
  01/01/2013 00:00:00 +00

I assume this was unintended. Yeah, I had seen some cases of that. I don't find it of great concern. We'd have to insert some ugly special-case code that looks at the zone name to decide whether to insert a space or not, and I don't think it'd actually be an improvement to do so. (Arguably, these formats are more consistent this way.) Still, this change is probably a sufficient reason not to back-patch this part of the fix. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 05/11/13 05:35, Robert Haas wrote: On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund and...@2ndquadrant.com wrote: I think doing this outside of s_b will make stuff rather hard for physical replication and crash recovery since we either will need to flush the whole buffer at checkpoints - which is hard since the checkpointer doesn't work inside individual databases - or we need to persist the in-memory buffer across restart which also sucks. You might be right, but I think part of the value of LSM-trees is that the in-memory portion of the data structure is supposed to be able to be optimized for in-memory storage rather than on disk storage. It may be that block-structuring that data bleeds away much of the performance benefit. Of course, I'm talking out of my rear end here: I don't really have a clue how these algorithms are supposed to work. How about having a 'TRANSIENT INDEX' that only exists in memory, so there is no requirement to write it to disk or to replicate directly? This type of index would be very fast and easier to implement. Recovery would involve rebuilding the index, and sharing would involve recreating on a slave. Probably not appropriate for a primary index, but may be okay for secondary indexes used to speed specific queries. This could be useful in some situations now, and allow time to get experience in how best to implement the basic concept. Then a more robust solution using WAL etc can be developed later. I suspect that such a TRANSIENT INDEX would still be useful even when a more robust in memory index method was available. As I expect it would be faster to set up than a robust memory index - which might be good when you need to have one or more indexes for a short period of time, or the size of the index is so small that recreating it requires very little time (total elapsed time might even be less than a disk backed one?). 
Cheers, Gavin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On 4 November 2013 16:09, Robert Haas robertmh...@gmail.com wrote: On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs si...@2ndquadrant.com wrote: On 29 October 2013 16:10, Peter Geoghegan p...@heroku.com wrote: On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci m_li...@yahoo.it wrote: I don't see much interest in insert-efficient indexes. Presumably someone will get around to implementing a btree index insertion buffer one day. I think that would be a particularly compelling optimization for us, because we could avoid ever inserting index tuples that are already dead when the deferred insertion actually occurs. That's pretty much what the LSM-tree is. What is pretty cool about this sort of thing is that there's no intrinsic reason the insertion buffer needs to be block-structured or disk-backed. In theory, you can structure the in-memory portion of the tree any way you like, using pointers and arbitrary-size memory allocations and all that fun stuff. You need to log that there's a deferred insert (or commit to flushing the insertion buffer before every commit, which would seem to miss the point) so that recovery can reconstruct the in-memory data structure and flush it, but that's it: the WAL format need not know any other details of the in-memory portion of the tree. I think that, plus the ability to use pointers and so forth, might lead to significant performance gains. Sounds like it could be cool, yes. In practice, the topology of our shared memory segment makes this a bit tricky. The problem isn't so much that it's fixed size as that it lacks a real allocator, and that all the space used for shared_buffers is nailed down and can't be borrowed for other purposes. I'm very interested in solving the problem of getting a real allocator for shared memory because I think I need it for parallel sort Agreed, you need a shared memory allocator for parallel query. 
It's just too tempting to build a hash table in shared memory on one thread, then use the hash table from multiple sessions as we do a parallel scan. Roughly same thing for parallel sort - passing pointers to complete data objects around is much easier than actually moving the data between processes, which would slow things down. We do also need generalised inter-node pipe but that's not the best solution to every problem. It's also a great way of controlling over-allocation of resources. , and even if there's a way to avoid needing it there I have a strong feeling that we'll want it for other applications of dynamic shared memory. But it should be possible to write the allocator in such a way that you just give it a chunk of memory to manage, and it does, remaining agnostic about whether the memory is from the main shared memory segment or a dynamic one. Agreed Of course, it's possible that even we do get a shared memory allocator, a hypothetical person working on this project might prefer to make the data block-structured anyway and steal storage from shared_buffers. So my aspirations in this area may not even be relevant. But I wanted to mention them, just in case anyone else is thinking about similar things, so that we can potentially coordinate. If anyone was going to work on LSM tree, I would advise building a tree in shared/temp buffers first, then merging with the main tree. The merge process could use the killed tuple approach to mark the merging. The most difficult thing about buffering the inserts is deciding which poor sucker gets the task of cleaning up. That's probably better as an off-line process, which is where the work comes in. Non shared buffered approaches would add too much overhead to the main task. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 8:09 AM, Robert Haas robertmh...@gmail.com wrote: On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs si...@2ndquadrant.com wrote: On 29 October 2013 16:10, Peter Geoghegan p...@heroku.com wrote: On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci m_li...@yahoo.it wrote: I don't see much interest in insert-efficient indexes. Presumably someone will get around to implementing a btree index insertion buffer one day. I think that would be a particularly compelling optimization for us, because we could avoid ever inserting index tuples that are already dead when the deferred insertion actually occurs. That's pretty much what the LSM-tree is. What is pretty cool about this sort of thing is that there's no intrinsic reason the insertion buffer needs to be block-structured or disk-backed. How do we commit to not spilling to disk, in the face of an unbounded number of indexes existing and wanting to use this mechanism simultaneously? If it routinely needs to spill to disk, that would probably defeat the purpose of having it in the first place, but committing to never doing so seems to be extremely restrictive. As you say it is also freeing, in terms of using pointers and such, but I think the restrictions would outweigh the freedom. In theory, you can structure the in-memory portion of the tree any way you like, using pointers and arbitrary-size memory allocations and all that fun stuff. You need to log that there's a deferred insert (or commit to flushing the insertion buffer before every commit, which would seem to miss the point) so that recovery can reconstruct the in-memory data structure and flush it, but that's it: the WAL format need not know any other details of the in-memory portion of the tree. I think that, plus the ability to use pointers and so forth, might lead to significant performance gains. In practice, the topology of our shared memory segment makes this a bit tricky. 
The problem isn't so much that it's fixed size as that it lacks a real allocator, and that all the space used for shared_buffers is nailed down and can't be borrowed for other purposes. I think the fixed size is also a real problem, especially given the ubiquitous advice not to exceed 2 to 8 GB. Cheers, Jeff
Re: [HACKERS] [BUGS] BUG #8542: Materialized View with another column_name does not work?
Kevin Grittner kgri...@ymail.com wrote: Michael Paquier michael.paqu...@gmail.com wrote: I am not sure that adding a boolean flag introducing a concept related to matview inside checkRuleResultList is the best approach to solve that. checkRuleResultList is something related only to rules, and has nothing related to matviews in it yet. Well, I was tempted to keep that concept in DefineQueryRewrite() where the call is made, by calling the new bool something like requireColumnNameMatch and not having checkRuleResultList() continue to use isSelect for that purpose at all. DefineQueryRewrite() already references RELKIND_RELATION once, RELKIND_MATVIEW twice, and RELKIND_VIEW three times, so it would hardly be introducing a new concept there. Upon reflection, that seemed to be cleaner. Pushed fix that way. Not much of a change from the previously posted patch, but attached here in case anyone wants to argue against this approach. Thanks for the report! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

*** a/src/backend/rewrite/rewriteDefine.c
--- b/src/backend/rewrite/rewriteDefine.c
***************
*** 44,50 ****
  static void checkRuleResultList(List *targetList, TupleDesc resultDesc,
! 					bool isSelect);
  static bool setRuleCheckAsUser_walker(Node *node, Oid *context);
  static void setRuleCheckAsUser_Query(Query *qry, Oid userid);
--- 44,50 ----
  static void checkRuleResultList(List *targetList, TupleDesc resultDesc,
! 					bool isSelect, bool requireColumnNameMatch);
  static bool setRuleCheckAsUser_walker(Node *node, Oid *context);
  static void setRuleCheckAsUser_Query(Query *qry, Oid userid);
*************** DefineQueryRewrite(char *rulename,
*** 355,361 ****
  		 */
  		checkRuleResultList(query->targetList,
  							RelationGetDescr(event_relation),
! 							true);

  		/*
  		 * ... there must not be another ON SELECT rule already ...
--- 355,363 ----
  		 */
  		checkRuleResultList(query->targetList,
  							RelationGetDescr(event_relation),
! 							true,
! 							event_relation->rd_rel->relkind !=
! 							RELKIND_MATVIEW);

  		/*
  		 * ... there must not be another ON SELECT rule already ...
*************** DefineQueryRewrite(char *rulename,
*** 484,490 ****
  					 errmsg("RETURNING lists are not supported in non-INSTEAD rules")));
  		checkRuleResultList(query->returningList,
  							RelationGetDescr(event_relation),
! 							false);
  	}
  }
--- 486,492 ----
  					 errmsg("RETURNING lists are not supported in non-INSTEAD rules")));
  		checkRuleResultList(query->returningList,
  							RelationGetDescr(event_relation),
! 							false, false);
  	}
  }
*************** DefineQueryRewrite(char *rulename,
*** 613,627 ****
   * Verify that targetList produces output compatible with a tupledesc
   *
   * The targetList might be either a SELECT targetlist, or a RETURNING list;
!  * isSelect tells which.  (This is mostly used for choosing error messages,
!  * but also we don't enforce column name matching for RETURNING.)
   */
  static void
! checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect)
  {
  	ListCell   *tllist;
  	int			i;

  	i = 0;
  	foreach(tllist, targetList)
  	{
--- 615,634 ----
   * Verify that targetList produces output compatible with a tupledesc
   *
   * The targetList might be either a SELECT targetlist, or a RETURNING list;
!  * isSelect tells which.  This is used for choosing error messages.
!  *
!  * A SELECT targetlist may optionally require that column names match.
   */
  static void
! checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect,
! 					bool requireColumnNameMatch)
  {
  	ListCell   *tllist;
  	int			i;

+ 	/* Only a SELECT may require a column name match. */
+ 	Assert(isSelect || !requireColumnNameMatch);
+ 
  	i = 0;
  	foreach(tllist, targetList)
  	{
*************** checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect)
*** 657,663 ****
  					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  					 errmsg("cannot convert relation containing dropped columns to view")));
! 			if (isSelect && strcmp(tle->resname, attname) != 0)
  				ereport(ERROR,
  						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
  						 errmsg("SELECT rule's target entry %d has different column name from \"%s\"", i, attname)));
--- 664,670 ----
  					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  					 errmsg("cannot convert relation containing dropped columns to view")));
! 			if (requireColumnNameMatch && strcmp(tle->resname, attname) != 0)
  				ereport(ERROR,
  						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
  						 errmsg("SELECT rule's target entry %d has different column name from \"%s\"", i, attname)));
*** a/src/test/regress/expected/matview.out
--- b/src/test/regress/expected/matview.out
***************
*** 450,452 ****
  SELECT * FROM boxmv ORDER BY id;
--- 450,475 ----
  DROP TABLE boxes CASCADE;
  NOTICE: drop cascades to materialized view
Re: [HACKERS] GIN improvements part 1: additional information
On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov aekorot...@gmail.com wrote: Attached version of patch is debugged. I would like to note that the number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that the numbers are improved as the result of your refactoring.

          event         |      period
 -----------------------+------------------
  index_build           | 00:01:45.416822
  index_build_recovery  | 00:00:06
  index_update          | 00:05:17.263606
  index_update_recovery | 00:01:22
  search_new            | 00:24:07.706482
  search_updated        | 00:26:25.364494
 (6 rows)

      label      | blocks_mark
 ----------------+-------------
  search_new     |   847587636
  search_updated |   881210486
 (2 rows)

      label     |   size
 ---------------+-----------
  new           | 419299328
  after_updates | 715915264
 (2 rows)

Beside simple bugs, there was a subtle bug in incremental item indexes update. I've added some more comments, including an ASCII picture about how item indexes are shifted. Now, I'm trying to implement support of the old page format. Then we can decide which approach to use. Attached version of patch has support of the old page format. Patch still needs more documentation and probably refactoring, but I believe the idea is clear and it can be reviewed. In the patch I have to revert the change of null category placement for compatibility with the old posting list format. -- With best regards, Alexander Korotkov. gin-packed-postinglists-12.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] List of binary-compatible data types
Folks, From our docs: Adding a column with a non-null default or changing the type of an existing column will require the entire table and indexes to be rewritten. As an exception, if the USING clause does not change the column contents and the old type is either binary coercible to the new type or an unconstrained domain over the new type, a table rewrite is not needed ... Which is nice, but nowhere do we present users with a set of binary-compatible data types, even among the built-in types. I'd happily write this up, if I knew what the binary-compatible data types *were*. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast insertion indexes: why no developments
On Mon, Nov 4, 2013 at 5:01 PM, Simon Riggs si...@2ndquadrant.com wrote: Of course, it's possible that even we do get a shared memory allocator, a hypothetical person working on this project might prefer to make the data block-structured anyway and steal storage from shared_buffers. So my aspirations in this area may not even be relevant. But I wanted to mention them, just in case anyone else is thinking about similar things, so that we can potentially coordinate. If anyone was going to work on LSM tree, I would advise building a tree in shared/temp buffers first, then merging with the main tree. The merge process could use the killed tuple approach to mark the merging. The most difficult thing about buffering the inserts is deciding which poor sucker gets the task of cleaning up. That's probably better as an off-line process, which is where the work comes in. Non shared buffered approaches would add too much overhead to the main task. Thing is, if you want crash safety guarantees, you cannot use temp (unlogged) buffers, and then you always have to flush to WAL at each commit. If the staging index is shared, then it could mean a lot of WAL (ie: probably around double the amount of WAL a regular b-tree would generate). Process-private staging trees that get merged on commit, ie: transaction-scope staging trees, on the other hand, do not require WAL logging, they can use temp buffers, and since they don't outlive the transaction, it's quite obvious who does the merging (the committer). Question is what kind of workload does that speed up with any significance and whether the amount of work is worth that speedup on those workloads. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] List of binary-compatible data types
On 4 November 2013 21:58, Josh Berkus j...@agliodbs.com wrote: Folks, From our docs: Adding a column with a non-null default or changing the type of an existing column will require the entire table and indexes to be rewritten. As an exception, if the USING clause does not change the column contents and the old type is either binary coercible to the new type or an unconstrained domain over the new type, a table rewrite is not needed ... Which is nice, but nowhere do we present users with a set of binary-compatible data types, even among the built-in types. I'd happily write this up, if I knew what the binary-compatible data types *were*. You could try this:

SELECT castsource::regtype::text,
       array_agg(casttarget::regtype order by casttarget::regtype::text) casttargets
FROM pg_cast
WHERE castmethod = 'b'
GROUP BY 1
ORDER BY 1;

-- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Row-security writer-side checks proposal
On 11/04/2013 09:55 PM, Robert Haas wrote: I continue to think that this syntax is misguided. For SELECT and DELETE there is only read-side security, and for INSERT there is only write-side security, so that's OK as far as it goes, but for UPDATE both read-side security and write-side security are possible, and there ought to be a way to get one without the other. This syntax won't support that cleanly. That's what I was thinking earlier too - separate FOR READ and FOR WRITE instead. The reason I came back to insert/update/delete was that it's entirely reasonable to want to prohibit deletes but permit updates to the same tuple. Both are writes; both set xmax, it's just that one _replaces_ the tuple, the other doesn't. So really, there are four cases:

READ
WRITE INSERT
WRITE UPDATE
WRITE DELETE

I wonder whether it's worth thinking about the relationship between the write-side security contemplated for this feature and the WITH CHECK OPTION syntax that we have for auto-updateable views, which serves more or less the same purpose. I'm not sure that syntax is any great shakes, but it's existing precedent of some form and could perhaps at least be looked at as a source of inspiration. I've been thinking about the overlap with WITH CHECK OPTION as well. I would generally expect that most people would want either read-side security for all commands, or read- and write-side security for all commands. I think whatever syntax we come up with for this feature ought to make each of those things straightforward to get, but sometimes with different predicates for read and write, i.e. you can see rows you can't modify, or can insert rows / update rows that you can't see after the change. Similarly, saying you can update but not delete seems quite reasonable to me. On the other hand, we might choose to say if you want to do things with that granularity, use your own triggers to enforce it, and provide only READ and WRITE for RLS.
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.
On Mon, Nov 04, 2013 at 02:34:02PM -0500, Tom Lane wrote: Noah Misch n...@leadboat.com writes: On Fri, Nov 01, 2013 at 04:51:34PM +0000, Tom Lane wrote: Remove internal uses of CTimeZone/HasCTZSet. This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES styles, because it inserts a space before tzn but does not insert a space before EncodeTimezone() output. Example:

  set datestyle = sql,mdy;
  select '2013-01-01'::timestamptz;

old output:
        timestamptz
 ------------------------
  01/01/2013 00:00:00+00

new output:
        timestamptz
 -------------------------
  01/01/2013 00:00:00 +00

I assume this was unintended. Yeah, I had seen some cases of that. I don't find it of great concern. We'd have to insert some ugly special-case code that looks at the zone name to decide whether to insert a space or not, and I don't think it'd actually be an improvement to do so. (Arguably, these formats are more consistent this way.) Still, this change is probably a sufficient reason not to back-patch this part of the fix. I was prepared to suppose that no substantial clientele relies on the to_char() TZ format code expanding to blank, the other behavior that changed with this patch. It's more of a stretch to figure applications won't stumble over this new whitespace. I do see greater consistency in the new behavior, but changing type output is a big deal. While preserving the old output does require ugly special-case code, such code was in place for years until removal by this commit and the commit following it. Perhaps making the behavior change is best anyway, but we should not conclude that decision on the basis of its origin as a side effect of otherwise-desirable refactoring. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] List of binary-compatible data types
Thom, SELECT castsource::regtype::text, array_agg(casttarget::regtype order by casttarget::regtype::text) casttargets FROM pg_cast WHERE castmethod = 'b' GROUP BY 1 ORDER BY 1; Are we actually covering 100% of these for ALTER COLUMN now? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] List of binary-compatible data types
On 11/04/2013 05:21 PM, Josh Berkus wrote: Thom, SELECT castsource::regtype::text, array_agg(casttarget::regtype order by casttarget::regtype::text) casttargets FROM pg_cast WHERE castmethod = 'b' GROUP BY 1 ORDER BY 1; Are we actually covering 100% of these for ALTER COLUMN now? Also, JSON -> Text seems to be missing from the possible binary conversions. That's a TODO, I suppose. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.4] row level security
On 11/04/2013 11:17 PM, Robert Haas wrote: I'd still like to hear what's wrong with what I said here: http://www.postgresql.org/message-id/ca+tgmoyr1phw3x9vnvuwdcfxkzk2p_jhtwc0fv2q58negcx...@mail.gmail.com For me, just my understanding. I'm still too new to the planner and rewriter to grasp that properly as written. I was responding to Tom's objection, and he makes a good point about CASE and optimisation. We have to be free to re-order and pre-evaluate where LEAKPROOF flags make it safe and permissible, without ever otherwise doing so. That makes the SECURITY BARRIER subquery look better, since the limited pull-up / push-down is already implemented there. Robert, any suggestions on how to approach what you suggest? I'm pretty new to the planner's guts, but I know there've been some complaints about the way the current RLS code fiddles with Vars when it changes a direct scan of a rel into a subquery scan. The code in: https://github.com/ringerc/postgres/blob/rls-9.4/src/backend/optimizer/prep/prepunion.c#L1647 and https://github.com/ringerc/postgres/blob/rls-9.4/src/backend/optimizer/prep/prepunion.c#L1591 seems to be the one folks were complaining about earlier. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] List of binary-compatible data types
On Mon, Nov 04, 2013 at 05:23:36PM -0800, Josh Berkus wrote: On 11/04/2013 05:21 PM, Josh Berkus wrote: Thom, SELECT castsource::regtype::text, array_agg(casttarget::regtype order by casttarget::regtype::text) casttargets FROM pg_cast WHERE castmethod = 'b' GROUP BY 1 ORDER BY 1; Are we actually covering 100% of these for ALTER COLUMN now? Yes; ALTER TABLE ALTER TYPE refers to the same metadata as Thom's query. If you add to the list by issuing CREATE CAST ... WITHOUT FUNCTION, ALTER TABLE will respect that, too. Also, JSON -> Text seems to be missing from the possible binary conversions. That's a TODO, I suppose. Only json -> text, not json <-> text. Note that you can add the cast manually if you have an immediate need. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption
On Tue, Nov 5, 2013 at 12:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Really I'd like to see standalone mode, in its current form, go away completely. I had a prototype patch for allowing psql and other clients to interface to a standalone backend. I think getting that finished would be a way more productive use of time than improving debugtup. The last patch including Windows implementation was posted at below link sometime back. http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs If this is the right thing, I can rebase the patch and make it ready. IIRC, the discussion stalled out because people had security concerns and/or there wasn't consensus about how it should look at the user level. There's probably not much point in polishing a patch until we have more agreement about the high-level design. Okay. However, +1 for working on that idea, as a lot of people have shown interest in it. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WITHIN GROUP patch
On 11/04/2013 08:43 AM, Atri Sharma wrote: Please find attached our latest version of the patch. This version fixes the issues pointed out by the reviewers.

No, it doesn't. The documentation still contains formatting and grammatical errors, and the code comments still do not match their surroundings. (The use of "I" in the code comments is a point I have conceded on IRC, but I stand by my other remarks.) Don't bother submitting a new patch until I've posted my technical review, but please fix these issues in your local copy.

-- Vik
Re: [HACKERS] logical column order and physical column order
On Mon, Nov 4, 2013 at 3:14 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: David Rowley escribió: I've just been looking at how alignment of columns in tuples can make the tuple larger than needed.

This has been discussed at length previously, and there was a design proposed to solve this problem. See these past discussions:
http://archives.postgresql.org/pgsql-hackers/2006-12/msg00983.php
http://archives.postgresql.org/pgsql-hackers/2007-02/msg01235.php
http://archives.postgresql.org/pgsql-hackers/2008-11/msg01680.php

I started work on this, and managed to get parts of it to work. While doing so I realized that it was quite a lot more hideous than I had originally expected. I published a tree at github: https://github.com/alvherre/postgres/tree/column

Thanks for the archive links... I read these last night and pulled out some key pieces of information from some of the posts. I should say that I've not dived into the code too much to see how hard it would be, but my perhaps naive original idea would have just been to add 1 column to pg_attribute to store the logical order and have attnum store the physical order... This would have meant that at least only the following places would have to take the change into account:

1. pg_dump needs to display columns in logical order, both for CREATE TABLE and COPY/INSERT statements.
2. INSERT INTO table VALUES ( ... ) (i.e. without column names) needs to look at logical order.
3. CREATE TABLE (LIKE table)
4. Backup and restore using COPY.
5. Expansion of SELECT * to column names.

And of lesser importance, as I'd assume it would just be a change to an ORDER BY clause in their queries of pg_attribute:

1. Display in clients: psql, pgAdmin.

I thought the above would have been doable, and I did wonder what all the fuss was about relating to bugs in the code where it could use the logical number instead of attnum. On reading the posts last night I can see that the idea was to add not 1 but 2 new fields to pg_attribute.
One was for the physical order and one for the logical order, and at first I didn't really understand, as I thought attnum would always be the physical order. I didn't really know before this that attnum was static... I did some tests where I dropped one of the middle columns out of a table and then rewrote the table with CLUSTER, and I see that the pg_attribute record is kept even though the remains of the column values have been wiped out by the rewrite... Is this done because things like indexes, foreign keys and sequences reference {attrelid, attnum}? If so, then I see why the 2 extra columns are needed, and I guess that's where the extra complications come from.

So now I'm wondering, with my freshly clustered table which I dropped one of the middle columns from before the cluster... my pg_attribute rows look something like:

 relname |   attname    | attnum
---------+--------------+--------
 dropcol | tableoid     |     -7
 dropcol | cmax         |     -6
 dropcol | xmax         |     -5
 dropcol | cmin         |     -4
 dropcol | xmin         |     -3
 dropcol | ctid         |     -1
 dropcol | one          |      1
 dropcol | pg.dropped.2 |      2
 dropcol | three        |      3

and I would imagine, since the table has just been clustered, that the columns are stored like {..., ctid, one, three}. In this case how does PostgreSQL know that attnum 3 is the 2nd user column in that table? Unless I have misunderstood something, there must be some logic in there to skip dropped columns, and if so, could it not just grab the attphynum at that location, and then just modify the 1-5 places listed above to sort on attlognum?

Regards

David Rowley
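The experiment described above can be reproduced with a short script (table and index names here are illustrative; the dropped column leaves a placeholder pg_attribute row whose attname is mangled to mark it as dropped):

```sql
-- Sketch of the dropped-column experiment from the discussion above.
CREATE TABLE dropcol (one int, two int, three int);
CREATE INDEX dropcol_one_idx ON dropcol (one);

ALTER TABLE dropcol DROP COLUMN two;      -- pg_attribute row is kept, marked dropped
CLUSTER dropcol USING dropcol_one_idx;    -- rewrites the table; the row still remains

SELECT attname, attnum, attisdropped
FROM pg_attribute
WHERE attrelid = 'dropcol'::regclass
ORDER BY attnum;
```

The attisdropped flag is what lets tuple-descriptor code skip the dropped slot while attnum stays stable for anything that still references {attrelid, attnum}.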
Re: [HACKERS] Fast insertion indexes: why no developments
On 4 November 2013 19:55, Gavin Flower gavinflo...@archidevsys.co.nz wrote: How about having a 'TRANSIENT INDEX' that only exists in memory, so there is no requirement to write it to disk or to replicate it directly? This type of index would be very fast and easier to implement. Recovery would involve rebuilding the index, and sharing would involve recreating it on a slave. Probably not appropriate for a primary index, but may be okay for secondary indexes used to speed up specific queries. This could be useful in some situations now, and allow time to get experience in how best to implement the basic concept. Then a more robust solution using WAL etc. can be developed later. I suspect that such a TRANSIENT INDEX would still be useful even when a more robust in-memory index method was available, as I expect it would be faster to set up than a robust in-memory index - which might be good when you need to have one or more indexes for a short period of time, or when the size of the index is so small that recreating it requires very little time (total elapsed time might even be less than for a disk-backed one?).

UNLOGGED indexes have been discussed over many years and there is pretty good acceptance of the idea for some use cases. The main tasks are:

* marking the index so it is unavailable on a hot standby
* rebuilding the index in full at the end of recovery - requires an additional process to rebuild them, possibly in priority order
* making sure the above doesn't violate security
* marking the index so it can't be used in the planner until the rebuild is complete - subtle stuff

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
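For illustration only: no UNLOGGED index syntax exists in PostgreSQL; by analogy with CREATE UNLOGGED TABLE (which does exist), the proposal above might look something like this entirely hypothetical sketch:

```sql
-- HYPOTHETICAL syntax, not implemented: an index that skips WAL, is
-- unavailable on hot standbys, and is rebuilt at the end of recovery.
CREATE UNLOGGED INDEX orders_created_idx ON orders (created_at);
```

The closest shipped analogue is an index on an UNLOGGED table, where both heap and index skip WAL and are emptied, not rebuilt, after a crash - which is exactly the behavioral gap the task list above would have to close.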
Re: [HACKERS] Shave a few instructions from child-process startup sequence
On Mon, Nov 4, 2013 at 12:20 AM, Tom Lane t...@sss.pgh.pa.us wrote: But we're not buying much. A few instructions during postmaster shutdown is entirely negligible.

The patch is for ClosePostmasterPorts(), which is called from every child process's startup sequence (as $subject also implies), not during postmaster shutdown. I hope that adds some weight to the argument.

Best regards, -- Gurjeet Singh gurjeet.singh.im EnterpriseDB Inc. www.enterprisedb.com
Re: [HACKERS] Fast insertion indexes: why no developments
On 30 October 2013 14:34, Yann Fontana yann.font...@gmail.com wrote: On 30 October 2013 11:23, Leonardo Francalanci m_li...@yahoo.it wrote: In terms of generality, do you think it's worth a man-year of developer effort to replicate what you have already achieved? Who would pay?

I work on an application that does exactly what Leonardo described. We hit the exact same problem, and came up with exactly the same solution (down to the 15-minute interval). But I have also worked on various other datamarts (all using Oracle), and they are all subject to this problem in some form: B-tree indexes slow down bulk data inserts too much and need to be disabled or dropped and then recreated after the load. In some cases this is done easily enough, in others it's more complicated (example: every day, a process imports from 1 million to 1 billion records into a table partition that may contain from 0 to 1 billion records. To be as efficient as possible, you need some logic to compare the number of rows to insert to the number of rows already present, in order to decide whether to drop the indexes or not). Basically, my point is that this is a common problem for datawarehouses and datamarts. In my view, indexes that don't require developers to work around poor insert performance would be a significant feature in a datawarehouse-ready DBMS.

Everybody on this thread is advised to look closely at MinMax indexes before starting any further work. MinMax will give us access to many new kinds of plan, plus they are about as close to perfectly efficient with regard to inserts as it is possible to get - by which I mean almost zero overhead.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
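The drop-and-recreate workaround Yann describes typically looks like the following sketch (table, index, and file names are assumed for illustration):

```sql
-- Bulk-load pattern: drop the B-tree, load without per-row index
-- maintenance, then rebuild the index once over the full table.
DROP INDEX IF EXISTS events_ts_idx;

COPY events FROM '/path/to/batch.csv' WITH (FORMAT csv);

CREATE INDEX events_ts_idx ON events (ts);
```

A single bulk index build sorts the keys once, which is far cheaper than the random leaf-page writes incurred by maintaining the index during the load - and it is exactly this dance that MinMax-style indexes would make unnecessary.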