Re: [HACKERS] RLS restrictive hook policies
On 3 August 2015 at 16:09, Stephen Frost sfr...@snowman.net wrote: Dean, all, * Stephen Frost (sfr...@snowman.net) wrote: Agreed. I'm happy to commit that change and back-patch it to 9.5, barring objections. Given that the only way to have restrictive policies currently is using hooks, I don't believe there's any documentation update required either; of course, I'll update the comment to reflect this discussion/decision and make any necessary changes to test_rls_hooks. Patch attached to make this change, with appropriate comment updates and fixes to the test_rls_hooks modules. Comments? Looks good to me. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
Simon Riggs wrote: * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point It is not possible to skip scanning indexes completely, unless no tuples are to be removed from the heap. Otherwise, index tuples become lingering pointers (and when such heap addresses are later refilled, index scans through them become corrupted). But actually this is an interesting point and I don't think we do this: if in emergency mode, maybe we shouldn't try to remove any dead tuples at all, and instead only freeze very old tuples. That would make such vacuums go much quicker. (More accurately, if the updating xid is older than the freeze point, then remove the tuple, but otherwise keep it.) My point is that emergency vacuums are troublesome for various reasons and it would be better if they did only the minimum possible. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
Hi We've come across a weirdness with ON CONFLICT, where UPSERTing a smallint value produces an error: db=# INSERT INTO brokentab(id, k1,k2,k3,k4,k5,k6,k7, smallval) VALUES (5,0,0,0,1,0,1,0, 0) ON CONFLICT (id, k1,k2,k3,k4,k5,k6,k7) DO UPDATE SET smallval=EXCLUDED.smallval; ERROR: attribute 29 has wrong type DETAIL: Table has type integer, but query expects smallint. If you change the SET to smallval=0 the problem goes away, although using SET smallval=CAST(EXCLUDED.smallval AS smallint) - or indeed AS int - doesn't help at all. If I create a copy of the table using CREATE mytab (LIKE brokentab INCLUDING ALL); INSERT INTO mytab SELECT * FROM brokentab; the new table does not exhibit the same problem (so I'm assuming it's not easily reproducible and giving you a creation script isn't going to help). VACUUM FULL on the table makes no difference. Is there anything you guys can suggest that I can do to help narrow down the problem? Linux Centos 6.5, kernel 2.6.32-431.el6.i686, pgsql alpha1, built from source using gcc 4.4.7. Thanks Geoff
Re: [HACKERS] [PATCH] Microvacuum for gist.
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova a.lubennik...@postgrespro.ru wrote: 1) Test and results are in attachments. Everything seems to work as expected. 2) I dropped these notices. It was done only for debug purposes. Updated patch is attached. 3) fixed Good! Another couple of notes from me: 1) I think gistvacuumpage() and gistkillitems() need function-level comments. 2) ItemIdIsDead() can be used in index scan like it's done in _bt_checkkeys(). -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

I've added some comments. ItemIdIsDead() is used now (dead tuples are simply skipped as not matching the quals). And there is one more check of the LSN in gistkillitems() to make sure that the page was not changed between reads. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 ****
  static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  			   GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! 
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  			   GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
  
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 209,225 ----
  	 * because the tuple vector passed to gistSplit won't include this tuple.
  	 */
  	is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 
+ 	/*
+ 	 * If leaf page is full, try at first to delete dead tuples. And then
+ 	 * check again.
+ 	 */
+ 	if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ 	{
+ 		gistvacuumpage(rel, page, buffer);
+ 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 	}
+ 
  	if (is_split)
  	{
  		/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1451,1522 ----
  	/* It's sufficient to delete the scanCxt */
  	MemoryContextDelete(giststate->scanCxt);
  }
+ 
+ /*
+  * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+  */
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer)
+ {
+ 	OffsetNumber deletable[MaxOffsetNumber];
+ 	int			ndeletable = 0;
+ 	OffsetNumber offnum,
+ 				minoff,
+ 				maxoff;
+ 
+ 	/*
+ 	 * Scan over all items to see which ones need to be deleted according to
+ 	 * LP_DEAD flags.
+ 	 */
+ 	maxoff = PageGetMaxOffsetNumber(page);
+ 	for (offnum = FirstOffsetNumber;
+ 		 offnum <= maxoff;
+ 		 offnum = OffsetNumberNext(offnum))
+ 	{
+ 		ItemId		itemId = PageGetItemId(page, offnum);
+ 
+ 		if (ItemIdIsDead(itemId))
+ 			deletable[ndeletable++] = offnum;
+ 	}
+ 
+ 	if (ndeletable > 0)
+ 	{
+ 		START_CRIT_SECTION();
+ 
+ 		PageIndexMultiDelete(page, deletable, ndeletable);
+ 
+ 		/*
+ 		 * Mark the page as not containing any LP_DEAD items. This is not
+ 		 * certainly true (there might be some that have recently been marked,
+ 		 * but weren't included in our target-item list), but it will almost
+ 		 * always be true and it doesn't seem worth an additional page scan to
+ 		 * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+ 		 */
+ 		GistClearPageHasGarbage(page);
+ 
+ 		MarkBufferDirty(buffer);
+ 
+ 		/* XLOG stuff */
+ 		if (RelationNeedsWAL(rel))
+ 		{
+ 			XLogRecPtr	recptr;
+ 
+ 			recptr = gistXLogUpdate(rel->rd_node, buffer,
+ 									deletable, ndeletable,
+ 									NULL, 0, InvalidBuffer);
+ 
+ 			PageSetLSN(page, recptr);
+ 		}
+ 		else
Re: [HACKERS] Planner debug views
Qingqing Zhou wrote: On Thu, Jul 30, 2015 at 2:42 PM, Jim Nasby jim.na...@bluetreble.com wrote: I think a better option would be shoving it into a backend tuplestore and just leaving it there (maybe with a command to clear it for the paranoid). That gives a relation you can query against, insert into another table, etc. This is something I don't know how to do: in my understanding, a tuplestore is an internal store, which means it has no exposed name. Thus we can't reference it later. Yeah, that doesn't sound like the kind of project you should attempt here. EXPLAIN already knows how to return tuples, so I was assuming you would return your stuff using that mechanism. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Mon, Aug 3, 2015 at 8:55 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jul 16, 2015 at 9:58 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jul 16, 2015 at 9:41 AM, Fujii Masao masao.fu...@gmail.com wrote: Here are some minor comments:

+			ereport(LOG,
+					(errmsg("ignoring \"%s\" file because no \"%s\" file exists",
+							TABLESPACE_MAP, BACKUP_LABEL_FILE),
+					 errdetail("could not rename file \"%s\" to \"%s\": %m",
+							   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));

WARNING is better than LOG here because it indicates a problematic case? No, that's not the right distinction. Remember that, when sending messages to the client, WARNING > LOG, and when sending messages to the log, LOG > WARNING. So messages that a user is more likely to care about than the administrator should be logged at WARNING; those that the administrator is more likely to care about should be LOG. I think LOG is clearly the appropriate thing here. In the detail message, the first word of the sentence needs to be capitalized.

+					 errdetail("renamed file \"%s\" to \"%s\"",
+							   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));

In a detail message, basically we should use a complete sentence. So like other similar detail messages in xlog.c, I think that it's better to use "\"%s\" was renamed to \"%s\"." as the detail message here. Right, that's what the style guidelines say. I have changed the errdetail messages as per the above suggestion. Also, there was one issue (it was displaying 2 messages when rename fails) in the patch which I have fixed in the updated version. Thanks! Pushed. BTW, while reading the code related to tablespace_map, I found that CancelBackup() emits the WARNING message "online backup mode was not canceled" when rename() fails. Isn't this confusing (or incorrect)? ISTM that we can see that the online backup mode has already been canceled if the backup_label file is successfully removed, whether the tablespace_map file remains or not. No? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] track_commit_timestamp and COMMIT PREPARED
Hi, track_commit_timestamp tracks COMMIT PREPARED as expected on a standby server, but not on the master. Is this intentional? Shouldn't it track COMMIT PREPARED on the master as well? Otherwise, we cannot use the commit timestamp feature to check the replication lag properly while we use 2PC. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On 3 August 2015 at 16:53, Geoff Winkless pgsqlad...@geoff.dj wrote: If I create a copy of the table using CREATE mytab (LIKE brokentab INCLUDING ALL); Of course I meant CREATE TABLE here... finger slippage :)
Re: [HACKERS] optimizing vacuum truncation scans
On Mon, Jul 27, 2015 at 1:40 PM, Simon Riggs si...@2ndquadrant.com wrote: On 22 July 2015 at 17:11, Jeff Janes jeff.ja...@gmail.com wrote: On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes jeff.ja...@gmail.com wrote: Attached is a patch that implements the vm scan for truncation. It introduces a variable to hold the last blkno which was skipped during the forward portion. Any blocks after both this blkno and after the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum. Any other process rendering the page nonempty is required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime. So if the bit is still set, the page is still empty without needing to inspect it. Urgh. So if we do this, that forever precludes having HOT pruning set the all-visible bit. I wouldn't say forever, as it would be easy to revert the change if something more important came along that conflicted with it. I think what is being said here is that someone is already using this technique, or if not, then we actively want to encourage them to do so as an extension or as a submission to core. In that case, I think the rely-on-VM technique sinks again, sorry Jim, Jeff. Probably needs code comments added. Sure, that sounds like the consensus. The VM method was very efficient, but I agree it is pretty fragile and restricting. That does still leave the prefetch technique, so all is not lost. Can we see a patch with just prefetch, probably with a simple choice of stride? Thanks. I probably won't get back to it this commit fest, so it can be set to returned with feedback. But if anyone has good ideas for how to set the stride (or detect that it is on SSD and so is pointless to try) I'd love to hear about them anytime. Cheers, Jeff
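[Editor's note: for readers picking this thread up later, here is a rough sketch of the prefetch-only approach being discussed. PrefetchBuffer() is the real bufmgr.h entry point; the stride constant, the helper name, and its placement in the backward truncation scan are illustrative assumptions, and choosing the stride (or skipping prefetch on SSDs) remains the open question raised above.]

#include "postgres.h"

#include "storage/bufmgr.h"

/* Straw-man stride value; tuning it is precisely the unsolved problem. */
#define PREFETCH_STRIDE 32

/*
 * Hypothetically called for each block visited by the backward scan that
 * decides how far the relation can be truncated.  Whenever the scan
 * crosses a stride boundary, request the next stride of lower-numbered
 * blocks, so the following ReadBuffer() calls find pages already in flight.
 */
static void
prefetch_for_truncation_scan(Relation onerel, BlockNumber blkno)
{
	if (blkno % PREFETCH_STRIDE == 0)
	{
		BlockNumber start = (blkno >= PREFETCH_STRIDE) ? blkno - PREFETCH_STRIDE : 0;
		BlockNumber pblkno;

		for (pblkno = start; pblkno < blkno; pblkno++)
			PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
	}
}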
Re: [HACKERS] buffer locking fix for lazy_scan_heap
On Mon, Aug 3, 2015 at 12:03 AM, 高增琦 pgf...@gmail.com wrote: sorry for asking about this really old commit. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7ab9b2f3b79177e501a1ef90ed004cc68788abaf I could not understand why releasing the lock on the buffer is a problem, since vacuum will lock and check the page bit again before setting the vm bit. Did I miss something? Hmm, you might be right. It certainly seems safer to do all the work without releasing the buffer lock temporarily in the middle, but maybe the old way wasn't actually broken. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Mon, Aug 3, 2015 at 2:15 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Sat, Aug 1, 2015 at 9:20 PM, Andres Freund wrote: On August 1, 2015 2:17:24 PM GMT+02:00, Michael Paquier wrote: For instance, if you told me to choose between ShareLock and ShareUpdateExclusiveLock I wouldn't know which one is strongest. I don't think it's sensible to have the lock mode compare primitive, honestly. I don't have any great ideas to offer ATM, sadly. Yes, the thing is that lowering the lock levels is good for concurrency, but the non-monotonicity of the lock levels makes it impossible to choose an intermediate state correctly. How about simply acquiring all the locks individually if they're different types? These few acquisitions won't matter. As long as this only applies on master, this may be fine... We could basically pass a LOCKMASK to the multiple layers of tablecmds.c instead of LOCKMODE to track all the locks that need to be taken, and all the relations open during operations. Would it be worth having some routines like relation_multi_[open|close]() as well? Like that: Relation[] relation_multi_open(relation Oid, LOCKMASK); void relation_multi_close(Relation[]); If we do something, we may consider patching 9.5 as well; it seems to me that tablecmds.c is broken by assuming that the lock hierarchy is monotone in AlterTableGetLockLevel(). Hey guys, Are you sure we need to do all these changes just to check the highest lock level based on the reloptions? Or have I misunderstood the whole thing? Maybe we just check DoLockModesConflict in the new method GetReloptionsLockLevel and if true then fall back to AccessExclusiveLock (as Michael suggested in a previous email). Regards, *** This work is funded by Zenvia Mobile Results (http://www.zenvia.com.br) -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On 2015-08-03 PM 09:24, Ashutosh Bapat wrote: On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas robertmh...@gmail.com wrote: OK, sure. But let's make sure postgres_fdw gets a server-level option to control this. For postgres_fdw it's a boolean server-level option 'twophase_compliant' (suggestions for name welcome). How about just 'twophase'? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Arguable RLS security bug, EvalPlanQual() paranoia
* Peter Geoghegan (p...@heroku.com) wrote: On Mon, Jun 1, 2015 at 12:29 AM, Peter Geoghegan p...@heroku.com wrote: If you're using another well known MVCC database system that has RLS, I imagine when this happens the attacker similarly waits on the conflicting (privileged) xact to finish (in my example in the patch, Bob's xact). However, unlike with the Postgres READ COMMITTED mode, Mallory would then have her malicious UPDATE statement entirely rolled back, and her statement would acquire an entirely new MVCC snapshot, to be used by the USING security barrier qual (and everything else) from scratch. This other system would then re-run her UPDATE with the new MVCC snapshot. This would repeat until Mallory's UPDATE statement completes without encountering any concurrent UPDATEs/DELETEs to her would-be affected rows. In general, with this other database system, an UPDATE must run to completion without violating MVCC, even in READ COMMITTED mode. For that reason, I think we can take no comfort from the presumption that this flexibility in USING security barrier quals (allowing subqueries, etc) works securely in this other system. (I actually didn't check this out, but I imagine it's true). Where are we on this? Discussion during pgCon with Heikki and Andres led me to believe that the issue is acceptable. The issue can be documented to help ensure that user expectation is in line with actual user-visible behavior. Unfortunately, I think that that will be a clunky documentation patch. It's important to realize that this is an issue beyond RLS and that it impacts Security Barrier Views also. One idea which I have for making the documentation patch a bit less clunky is to provide a simple way for users to address the issue. Along those lines, here's what I'd suggest (certainly open for discussion): --- When reducing the set of rows which a user has access to, through modifications to relations referenced by Row-Level Security Policies or Security Barrier Views, be aware that users with a currently open transaction might have a lock on an existing row and be able to see that row after the change. The best approach to avoid any possible leak of information during a reduction of a user's rights is to ensure that the user does not have any open transactions, perhaps by ensuring they have no concurrent sessions running. --- Thoughts? Trying to keep it straight-forward and provide a simple solution for users to be able to address the issue, if they're worried about it. Perhaps this, plus an additional paragraph which goes into more detail about exactly what's going on? Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Arguable RLS security bug, EvalPlanQual() paranoia
On Mon, Aug 3, 2015 at 3:07 PM, Stephen Frost sfr...@snowman.net wrote: Thoughts? Trying to keep it straight-forward and provide a simple solution for users to be able to address the issue, if they're worried about it. Perhaps this, plus an additional paragraph which goes into more detail about exactly what's going on? I'm still thinking about it, but I think you have the right idea here. However, as the docs put it when talking about conventional access controls and SELECT: The use of FOR UPDATE or FOR SHARE requires UPDATE privilege as well [as SELECT privilege] (for at least one column of each table so selected). I wonder if RLS needs to consider this, too. If you can just say that you don't have to worry about this when the user has no right to UPDATE or DELETE the rows in the first place, that makes it more practical to manage the issue. ISTM that as things stand, you can't say that because RLS does not consider SELECT FOR UPDATE special in any way. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Tue, Aug 4, 2015 at 9:12 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: Are you sure we need to do all these changes just to check the highest lock level based on the reloptions? Well, looking at the code, that's what it looks like. That's a large change, and not that straightforward. Or have I misunderstood the whole thing? No, I think we are on the same page. Maybe we just check DoLockModesConflict in the new method GetReloptionsLockLevel and if true then fall back to AccessExclusiveLock (as Michael suggested in a previous email). Honestly, that's what I would suggest for this patch, and also to fix the existing problem in tablecmds.c that makes the same assumption. It seems saner to me for now than adding a whole new level of routines and wrappers, and your patch has IMO great value when each ALTER command is kicked individually. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] FSM versus GIN pending list bloat
For a GIN index with fastupdate turned on, both the user backends and the autoanalyze routine will clear out the pending list, pushing the entries into the normal index structure and deleting the pages used by the pending list. But those deleted pages will not get added to the free space map until a vacuum is done. This leads to horrible bloat on insert-only tables, as they are never vacuumed and so the pending-list space is never reused. And the pending list is very inefficient in space usage to start with, even compared to the old-style posting lists and especially compared to the new compressed ones. (If the pages were aggressively recycled, this inefficient use wouldn't be much of a problem.) Even on a table receiving mostly updates after its initial population (and so being vacuumed regularly) with default autovacuum settings, there is a lot of bloat. The attached proof-of-concept patch greatly improves the bloat for both the insert and the update cases. You need to turn on both features: adding the pages to the FSM, and vacuuming the FSM, to get the benefit (so JJ_GIN=3). The first of those two things could probably be adopted for real, but the second probably is not acceptable. What is the right way to do this? Could a variant of RecordFreeIndexPage bubble the free space up the map immediately rather than waiting for a vacuum? It would only have to move up until it found a page with free space already recorded in it, which the vast majority of the time would mean looking up one level and then not writing to it, assuming the pending-list pages remain well clustered. Or would a completely different approach be better, like managing the vacated pending-list pages directly in the index without going to the FSM? Cheers, Jeff gin_fast_freespace.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
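[Editor's note: one possible shape of the "make the free space visible immediately" idea, sketched under stated assumptions. RecordFreeIndexPage() and FreeSpaceMapVacuum() are real freespace-map entry points; the helper name and the idea of vacuuming the FSM right after releasing the pending-list pages (rather than waiting for VACUUM) are the experimental part, roughly what the JJ_GIN=3 behavior of the proof-of-concept patch does.]

#include "postgres.h"

#include "storage/freespace.h"
#include "storage/indexfsm.h"

/*
 * Hypothetical hook into the pending-list cleanup: make the pages just
 * deleted by the list shrink reusable right away.
 */
static void
release_pending_list_pages(Relation index, BlockNumber *freed, int nfreed)
{
	int			i;

	/* Record each deleted pending-list page in the free space map... */
	for (i = 0; i < nfreed; i++)
		RecordFreeIndexPage(index, freed[i]);

	/*
	 * ...and propagate the new free space up through the FSM tree now,
	 * instead of waiting for the next VACUUM, so that inserters can
	 * actually find and reuse the pages immediately.
	 */
	FreeSpaceMapVacuum(index);
}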
[HACKERS] tablecmds.c and lock hierarchy
Hi all, As mentioned in the thread related to lowering locks of autovacuum reloptions in ALTER TABLE SET (http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com), I have noticed the following code in AlterTableGetLockLevel@tablecmds.c:

		/*
		 * Take the greatest lockmode from any subcommand
		 */
		if (cmd_lockmode > lockmode)
			lockmode = cmd_lockmode;

The thing is that, as mentioned by Alvaro and Andres on this thread, we have no guarantee that the different relation locks compared have a monotone hierarchy and we may finish by taking a lock that does not behave as you would like to. We are now lucky enough that ALTER TABLE only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and AccessExclusiveLock, which actually have a hierarchy, so this is not a problem yet. However it may become a problem if we add more lock modes in the future that are used by ALTER TABLE. One idea mentioned in the ALTER TABLE SET thread was to enforce the lock to AccessExclusiveLock should two locks of any subcommands differ. Another idea was to select all the locks of the subcommands found to be present, and open the different relations with all of them. Tracking all those locks is going to require some heavy rewriting, like for example the use of a LOCKMASK to track the locks that need to be taken, and more extended routines for relation_open. Other possibilities may be to add a comment on top of AlterTableGetLockLevel so that we do not use a lock in ALTER TABLE commands that may result in a choice conflict with the other ones, to simply do nothing because committers are usually very careful people, or to track all the locks taken in AlterTableGetLockLevel and then run DoLockModesConflict and ERROR if there are incompatible locks. I am attaching a patch that implements the first approach mentioned, to give a short idea of how things could be improved. Thoughts? -- Michael

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 970abd4..0bcdd76 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2807,7 +2807,8 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
  * Function is called before and after parsing, so it must give same
  * answer each time it is called. Some subcommands are transformed
  * into other subcommand types, so the transform must never be made to a
- * lower lock level than previously assigned. All transforms are noted below.
+ * lock level different than previously assigned. All transforms are noted
+ * below.
  *
  * Since this is called before we lock the table we cannot use table metadata
  * to influence the type of lock we acquire.
@@ -2837,6 +2838,7 @@ AlterTableGetLockLevel(List *cmds)
 	 */
 	ListCell   *lcmd;
 	LOCKMODE	lockmode = ShareUpdateExclusiveLock;
+	bool		is_init = false;
 
 	foreach(lcmd, cmds)
 	{
@@ -3057,10 +3059,22 @@ AlterTableGetLockLevel(List *cmds)
 		}
 
 		/*
-		 * Take the greatest lockmode from any subcommand
+		 * At the first command evaluated, take the lock corresponding to
+		 * it. When looking at the next commands, check if the lock taken
+		 * is different than the first one taken and enforce it to
+		 * AccessExclusiveLock, which is the only lock superseding all
+		 * the others.
		 */
-		if (cmd_lockmode > lockmode)
+		if (!is_init)
+		{
 			lockmode = cmd_lockmode;
+			is_init = true;
+		}
+		else
+		{
+			if (cmd_lockmode != lockmode)
+				lockmode = AccessExclusiveLock;
+		}
 	}
 
 	return lockmode;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind tap test unstable
On Mon, Aug 3, 2015 at 5:35 PM, Christoph Berg m...@debian.org wrote: Re: Michael Paquier 2015-07-28 cab7npqqcpgy3u7cmfo8sqquobsfmeieohueslxwycc64j3g...@mail.gmail.com On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg m...@debian.org wrote: for something between 10% and 20% of the devel builds for apt.postgresql.org (which happen every 6h if there's a git change, so it happens every few days), I'm seeing this: Dubious, test returned 1 (wstat 256, 0x100) Failed 1/8 subtests I don't have the older logs available, but from memory, the subtest failing and the two numbers mentioned are always the same. There should be some output logs in src/bin/pg_rewind/tmp_check/log/*? Could you attach them here if you have them? That would be helpful to understand what is happening. It took me a few attempts to tell the build environment to save a copy on failure and not shred everything right away. So here we go: In test case 2, the failure is that the standby did not have the time to replicate the database "beforepromotion" that was created on the master. One possible explanation for this failure is that the standby was promoted before all the WAL needed for the tests had been replayed, hence we had better be sure that the replay_location of the standby matches pg_current_xlog_location() before promotion. On the buildfarm, hamster, the legendary slowest machine of the whole set, does not complain about that. Is your environment that heavily loaded when you run the tests? Perhaps the attached patch helps? -- Michael

diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index b66ff0d..fce725f 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -232,6 +232,13 @@ sub promote_standby
 	# Now run the test-specific parts to run after standby has been started
 	# up standby
 
+	# Wait until the standby has caught up with the primary, by polling
+	# pg_stat_replication.
+	my $caughtup_query =
+	  "SELECT pg_current_xlog_location() <= replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+	poll_query_until($caughtup_query, $connstr_master)
+	  or die "Timed out while waiting for standby to catch up";
+
 	# Now promote slave and insert some new data on master, this will put
 	# the master out-of-sync with the standby. Wait until the standby is
 	# out of recovery mode, and is ready to accept read-write connections.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Jul 31, 2015 4:22 AM, Jeremy Harris j...@wizmail.org wrote: On 30/07/15 02:05, Peter Geoghegan wrote: Since heapification is now a big fraction of the total cost of a sort sometimes, even where the heap invariant need not be maintained for any length of time afterwards, it might be worth revisiting the patch to make that an O(n) rather than an O(n log n) operation [3]. O(n log n)? Heapification is O(n) already, whether by siftup (existing) or siftdown. They are both linear on average, but the way we currently do it has an O(n log n) worst case, while the other way is linear even in the worst case. Cheers, Jeff
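[Editor's note: for reference, the linear-worst-case construction being discussed is bottom-up ("Floyd") heapification: sift each internal node down, starting from the last internal node and working back to the root. A self-contained sketch, with plain ints standing in for SortTuples:]

static void
sift_down(int *a, int n, int i)
{
	for (;;)
	{
		int		largest = i;
		int		l = 2 * i + 1;
		int		r = 2 * i + 2;

		if (l < n && a[l] > a[largest])
			largest = l;
		if (r < n && a[r] > a[largest])
			largest = r;
		if (largest == i)
			break;				/* heap property holds below this node */

		/* swap the node with its larger child and continue downward */
		{
			int		tmp = a[i];

			a[i] = a[largest];
			a[largest] = tmp;
		}
		i = largest;
	}
}

/* O(n) even in the worst case, unlike inserting n elements via siftup. */
static void
heapify(int *a, int n)
{
	int		i;

	for (i = n / 2 - 1; i >= 0; i--)
		sift_down(a, n, i);
}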
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Mon, Aug 3, 2015 at 7:44 PM, Fujii Masao masao.fu...@gmail.com wrote: Thanks! Pushed. Thanks to you as well for committing the patch. BTW, while reading the code related to tablespace_map, I found that CancelBackup() emits the WARNING message "online backup mode was not canceled" when rename() fails. Isn't this confusing (or incorrect)? Yes, it looks confusing. ISTM that we can see that the online backup mode has already been canceled if the backup_label file is successfully removed, whether the tablespace_map file remains or not. No? I think what we should do is display the successful-cancellation message only when both files are renamed. I have drafted a patch (I still need to verify/test it; I will do that if you think the fix is in the right direction) to show what I have in mind. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com fix_rename_backup_and_mapfile_cancel_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] cost_agg() with AGG_HASHED does not account for startup costs
While working on allowing the planner to perform GROUP BY before joining, I've noticed that cost_agg() completely ignores input_startup_cost when aggstrategy == AGG_HASHED. I can see at least 3 calls to cost_agg() which pass aggstrategy as AGG_HASHED while also passing a possibly non-zero startup cost. The attached patch changes cost_agg() to include the startup cost, but does nothing yet about the plans that change in the regression tests. Is this really intended? Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7069f60..06870a3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1639,7 +1639,8 @@ cost_agg(Path *path, PlannerInfo *root,
 	else
 	{
 		/* must be AGG_HASHED */
-		startup_cost = input_total_cost;
+		startup_cost = input_startup_cost;
+		startup_cost += input_total_cost;
 		startup_cost += aggcosts->transCost.startup;
 		startup_cost += aggcosts->transCost.per_tuple * input_tuples;
 		startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On Tue, Aug 4, 2015 at 2:04 AM, Simon Riggs si...@2ndquadrant.com wrote: On 3 August 2015 at 17:36, Tom Lane t...@sss.pgh.pa.us wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Simon Riggs wrote: * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point It is not possible to skip scanning indexes completely, unless no tuples are to be removed from the heap. Right. But actually this is an interesting point and I don't think we do this: if in emergency mode, maybe we shouldn't try to remove any dead tuples at all, and instead only freeze very old tuples. +1 ... not sure if that's what Simon had in mind exactly, but it seems like a correct statement of what he was getting at. Yes, that's what I was thinking, I just didn't say actually it. I'd been thinking about having VACUUM do just Phase 1 for some time, since its so much faster to do that. Will code. Interesting. I'll be happy to have a look at any patch produced, that's surely something we want to improve in emergency mode. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablecmds.c and lock hierarchy
Michael Paquier wrote: As mentioned in the thread related to lowering locks of autovacuum reloptions in ALTER TABLE SET (http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com), I have noticed the following code in AlterTableGetLockLevel@tablecmds.c:

		/*
		 * Take the greatest lockmode from any subcommand
		 */
		if (cmd_lockmode > lockmode)
			lockmode = cmd_lockmode;

The thing is that, as mentioned by Alvaro and Andres on this thread, we have no guarantee that the different relation locks compared have a monotone hierarchy and we may finish by taking a lock that does not behave as you would like to. Maybe the solution to this is to add the concept of addition of two lock modes, where the result is another lock mode that conflicts with any lock that would conflict with either of the two operand lock modes. For instance, if you add a lock mode to itself, you get the same lock mode; if you add anything to AccessExclusive, you get AccessExclusive; if you add anything to AccessShare, you end up with that other lock mode. The most interesting case in our current lock table is if you add ShareUpdateExclusive to Share, where you end up with ShareRowExclusive. In essence, holding the result of that operation is the same as holding both locks, which AFAICS is the behavior we want. This can be implemented with a simple table and solves the problem for both ALTER TABLE SET and this existing usage you cite and is not as invasive as your first proposed solution. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
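[Editor's note: a sketch of how such an addition could even be derived mechanically from the conflict matrix instead of hand-maintaining a second table. The conflict masks below mirror lock.c's private LockConflicts[] table (bit i set in entry m means mode m conflicts with mode i); lock_mode_add() itself is a made-up name, not an existing routine.]

#define LOCKBIT(m) (1 << (m))

enum
{
	AccessShareLock = 1, RowShareLock, RowExclusiveLock,
	ShareUpdateExclusiveLock, ShareLock, ShareRowExclusiveLock,
	ExclusiveLock, AccessExclusiveLock
};

/* Mirror of lock.c's conflict table; entry 0 (NoLock) is unused. */
static const int lock_conflicts[] = {
	0,
	LOCKBIT(8),											/* AccessShare */
	LOCKBIT(7) | LOCKBIT(8),							/* RowShare */
	LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* RowExclusive */
	LOCKBIT(4) | LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* ShareUpdateExclusive */
	LOCKBIT(3) | LOCKBIT(4) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* Share */
	LOCKBIT(3) | LOCKBIT(4) | LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* ShareRowExclusive */
	LOCKBIT(2) | LOCKBIT(3) | LOCKBIT(4) | LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* Exclusive */
	LOCKBIT(1) | LOCKBIT(2) | LOCKBIT(3) | LOCKBIT(4) | LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8)	/* AccessExclusive */
};

/* Return the weakest mode whose conflict set covers both operands'. */
static int
lock_mode_add(int a, int b)
{
	int			wanted = lock_conflicts[a] | lock_conflicts[b];
	int			m;

	for (m = AccessShareLock; m <= AccessExclusiveLock; m++)
		if ((lock_conflicts[m] & wanted) == wanted)
			return m;
	return AccessExclusiveLock; /* unreachable: it conflicts with all modes */
}

With this definition, lock_mode_add(ShareUpdateExclusiveLock, ShareLock) indeed returns ShareRowExclusiveLock, matching the interesting case above, and adding any mode to itself returns that same mode.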
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson memissemer...@gmail.com wrote: Hello, Just looking at how the 2 different methods can be used to set the s_s_names value. 1. For a simple case where quorum is required for a single group the JSON could be:

{"sync_standby_names":
    {"quorum": 2,
     "nodes": ["node1", "node2", "node3"]
    }
}

or

{"sync_standby_names":
    {"quorum": 2,
     "group": "cluster1"
    },
 "groups":
    {"cluster1": ["node1", "node2", "node3"]}
}

Language: 2(node1, node2, node3)

2. For having quorum between different groups and a node:

{"sync_standby_names":
    {"quorum": 2,
     "nodes":
        [{"priority": 1, "nodes": ["node0"]},
         {"quorum": 2, "group": "cluster1"}]
    },
 "groups":
    {"cluster1": ["node1", "node2", "node3"]}
}

or

{"sync_standby_names":
    {"quorum": 2,
     "nodes":
        [{"priority": 1, "group": "cluster2"},
         {"quorum": 2, "group": "cluster1"}]
    },
 "groups":
    {"cluster1": ["node1", "node2", "node3"],
     "cluster2": ["node0"]}
}

Language: 2 (node0, cluster1: 2(node1, node2, node3))

Since there will not be much nesting and grouping, I still prefer the new language to JSON. I understand one can easily modify/add groups in JSON using built-in functions, but I think changes will not be made too often. If we decide to use a dedicated language, a syntax checker for that language is needed, callable via SQL or something similar. Otherwise we will not be able to know whether parsing that value will succeed until reloading or restarting the server. Regards, -- Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
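[Editor's note: the syntax check could also live in a GUC check hook, so that a bad value is rejected at SIGHUP time rather than at first use. A deliberately tiny sketch follows; the grammar it accepts (names, commas, and "N ( ... )" nesting) and the function name are stand-ins for whatever dedicated language is eventually chosen.]

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical validator to be wired up as the GUC's check hook. */
static bool
check_sync_names(const char *s)
{
	int			depth = 0;

	for (; *s; s++)
	{
		if (*s == '(')
			depth++;
		else if (*s == ')')
		{
			if (--depth < 0)
				return false;	/* closing paren with no opener */
		}
		else if (!isalnum((unsigned char) *s) &&
				 strchr(" \t,_-:", *s) == NULL)
			return false;		/* character outside the tiny grammar */
	}
	return depth == 0;			/* every group must be closed */
}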
Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value. As far as I have experimented in my own evaluation environment using PostgreSQL 9.4.4 on x86_64 Linux, this problem can be fixed with the attached patch. I have confirmed that applying the patch makes 'wal_buffers = 4GB' work fine, while original PostgreSQL 9.4.4 results in a core dump (segfault). I'll be happy if anyone can reconfirm this. -- Takashi Horikawa NEC Corporation Knowledge Discovery Research Laboratories -----Original Message----- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas Sent: Tuesday, August 04, 2015 2:29 AM To: Josh Berkus Cc: PostgreSQL-development Subject: Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes On Fri, Jul 31, 2015 at 8:09 PM, Josh Berkus j...@agliodbs.com wrote: On 07/31/2015 10:43 AM, Robert Haas wrote: On Thu, Jul 30, 2015 at 9:17 PM, Josh Berkus j...@agliodbs.com wrote: In guc.c, the maximum for wal_buffers is INT_MAX. However, wal_buffers is actually measured in 8KB buffers, not in bytes. This means that users are able to set wal_buffers > 2GB. When the database is started, this can cause a core dump if the WAL offset is > 2GB. Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value. The underlying issue is that the byte position in wal_buffers is a 32-bit int, so as soon as you exceed that, core dump. OK. So capping it sounds like the right approach, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers allow_wal_buffer_over_2G.patch Description: Binary data smime.p7s Description: S/MIME cryptographic signature
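[Editor's note: to make the failure mode concrete, here is a minimal standalone sketch of the suspected arithmetic, assuming (as an illustration, not the actual xlog.c code) that a WAL-buffer byte offset is computed in a signed 32-bit variable.]

#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ 8192

int
main(void)
{
	/* wal_buffers = 4GB worth of 8KB pages */
	int64_t		wal_buffers = ((int64_t) 4 << 30) / XLOG_BLCKSZ;
	int64_t		pageno = wal_buffers - 1;	/* last page of the ring */

	/* byte offset of that page, computed in 32 and in 64 bits */
	int32_t		offset32 = (int32_t) (pageno * XLOG_BLCKSZ);
	int64_t		offset64 = pageno * XLOG_BLCKSZ;

	printf("32-bit offset: %d\n", offset32);				/* prints -8192 */
	printf("64-bit offset: %lld\n", (long long) offset64);	/* prints 4294959104 */
	return 0;
}

A negative offset fed into pointer arithmetic against the shared WAL buffer pool is an out-of-bounds access, hence the segfault; either capping wal_buffers below 2GB (as proposed upthread) or widening the offset arithmetic (presumably what the attached patch does) avoids it.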
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Mon, Aug 3, 2015 at 1:36 PM, Robert Haas robertmh...@gmail.com wrote: I don't think that's what Heikki is talking about. He can correct me if I'm wrong, but what I think he's saying is that we should try to exploit the fact that we've already determined which in-memory tuples can be part of the current run and which in-memory tuples must become part of the next run. Suppose half the tuples in memory can become part of the current run and the other half must become part of the next run. If we throw all of those tuples into a single bucket and quicksort it, we're losing the benefit of the comparisons we did to figure out which tuples go in which runs. Instead, we could quicksort the current-run tuples and, separately, quick-sort the next-run tuples. Ignore the current-run tuples completely until the tape is empty, and then merge them with any next-run tuples that remain. Oh. Well, the benefit of the comparisons we did to figure out which tuples go in which runs is that we can determine the applicability of this optimization. Also, by keeping run 1 (if any) usually in memory, and run 0 partially on disk we avoid having to worry about run 1 as a thing that spoils the optimization (in the current single run optimization, dumping all tuples can make us acknowledge run 1 (i.e. currentRun++), preventing single run optimization, which we handily avoid in the patch). Finally, it saves us a bunch of real COMPARETUP() comparisons as HEAPCOMPARE() is called as tuples are inserted into the still-heapified memtuples array. I'm not sure if there's any reason to believe that would be faster than your approach. In general, sorting is O(n lg n) so sorting two arrays that are each half as large figures to be slightly faster than sorting one big array. But the difference may not amount to much. IMV, the smart way of avoiding wasting the comparisons we did to figure out which tuples go in which runs is to rig HEAPCOMPARE() to only do a COMPARETUP() for the currentRun, and make sure that we don't mess up and forget that if we don't end up quicksorting. The second run that is in memory can only consist of whatever tuples were added after heapification that were less than any of those previously observed tuples (a big majority, usually). So like you, I can't see any of these techniques helping much, even my smart technique. Maybe I should look at a case involving text or something to be sure. Thinking about it some more, I don't think it would be easy to maintain a clear separation between run 0 and run 1 in the memtuples array in terms of a cutoff point. It's still a heap at that stage, of course. You'd have to rig each tuple comparator so that COMPARETUP() cared about tupindex before comparing datum1 just for this, which seems rather unappealing. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
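[Editor's note: to make the "rig HEAPCOMPARE()" idea concrete, here is a sketch of a run-aware comparator. It is not the actual tuplesort.c code: SortTupleSketch and COMPARETUP are simplified stand-ins for SortTuple and the datatype-specific comparator dispatch.]

typedef struct SortTupleSketch
{
	void	   *tuple;			/* the tuple proper */
	int			tupindex;		/* run number during run building */
} SortTupleSketch;

/* Placeholder for the real datatype-specific comparison. */
extern int	real_comparetup(const void *a, const void *b);
#define COMPARETUP(a, b) real_comparetup((a)->tuple, (b)->tuple)

static int
heap_compare_sketch(const SortTupleSketch *a, const SortTupleSketch *b,
					int currentRun)
{
	/* Tuples of an earlier run always sort before those of later runs. */
	if (a->tupindex != b->tupindex)
		return (a->tupindex < b->tupindex) ? -1 : 1;

	/*
	 * Both tuples belong to the next, in-memory run: their relative order
	 * in the heap is irrelevant, because that run will be quicksorted
	 * before it is used.  Skip the expensive comparison.
	 */
	if (a->tupindex != currentRun)
		return 0;

	/* Current-run tuples must still be fully ordered for emission to tape. */
	return COMPARETUP(a, b);
}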
Re: [HACKERS] [PATCH] Microvacuum for gist.
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova a.lubennik...@postgrespro.ru wrote: 1) Test and results are in attachments. Everything seems to work as expected. 2) I dropped these notices. It was done only for debug purposes. Updated patch is attached. 3) fixed Good! Another couple of notes from me: 1) I think gistvacuumpage() and gistkillitems() need function-level comments. 2) ItemIdIsDead() can be used in index scan like it's done in _bt_checkkeys(). -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] pageinspect patch, for showing tuple data
In a message of 3 August 2015 14:30:46, Michael Paquier wrote: On Mon, Aug 3, 2015 at 1:03 AM, Nikolay Shaplov n.shap...@postgrespro.ru wrote: Hi! I've created a patch for pageinspect that allows one to see the data stored in a tuple. This patch has two main purposes: 1. Practical: Make manual DB recovery more simple What are you referring to in this case? Manual manipulation of on-disk data? Yes, when the DB is broken, for example. b) I have plenty of sanity checks while reading/parsing that tuple; for this function I've changed the error level from ERROR to WARNING. This function will allow us to write proper tests that all these checks work as they were designed (I hope to write these tests sooner or later).

+			ereport(inter_call_data->error_level,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("Data corruption: Iterating over tuple data reached out of actual tuple size")));

I don't think that the possibility to raise a WARNING is a good thing in those code paths. If data is corrupted this may crash, and I am not sure that anybody would want that even for educational purposes. Hm... I consider _heap_page_items a really dangerous function, with a big red "Do not call it if you are not sure what you are doing" warning. Reading data with an improper attribute descriptor is dangerous anyway. But when I wrote that code, I did not have those checks at first, and it was really interesting to substitute one kind of data for another and see what would happen. And I thought that maybe other explorers would like to do the same. And that is really possible only in warning mode. So I left warnings only in _heap_page_items, and set errors for all other cases. -- Nikolay Shaplov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Microvacuum for gist.
On 30.07.2015 16:33, Alexander Korotkov wrote: Hi! On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova lubennikov...@gmail.com wrote: I have written microvacuum support for the gist access method. Briefly, microvacuum includes two steps: 1. When a search tells us that a tuple is invisible to all transactions, it is marked LP_DEAD and the page is marked as having dead tuples. 2. Then, when an insert touches a full page which has dead tuples, it calls microvacuum instead of splitting the page. You can find a kind of review here [1]. [1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120 Patch is in attachments. Please review it. Nice! Some notes about this patch. 1) Could you give some test case demonstrating that microvacuum really works with the patch? In the end, we should see the index growing less with microvacuum. 2) Generating notices for every dead tuple would be too noisy. I suggest to replace the notice with one of the debug levels.

+	elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
+		 offnum, BufferGetBlockNumber(buffer));

3) Please, recheck the coding style. For instance, this line needs more spaces and the open brace should be on the next line.

+	if ((scan->kill_prior_tuple) && (so->curPageData > 0) && (so->curPageData == so->nPageData)) {

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

1) Test and results are in attachments. Everything seems to work as expected. 2) I dropped these notices. It was done only for debug purposes. Updated patch is attached. 3) fixed -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 ****
  static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  			   GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! 
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  			   GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
  
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 209,225 ----
  	 * because the tuple vector passed to gistSplit won't include this tuple.
  	 */
  	is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 
+ 	/*
+ 	 * If leaf page is full, try at first to delete dead tuples. And then
+ 	 * check again.
+ 	 */
+ 	if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ 	{
+ 		gistvacuumpage(rel, page, buffer);
+ 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 	}
+ 
  	if (is_split)
  	{
  		/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1451,1519 ----
  	/* It's sufficient to delete the scanCxt */
  	MemoryContextDelete(giststate->scanCxt);
  }
+ 
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer)
+ {
+ 	OffsetNumber deletable[MaxOffsetNumber];
+ 	int			ndeletable = 0;
+ 	OffsetNumber offnum,
+ 				minoff,
+ 				maxoff;
+ 
+ 	/*
+ 	 * Scan over all items to see which ones need to be deleted according to
+ 	 * LP_DEAD flags.
+ 	 */
+ 	maxoff = PageGetMaxOffsetNumber(page);
+ 	for (offnum = FirstOffsetNumber;
+ 		 offnum <= maxoff;
+ 		 offnum = OffsetNumberNext(offnum))
+ 	{
+ 		ItemId		itemId = PageGetItemId(page, offnum);
+ 
+ 		if (ItemIdIsDead(itemId))
+ 			deletable[ndeletable++] = offnum;
+ 	}
+ 
+ 	if (ndeletable > 0)
+ 	{
+ 		START_CRIT_SECTION();
+ 
+ 		PageIndexMultiDelete(page, deletable, ndeletable);
+ 
+ 		/*
+ 		 * Mark the page as not containing any LP_DEAD items. This is not
+ 		 * certainly true (there might be some that have recently been marked,
+ 		 * but weren't included in our target-item list), but it will almost
+ 		 * always be true and it doesn't seem worth an additional
Re: [HACKERS] pg_rewind tap test unstable
Re: Michael Paquier 2015-07-28 cab7npqqcpgy3u7cmfo8sqquobsfmeieohueslxwycc64j3g...@mail.gmail.com On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg m...@debian.org wrote: for something between 10% and 20% of the devel builds for apt.postgresql.org (which happen every 6h if there's a git change, so it happens every few days), I'm seeing this: Dubious, test returned 1 (wstat 256, 0x100) Failed 1/8 subtests I don't have the older logs available, but from memory, the subtest failing and the two numbers mentioned are always the same. There should be some output logs in src/bin/pg_rewind/tmp_check/log/*? Could you attach them here if you have them? That would be helpful to understand what is happening. It took me a few attempts to tell the build environment to save a copy on failure and not shred everything right away. So here we go: Christoph -- c...@df7cb.de | http://www.df7cb.de/ master.log.gz Description: application/gzip regress_log_001_basic.gz Description: application/gzip regress_log_002_databases.gz Description: application/gzip regress_log_003_extrafiles.gz Description: application/gzip standby.log.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On 2 August 2015 at 13:13, Michael Paquier michael.paqu...@gmail.com wrote: Hi all, Commit 4046e58c (dating from 2001) has introduced the following comment in vacuumlazy.c:

+	/* If any tuples need to be deleted, perform final vacuum cycle */
+	/* XXX put a threshold on min number of tuples here? */
+	if (vacrelstats->num_dead_tuples > 0)

In short, we may want to have a reloption to decide whether or not we do the last pass of VACUUM depending on a given number of remaining tuples. Is this still something we would like to have? I don't think we want a new user parameter, but we should have an internal limit with a heuristic, similar to how we decide whether to truncate. I would suggest this internal logic... * If it's a VACUUM FREEZE then index_scan_threshold = 0, i.e. always scan if needed, since the user is requesting maximum vacuum * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point * For normal VACUUMs we should scan indexes only if (num_dead_tuples * 20) > (blocks to be scanned in any one index), which allows some index bloat but not much -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
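[Editor's note: a sketch of how that heuristic could look in C. The function and parameter names are illustrative, not existing vacuumlazy.c symbols, although LVRelStats and its num_dead_tuples field are real.]

static bool
lazy_should_scan_indexes(LVRelStats *vacrelstats,
						 BlockNumber index_blocks,	/* blocks to scan in any one index */
						 bool is_freeze,
						 bool is_emergency_wraparound)
{
	if (vacrelstats->num_dead_tuples == 0)
		return false;			/* nothing to remove from the indexes anyway */
	if (is_freeze)
		return true;			/* user is requesting maximum vacuum */
	if (is_emergency_wraparound)
		return false;			/* index cleanup is not critical path here */

	/* Normal VACUUM: tolerate a little index bloat, but not much. */
	return ((int64) vacrelstats->num_dead_tuples * 20) > (int64) index_blocks;
}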
Re: [HACKERS] Tab completion for CREATE SEQUENCE
On Tue, Jul 7, 2015 at 9:14 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-19 06:41:19 +0000, Brendan Jurd wrote: I'm marking this Waiting on Author. Once the problems have been corrected, it should be ready for a committer. Vik, are you going to update the patch? Seeing no activity on this thread and as it would be a waste not to do it, I completed the patch as attached. The following things are done: - Fixed code indentation - Removal of RESTART, SET SCHEMA, OWNER TO, and RENAME TO in CREATE SEQUENCE - Added START WITH. - Added TEMP/TEMPORARY in the set of keywords tracked. I am switching at the same time this patch as Ready for committer. Regards, -- Michael

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index ece0515..0748284 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2467,6 +2467,35 @@ psql_completion(const char *text, int start, int end)
 			 pg_strcasecmp(prev_wd, "TO") == 0)
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables, NULL);
 
+/* CREATE TEMP/TEMPORARY SEQUENCE <name> */
+	else if ((pg_strcasecmp(prev3_wd, "CREATE") == 0 &&
+			  pg_strcasecmp(prev2_wd, "SEQUENCE") == 0) ||
+			 (pg_strcasecmp(prev4_wd, "CREATE") == 0 &&
+			  (pg_strcasecmp(prev3_wd, "TEMP") == 0 ||
+			   pg_strcasecmp(prev3_wd, "TEMPORARY") == 0) &&
+			  pg_strcasecmp(prev2_wd, "SEQUENCE") == 0))
+	{
+		static const char *const list_CREATESEQUENCE[] =
+		{"INCREMENT BY", "MINVALUE", "MAXVALUE", "RESTART", "NO", "CACHE",
+		"CYCLE", "OWNED BY", "START WITH", NULL};
+
+		COMPLETE_WITH_LIST(list_CREATESEQUENCE);
+	}
+/* CREATE TEMP/TEMPORARY SEQUENCE <name> NO */
+	else if ((pg_strcasecmp(prev4_wd, "CREATE") == 0 &&
+			  pg_strcasecmp(prev3_wd, "SEQUENCE") == 0) ||
+			 (pg_strcasecmp(prev5_wd, "CREATE") == 0 &&
+			  (pg_strcasecmp(prev4_wd, "TEMP") == 0 ||
+			   pg_strcasecmp(prev4_wd, "TEMPORARY") == 0) &&
+			  pg_strcasecmp(prev3_wd, "SEQUENCE") == 0) &&
+			 pg_strcasecmp(prev_wd, "NO") == 0)
+	{
+		static const char *const list_CREATESEQUENCE2[] =
+		{"MINVALUE", "MAXVALUE", "CYCLE", NULL};
+
+		COMPLETE_WITH_LIST(list_CREATESEQUENCE2);
+	}
+
 /* CREATE SERVER <name> */
 	else if (pg_strcasecmp(prev3_wd, "CREATE") == 0 &&
 			 pg_strcasecmp(prev2_wd, "SERVER") == 0)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
Peter Geoghegan writes: On Fri, Jul 31, 2015 at 5:56 PM, Andreas Seltenreich seltenre...@gmx.de wrote: sqlsmith triggered the following assertion in master (c188204). Thanks for writing sqlsmith. It seems like a great tool. I wonder, are you just running the tool with assertions enabled when PostgreSQL is built? Right. I have to admit my testing setup is still more tailored towards testing sqlsmith than postgres. If so, it might make sense to make various problems more readily detected. As you may know, Clang has a pretty decent option called AddressSanitizer that can detect memory errors as they occur with an overhead that is not excessive. I didn't know about this clang feature yet, thanks for pointing it out. I had considered running some instances under valgrind to detect these, but the performance penalty seemed not worth it. One might use the following configure arguments when building PostgreSQL to use AddressSanitizer: ./configure CC=clang CFLAGS='-O1 -g -fsanitize=address -fno-omit-frame-pointer -fno-optimize-sibling-calls' --enable-cassert A quick attempt to sneak these in made my ansible playbooks unhappy due to "make check" failures and other generated noise. I'll try to have an instance with AddressSanitizer active soon though. Of course, it remains to be seen if this pays for itself. Apparently the tool has about a 2x overhead [1]. I'm really not sure that you'll find any more bugs this way, but it's certainly possible that you'll find a lot more. Given your success in finding bugs without using AddressSanitizer, introducing it may be premature. Piotr also suggested on IRC to run coverage tests with sqlsmith. This could yield valuable hints about which directions to extend sqlsmith's grammar in. Thanks, Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RLS restrictive hook policies
Dean, all, * Stephen Frost (sfr...@snowman.net) wrote: Agreed. I'm happy to commit that change and back-patch it to 9.5, barring objections. Given that the only way to have restrictive policies currently is using hooks, I don't believe there's any documentation update required either; of course, I'll update the comment to reflect this discussion/decision and make any necessary changes to test_rls_hooks. Patch attached to make this change, with appropriate comment updates and fixes to the test_rls_hooks modules. Comments? Thanks! Stephen

From 173db8f942636e92f20145844b7dadd04ccac765 Mon Sep 17 00:00:00 2001
From: Stephen Frost sfr...@snowman.net
Date: Mon, 3 Aug 2015 10:59:24 -0400
Subject: [PATCH] RLS: Keep deny policy when only restrictive exist

Only remove the default deny policy when a permissive policy exists
(either from the hook or defined by the user).  If only restrictive
policies exist then no rows will be visible, as restrictive policies
shouldn't make rows visible.  To address this requirement, a single
USING (true) permissive policy can be created.

Update the test_rls_hooks regression tests to create the necessary
USING (true) permissive policy.

Back-patch to 9.5 where RLS was added.  Per discussion with Dean.
---
 src/backend/rewrite/rowsecurity.c                      | 14 ++++++++++----
 .../modules/test_rls_hooks/expected/test_rls_hooks.out |  7 +++++++
 src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql |  8 ++++++++
 src/test/modules/test_rls_hooks/test_rls_hooks.c       |  5 +++++
 4 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/src/backend/rewrite/rowsecurity.c b/src/backend/rewrite/rowsecurity.c
index 562dbc9..5a81db3 100644
--- a/src/backend/rewrite/rowsecurity.c
+++ b/src/backend/rewrite/rowsecurity.c
@@ -225,12 +225,18 @@ get_row_security_policies(Query *root, CmdType commandType, RangeTblEntry *rte,
 	}
 
 	/*
-	 * If the only built-in policy is the default-deny one, and hook policies
-	 * exist, then use the hook policies only and do not apply the
+	 * If the only built-in policy is the default-deny one, and permissive hook
+	 * policies exist, then use the hook policies only and do not apply the
 	 * default-deny policy. Otherwise, we will apply both sets below.
+	 *
+	 * Note that we do not remove the defaultDeny policy if only *restrictive*
+	 * policies exist as restrictive policies should only ever be reducing what
+	 * is visible. Therefore, at least one permissive policy must exist which
+	 * allows records to be seen before restrictive policies can remove rows
+	 * from that set. A single "true" policy can be created to address this
+	 * requirement, if necessary.
 	 */
-	if (defaultDeny &&
-		(hook_policies_restrictive != NIL || hook_policies_permissive != NIL))
+	if (defaultDeny && hook_policies_permissive != NIL)
 	{
 		rowsec_expr = NULL;
 		rowsec_with_check_expr = NULL;
diff --git a/src/test/modules/test_rls_hooks/expected/test_rls_hooks.out b/src/test/modules/test_rls_hooks/expected/test_rls_hooks.out
index 3a7a4c3..4587eb0 100644
--- a/src/test/modules/test_rls_hooks/expected/test_rls_hooks.out
+++ b/src/test/modules/test_rls_hooks/expected/test_rls_hooks.out
@@ -13,6 +13,11 @@ CREATE TABLE rls_test_restrictive (
     supervisor name,
     data       integer
 );
+-- At least one permissive policy must exist, otherwise
+-- the default deny policy will be applied. For
+-- testing the only-restrictive-policies from the hook,
+-- create a simple 'allow all' policy.
+CREATE POLICY p1 ON rls_test_restrictive USING (true);
 -- initial test data
 INSERT INTO rls_test_restrictive VALUES ('r1','s1',1);
 INSERT INTO rls_test_restrictive VALUES ('r2','s2',2);
@@ -109,6 +114,8 @@ RESET ROLE;
 -- Create internal policies, to check that the policies from
 -- the hooks are combined correctly.
 CREATE POLICY p1 ON rls_test_permissive USING (data % 2 = 0);
+-- Remove the original allow-all policy
+DROP POLICY p1 ON rls_test_restrictive;
 CREATE POLICY p1 ON rls_test_restrictive USING (data % 2 = 0);
 CREATE POLICY p1 ON rls_test_both USING (data % 2 = 0);
 SET ROLE r1;
diff --git a/src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql b/src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql
index ece4ab9..3071213 100644
--- a/src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql
+++ b/src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql
@@ -17,6 +17,12 @@ CREATE TABLE rls_test_restrictive (
     data       integer
 );
 
+-- At least one permissive policy must exist, otherwise
+-- the default deny policy will be applied. For
+-- testing the only-restrictive-policies from the hook,
+-- create a simple 'allow all' policy.
+CREATE POLICY p1 ON rls_test_restrictive USING (true);
+
 -- initial test data
 INSERT INTO rls_test_restrictive VALUES ('r1','s1',1);
 INSERT INTO rls_test_restrictive VALUES ('r2','s2',2);
@@ -101,6 +107,8 @@ RESET ROLE;
 -- the hooks are combined correctly.
 CREATE POLICY
Re: [HACKERS] nodes/*funcs.c inconsistencies
Peter Geoghegan wrote: On Sun, Aug 2, 2015 at 3:07 PM, Peter Geoghegan p...@heroku.com wrote: I'm surprised that this stuff was only ever used for logical decoding infrastructure so far. On second thought, having tried it, one reason is that that breaks things that are considered legitimate for historical reasons. For example, AttrNumber is often used with READ_INT_FIELD(), which is an int16. Whether or not it's worth fixing that by introducing a READ_ATTRNUM_FIELD() (and so on) is not obvious to me. If it allows us to introduce additional checking for new code, I'm all for it. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On Mon, Aug 3, 2015 at 11:52 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Simon Riggs wrote: * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point It is not possible to skip scanning indexes completely, unless no tuples are to be removed from the heap. Otherwise, index tuples become lingering pointers (and when such heap address are later refilled, they become corrupted indexscans). Well, if we skip the index scans, we can't do the second heap pass either, but that's OK. I think we're all talking about the same thing here, which is to do only the first heap pass in some cases. That will prune dead tuples to line pointers, freeze old XIDs, and mark pages all-visible where appropriate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless pgsqlad...@geoff.dj wrote: the new table does not exhibit the same problem (so I'm assuming it's not easily reproducible and giving you a creation script isn't going to help). VACUUM FULL on the table makes no difference. Is there anything you guys can suggest that I can do to help narrow down the problem? Yes. You should enable all of the following settings: debug_print_parse (boolean) debug_print_rewritten (boolean) debug_print_plan (boolean) debug_pretty_print (boolean) which you can do dynamically from psql. Then post the output for both the failing version (on that table where you can reproduce the issue), and the other table where it seems that you ought to be able to reproduce the problem, but you can't. Maybe that'll show where the problem is. Thanks -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
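For anyone following along, the settings Peter lists are ordinary GUCs that can be toggled per-session from psql; a minimal sketch of the capture procedure (the INSERT is the one from Geoff's report, and the printed trees land in the server log, so watch the log output):

    SET debug_print_parse = on;
    SET debug_print_rewritten = on;
    SET debug_print_plan = on;
    SET debug_pretty_print = on;

    -- Re-run the failing statement so its parse/rewrite/plan trees are logged.
    INSERT INTO brokentab(id, k1,k2,k3,k4,k5,k6,k7, smallval)
    VALUES (5,0,0,0,1,0,1,0, 0)
    ON CONFLICT (id, k1,k2,k3,k4,k5,k6,k7)
    DO UPDATE SET smallval = EXCLUDED.smallval;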
Re: [HACKERS] pageinspect patch, for showing tuple data
Nikolay Shaplov wrote: This patch adds several new functions, available from SQL queries. All these functions are based on heap_page_items, but accept slightly different arguments and has one additional column at the result set: heap_page_tuples - accepts relation name, and bulkno, and returns usual heap_page_items set with additional column that contain snapshot of tuple data area stored in bytea. I think the API around get_raw_page is more useful generally. You might have table blocks stored in a bytea column for instance, because you extracted from some remote server and inserted into a local table for examination. If you only accept relname/blkno, there's no way to examine data other than blocks directly in the database dir, which is limiting. There is also one strange function: _heap_page_items it is useless for practical purposes. As heap_page_items it accepts page data as bytea, but also get a relation name. It tries to apply tuple descriptor of that relation to that page data. For BRIN, I added something similar (brin_page_items), but it receives the index OID rather than name, and constructs a tuple descriptor to read the data. I think OID is better than name generally. (You can cast the relation name to regclass). It's easy to misuse, but these functions are superuser-only, so there should be no security issue at least. The possibility of a crash remains real, though, so maybe we should try to come up with a way to harden that. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
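Alvaro's point about the bytea-based API is visible in how the existing pageinspect functions already compose; a minimal illustration (table name invented):

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- get_raw_page() returns bytea, so heap_page_items() can decode a block
    -- obtained from anywhere, not only one read from the local data directory.
    SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
    FROM heap_page_items(get_raw_page('some_table', 0));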
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
Piotr Stefaniak postg...@piotr-stefaniak.me writes: How about this one? 1 ERROR: could not find RelOptInfo for given relids That would be a bug, for sure ... It's triggered on 13bba02271dce865cd20b6f49224889c73fed4e7 by this query and the attached one: ... but I can't reproduce it on HEAD with either of these queries. Not clear why you're getting different results. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On Mon, Aug 3, 2015 at 4:13 AM, Simon Riggs si...@2ndquadrant.com wrote: * For normal VACUUMs we should scan indexes only if (num_dead_tuples * 20) > (blocks to be scanned in any one index), which allows some index bloat but not much. I think this kind of heuristic is good, but I think we should expose a setting for it. There's no way for us to know without testing whether the right value for that multiplier is 2 or 20 or 200 or 2000, and if we don't make it easy to tweak, we'll never find out. It may even be workload-dependent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
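To make the proposed heuristic concrete, here is a rough way to eyeball it from SQL against the stats views, with relpages standing in for "blocks to be scanned" (illustrative only, with an invented table name; the real test would run inside vacuumlazy.c with exact counts):

    SELECT ci.relname AS index_name,
           s.n_dead_tup * 20 > ci.relpages AS would_scan_index
    FROM pg_stat_user_tables s
    JOIN pg_index i ON i.indrelid = s.relid
    JOIN pg_class ci ON ci.oid = i.indexrelid
    WHERE s.relname = 'some_table';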
Re: [HACKERS] nodes/*funcs.c inconsistencies
Peter Geoghegan wrote: Couldn't we adopt AssertVariableIsOfType()/AssertVariableIsOfTypeMacro() to macros like READ_UINT_FIELD()? I'm surprised that this stuff was only ever used for logical decoding infrastructure so far. The reason it's only used there is that Andres is the one who introduced those macros precisely for that code. We've not yet had time to adjust the rest of the code to have more sanity checks. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
Alvaro Herrera alvhe...@2ndquadrant.com writes: Simon Riggs wrote: * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point It is not possible to skip scanning indexes completely, unless no tuples are to be removed from the heap. Right. But actually this is an interesting point and I don't think we do this: if in emergency mode, maybe we shouldn't try to remove any dead tuples at all, and instead only freeze very old tuples. +1 ... not sure if that's what Simon had in mind exactly, but it seems like a correct statement of what he was getting at. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
On Fri, Jul 31, 2015 at 8:09 PM, Josh Berkus j...@agliodbs.com wrote: On 07/31/2015 10:43 AM, Robert Haas wrote: On Thu, Jul 30, 2015 at 9:17 PM, Josh Berkus j...@agliodbs.com wrote: In guc.c, the maximum for wal_buffers is INT_MAX. However, wal_buffers is actually measured in 8KB buffers, not in bytes. This means that users are able to set wal_buffers > 2GB. When the database is started, this can cause a core dump if the WAL offset is > 2GB. Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value. The underlying issue is that byte position in wal_buffers is a 32-bit INT, so as soon as you exceed that, core dump. OK. So capping it sounds like the right approach, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
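A quick back-of-envelope of the limit being discussed, assuming the default XLOG_BLCKSZ of 8192 bytes (standard arithmetic, not from the thread):

$$\frac{2^{31}\ \text{bytes}}{8192\ \text{bytes/buffer}} = 262144\ \text{buffers} = 2\,\text{GB},$$

while guc.c previously allowed up to INT_MAX = 2147483647 buffers, so any setting above 262144 buffers produced byte positions beyond the reach of a signed 32-bit int.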
Re: [HACKERS] nodes/*funcs.c inconsistencies
* Noah Misch (n...@leadboat.com) wrote: On Sun, Aug 02, 2015 at 11:31:16PM -0400, Tom Lane wrote: That being the case, it would probably be a good idea to get them done before alpha2, as there may not be a good opportunity afterwards. Freedom to bump catversion after alpha2 will be barely-distinguishable from freedom to do so now. I have planned to leave my usual comment period of a few days, though skipping that would be rather innocuous in this case. This is clearly necessary, of course, and I wouldn't be surprised if we have another necessary bump post-alpha2, but at the same time, it'd certainly be nice if we are able to put out alpha2 and then a beta1 in the future without needing to do a bump. In other words, +1 from me for going ahead and committing this for alpha2, if possible. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless pgsqlad...@geoff.dj wrote: If I create a copy of the table using CREATE mytab (LIKE brokentab INCLUDING ALL); INSERT INTO mytab SELECT * FROM brokentab; Also, did you drop any columns from the original brokentab table where the bug can be reproduced? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
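A sketch of the kind of reproduction Peter's question is probing for -- the dropped-column theory -- with invented names, not the reporter's schema:

    CREATE TABLE repro (id int PRIMARY KEY, junk int, smallval smallint);
    ALTER TABLE repro DROP COLUMN junk;  -- leaves a hole in the attribute numbering
    INSERT INTO repro (id, smallval) VALUES (1, 0);
    INSERT INTO repro (id, smallval) VALUES (1, 0)
        ON CONFLICT (id) DO UPDATE SET smallval = EXCLUDED.smallval;

If the EXCLUDED pseudo-relation's descriptor were built without accounting for dropped columns, the attribute numbers of everything after the hole would shift, which could produce exactly this kind of "attribute N has wrong type" complaint.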
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Thu, Jul 16, 2015 at 9:58 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jul 16, 2015 at 9:41 AM, Fujii Masao masao.fu...@gmail.com wrote: Here are some minor comments: +ereport(LOG, +(errmsg(ignoring \%s\ file because no \%s\ file exists, +TABLESPACE_MAP, BACKUP_LABEL_FILE), + errdetail(could not rename file \%s\ to \%s\: %m, + TABLESPACE_MAP, TABLESPACE_MAP_OLD))); WARNING is better than LOG here because it indicates a problematic case? No, that's not the right distinction. Remember that, when sending messages to the client, WARNING > LOG, and when sending messages to the log, LOG > WARNING. So messages that a user is more likely to care about than the administrator should be logged at WARNING; those that the administrator is more likely to care about should be LOG. I think LOG is clearly the appropriate thing here. In the detail message, the first word of the sentence needs to be capitalized. + errdetail(renamed file \%s\ to \%s\, + TABLESPACE_MAP, TABLESPACE_MAP_OLD))); In the detail message, basically we should use a complete sentence. So like other similar detail messages in xlog.c, I think that it's better to use \%s\ was renamed to \%s\. as the detail message here. Right, that's what the style guidelines say. I have changed the errdetail messages as per the above suggestion. Also, there was one issue (it was displaying 2 messages when rename fails) in the patch which I have fixed in the updated version. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com rename_mapfile_if_backupfile_not_present_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jul 31, 2015 at 6:33 AM, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote: I'm not hung up on the table-level attribute, but I think having a server-level attribute rather than a global GUC is a good idea. However, I welcome other thoughts on that. The patch supports server level attribute. Let me repeat the relevant description from my earlier mail -- Every FDW needs to register the connection while starting new transaction on a foreign connection (RegisterXactForeignServer()). A foreign server connection is identified by foreign server oid and the local user oid (similar to the entry cached by postgres_fdw). While registering, FDW also tells whether the foreign server is capable of participating in two-phase commit protocol. How to decide that is left entirely to the FDW. An FDW like file_fdw may not have 2PC support at all, so all its foreign servers do not comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW like postgres_fdw can have some of its servers compliant and some not, depending upon server version, configuration (max_prepared_transactions = 0) etc. -- Does that look good? OK, sure. But let's make sure postgres_fdw gets a server-level option to control this. For postgres_fdw it's a boolean server-level option 'twophase_compliant' (suggestions for name welcome). Done, there are three hooks now 1. For preparing a foreign transaction 2. For resolving a prepared foreign transaction 3. For committing/aborting a running foreign transaction (more explanation later) (2) and (3) seem like the same thing. I don't see any further explanation later in your email; what am I missing? In case of postgres_fdw, 2 always fires COMMIT/ROLLBACK PREPARED 'xyz' (fill the prepared transaction id) and 3 always fires COMMIT/ABORT TRANSACTION (notice absence of PREPARED and 'xyz'). Oh, OK. But then isn't #3 something we already have? i.e. pgfdw_xact_callback? While transactions are being prepared on the foreign connections, if any prepare fails, we have to abort transactions on the rest of the connections (and abort the prepared transactions). pgfdw_xact_callback wouldn't know, which connections have prepared transactions and which do not have. So, even in case of two-phase commit we need all the three hooks. Since we have to define these three hooks, we might as well centralize all the transaction processing and let the foreign transaction manager decide which of the hooks to invoke. So, the patch moves most of the code in pgfdw_xact_callback in the relevant hook and foreign transaction manager invokes appropriate hook. Only thing that remains in pgfdw_xact_callback now is end of transaction handling like resetting cursor numbering. That seems totally broken. Before PITR, the database might be inconsistent, in which case you can't call any functions at all. Also, you shouldn't be trying to resolve any transactions until the end of recovery, because you don't know when you see that the transaction was prepared whether, at some subsequent time, you will see it resolved. You need to finish recovery and, only after entering normal running, decide whether to resolve any transactions that are still sitting around. That's how it works in the patch for unresolved prepared foreign transactions belonging to xids within the known range. 
For those belonging to xids in future (beyond of known range of xids after PITR), we can not determine the status of that local transaction (as those do not appear in the xlog) and hence can not decide the fate of prepared foreign transaction. You seem to be suggesting that we should let the recovery finish and mark those prepared foreign transaction as can not be resolved or something like that. A DBA can remove those entries once s/he has dealt with them on the foreign server. There's little problem with that approach. Triplet (xid, serverid, userid) is used to identify the a foreign prepared transaction entry in memory and is used to create unique file name for storing it on the disk. If we allow a future xid after PITR, it might conflict with an xid of a transaction that might take place after PITR. It will cause problem if exactly same foreign server and user participate in the transaction with conflicting xid (rare but possible). Other problem is that the foreign server on which the transaction was prepared (or the user whose mapping was used to prepare the transaction), might have got added in a future time wrt PITR, in which case, we can not even know which foreign server this transaction was prepared on. There should be no situation (short of e.g. OS errors writing the state files) where this stuff makes recovery fail. During PITR, if we encounter a prepared (local) transaction with a future xid, we
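As a concrete shape for the server-level switch discussed above, the patch's 'twophase_compliant' option would presumably be declared like any other postgres_fdw server option. The option name comes from the email itself, but the patch is not committed, so this syntax is illustrative and the names may change:

    CREATE SERVER remote_db FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'remote.example.com', dbname 'appdb',
                 twophase_compliant 'true');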
[HACKERS] RLS restrictive hook policies
In get_row_security_policies(): /* * If the only built-in policy is the default-deny one, and hook policies * exist, then use the hook policies only and do not apply the * default-deny policy. Otherwise, we will apply both sets below. */ if (defaultDeny && (hook_policies_restrictive != NIL || hook_policies_permissive != NIL)) { rowsec_expr = NULL; rowsec_with_check_expr = NULL; } So if the only policies that exist are restrictive policies from an extension, the default-deny policy is not applied and the restrictive policies are applied instead, which is effectively a default-allow situation with each restrictive policy further limiting access to the table. I think this is potentially very dangerous when both kinds of policy exist. Consider for example a situation where initially multiple permissive policies are set up for different users allowing them access to different subsets of the table, and users not covered by any of those permissive policies have no access. Then suppose that later a restrictive policy is added, applying to all users -- the intention being to prevent any user from accessing some particularly sensitive data. The result of adding that restrictive policy would be that all the users who previously had no access at all would suddenly have access to everything not prohibited by the restrictive policy. That goes against everything I would expect from a restrictive policy -- adding more restrictive policies should only ever reduce the number of rows visible, not ever open up more. So I think the test above should only disable the default-deny policy if there are any permissive policies from the extension. Of course that will make life a little harder for people who only want to use restrictive policies, since they will be forced to first add a permissive policy, possibly just one that's always true, but I think it's the safest approach. Possibly if/when we add proper SQL support for defining restrictive policies, we might also add an option to ALTER TABLE ... ENABLE ROW LEVEL SECURITY to allow a table to have RLS enabled in default-allow mode, for users who know they only want to add restrictive policies. However, I think that should be something the user has to explicitly ask for, not an automatic decision based on the existence of restrictive policies. Thoughts? Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
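A minimal SQL sketch of the dangerous sequence Dean describes, with invented names (in 9.5 the restrictive policy itself can only come from the hook, so it appears here as a comment):

    CREATE TABLE t (owner name, sensitive boolean, payload text);
    ALTER TABLE t ENABLE ROW LEVEL SECURITY;

    -- Permissive policies for specific roles; role carol matches neither,
    -- so carol sees nothing (default-deny).
    CREATE POLICY p_alice ON t TO alice USING (owner = 'alice');
    CREATE POLICY p_bob   ON t TO bob   USING (owner = 'bob');

    -- Now an extension hook adds a restrictive policy: USING (NOT sensitive).
    -- Old behavior: the default-deny policy is dropped, so carol suddenly
    -- sees every non-sensitive row.  Proposed behavior: carol still sees
    -- nothing, because no permissive policy ever granted her anything.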
Re: [HACKERS] Reduce ProcArrayLock contention
On Fri, Jul 31, 2015 at 10:11 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Jul 29, 2015 at 11:48 PM, Andres Freund and...@anarazel.de wrote: On 2015-07-29 12:54:59 -0400, Robert Haas wrote: I would try to avoid changing lwlock.c. It's pretty easy when so doing to create mechanisms that work now but make further upgrades to the general lwlock mechanism difficult. I'd like to avoid that. I'm massively doubtful that re-implementing parts of lwlock.c is the better outcome. Then you have two different infrastructures you need to improve over time. I agree and modified the patch to use 32-bit atomics based on the idea suggested by Robert, without modifying lwlock.c. While looking at the patch, I found that the way it was initialising the list to be empty was wrong: it was using pgprocno 0 to mark the list as empty, but 0 is a valid pgprocno. I think we should use a number greater than PROCARRAY_MAXPROC (the maximum number of procs in the proc array). Apart from the above fix, I have modified src/backend/access/transam/README to include the information about the improvement this patch brings to reduce ProcArrayLock contention. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com group_xid_clearing_at_trans_end_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 08/03/2015 07:01 AM, Michael Paquier wrote: On Sun, Aug 2, 2015 at 4:01 AM, Heikki Linnakangas wrote: Perhaps it's best if we copy all the WAL files from source in copy-mode, but not in libpq mode. Regarding old WAL files in the target, it's probably best to always leave them alone. They should do no harm, and as a general principle it's best to avoid destroying evidence. It'd be nice to get some fix for this for alpha2, so I'll commit a fix to do that on Monday, unless we come to a different conclusion before that. +1. Both things sound like a good plan to me. I had some trouble implementing that. Recovery seemed to get confused sometimes, when it didn't find some of the WAL files in pg_xlog directory, even though it could fetch them through streaming replication. I'll have to investigate that further, but in the meantime, to have some fix in place for alpha2, I committed an even simpler fix for the immediate issue that pg_xlog is a symlink: just pretend that pg_xlog is a normal directory, even when it's a symlink. I'll continue to investigate what was wrong with my initial attempt. And it would be nice to avoid copying the pre-allocated WAL files from the source, because it's really unnecessary. But this fixes the immediate problem that pg_rewind didn't work at all if pg_xlog was a symlink. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Autonomous Transaction is back
On Sun, Aug 2, 2015 at 11:37 PM, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 31 July 2015 23:10, Robert Haas Wrote: I think we're going entirely down the wrong path here. Why is it ever useful for a backend's lock requests to conflict with themselves, even with autonomous transactions? That seems like an artifact of somebody else's implementation that we should be happy we don't need to copy. IMHO, since most of the locking are managed at transaction level not backend level and we consider main autonomous transaction to be independent transaction, then practically they may conflict right. It is also right as you said that there is no as such useful use-cases where autonomous transaction conflicts with main (parent) transaction. But we cannot take it for granted as user might make a mistake. So at-least we should have some mechanism to handle this rare case, for which as of now I think throwing error from autonomous transaction as one of the solution. Once error thrown from autonomous transaction, main transaction may continue as it is (or abort main transaction also??). hm. OK, what's the behavior of: BEGIN UPDATE foo SET x = x + 1 WHERE foo_id = 1; BEGIN WITH AUTONOMOUS TRANSACTION UPDATE foo SET x = x + 1 WHERE foo_id = 1; END; RAISE EXCEPTION ...; EXCEPTION ... END; Also, *) What do the other candidate implementations do? IMO, compatibility should be the underlying design principle. *) What will the SQL only feature look like? *) Is the SPI interface going to be extended to expose AT? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jul 17, 2015 at 12:28 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/29/2015 09:44 AM, Michael Paquier wrote: On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote: But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. Well, sure, pg_rewind does not copy the soft links either way. Now it would be nice to have an option to be able to recreate the soft link of at least pg_xlog even if it can be scripted as well after a run. Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenarios, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all WAL it can find the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory in both the source and the target. If pg_xlog is simply ignored, some old WAL files may remain in target server. Don't these old files cause the subsequent startup of target server as new standby to fail? That is, it's the case where the WAL file with the same name but different content exist both in target and source. If that's harmfull, pg_rewind also should remove the files in pg_xlog of target server. This would reduce usability. The rewound node will replay WAL from the previous checkpoint where WAL forked up to the minimum recovery point of source node where pg_rewind has been run. Hence if we remove completely the contents of pg_xlog we'd lose a portion of the logs that need to be replayed until timeline is switched on the rewound node when recovering it (while streaming from the promoted standby, whatever). Even if we remove the WAL files in *target server, we don't lose any files in *source server that we will need to replay later. I don't really see why recycled segments would be a problem, as that's perhaps what you are referring to, but perhaps I am missing something. Please imagine the case where the WAL files with the same name were created in both servers after the fork. Their contents may be different. After pg_rewind is executed successfully, the rewound server (i.e., target server) should retrieve and replay that WAL file from the *source* server. 
But the problem is that the rewound server tries to replay the WAL file from its local pg_xlog, since the file exists locally (even if primary_conninfo is specified). Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On 07/28/2015 10:28 PM, Heikki Linnakangas wrote: On 07/27/2015 01:20 PM, Ildus Kurbangaliev wrote: Hello. In the attached patch I've made a refactoring for tranches. The prefix for them was extended, and I've did a split of LWLockAssign to two functions (one with tranche and second for user defined LWLocks). This needs some work in order to be maintainable: * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync with the list of individual locks in lwlock.h. Sooner or later someone will add an LWLock and forget to update the names-array. That needs to be made less error-prone, so that the names are maintained in the same place as the #defines. Perhaps something like rmgrlist.h. * The base tranches are a bit funny. They all have the same array_base, pointing to MainLWLockArray. If there are e.g. 5 clog buffer locks, I would expect the T_NAME() to return ClogBufferLocks for all of them, and T_ID() to return numbers between 0-4. But in reality, T_ID() will return something like 55-59. Instead of passing a tranche-id to LWLockAssign(), I think it would be more clear to have a new function to allocate a contiguous block of lwlocks as a new tranche. It could then set the base correctly. * Instead of having LWLOCK_INDIVIDUAL_NAMES to name individual locks, how about just giving each one of them a separate tranche? * User manual needs to be updated to explain the new column in pg_stat_activity. - Heikki Hello. Thanks for review. I attached new version of patch. It adds new field in pg_stat_activity that shows current wait in backend. I've did a some refactoring LWLocks tranche mechanism. In lwlock.c only invididual and user defined LWLocks is creating, other LWLocks created by modules who need them. I think that is more logical (user know about module, not module about of all users). It also simplifies LWLocks acquirement. Now each individual LWLock and other groups of LWLocks have their tranche, and can be easily identified. If somebody will add new individual LWLock and forget to add its name, postgres will show a message. Individual LWLocks still allocated in one array but tranches for them point to particular index in main array. Sample: b1=# select pid, wait_event from pg_stat_activity; \watch 1 pid | wait_event --+- 7722 | Storage: READ 7653 | 7723 | Network: WRITE 7725 | Network: READ 7727 | Locks: Transaction 7731 | Storage: READ 7734 | Network: READ 7735 | Storage: READ 7739 | LWLocks: WALInsertLocks 7738 | Locks: Transaction 7740 | LWLocks: BufferMgrLocks 7741 | Network: READ 7742 | Network: READ 7743 | Locks: Transaction -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index e64b7ef..2e4e67e 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -630,6 +630,11 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser entryTrue if this backend is currently waiting on a lock/entry /row row + entrystructfieldwait_event//entry + entrytypetext//entry + entryName of a current wait event in backend/entry +/row +row entrystructfieldstate//entry entrytypetext//entry entryCurrent overall state of this backend. 
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 3a58f1e..10c25cf 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -457,7 +457,8 @@ CLOGShmemInit(void) { ClogCtl-PagePrecedes = CLOGPagePrecedes; SimpleLruInit(ClogCtl, CLOG Ctl, CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, - CLogControlLock, pg_clog); + CLogControlLock, pg_clog, + CLogBufferLocks); } /* diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 5ad35c0..dd7578f 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -466,7 +466,8 @@ CommitTsShmemInit(void) CommitTsCtl-PagePrecedes = CommitTsPagePrecedes; SimpleLruInit(CommitTsCtl, CommitTs Ctl, CommitTsShmemBuffers(), 0, - CommitTsControlLock, pg_commit_ts); + CommitTsControlLock, pg_commit_ts, + CommitTSBufferLocks); commitTsShared = ShmemInitStruct(CommitTs shared, sizeof(CommitTimestampShared), diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 1933a87..b905c59 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1842,10 +1842,12 @@ MultiXactShmemInit(void) SimpleLruInit(MultiXactOffsetCtl, MultiXactOffset Ctl, NUM_MXACTOFFSET_BUFFERS, 0, - MultiXactOffsetControlLock, pg_multixact/offsets); + MultiXactOffsetControlLock, pg_multixact/offsets, + MultiXactOffsetBufferLocks); SimpleLruInit(MultiXactMemberCtl,
Re: [HACKERS] RLS restrictive hook policies
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: In get_row_security_policies(): /* * If the only built-in policy is the default-deny one, and hook policies * exist, then use the hook policies only and do not apply the * default-deny policy. Otherwise, we will apply both sets below. */ if (defaultDeny && (hook_policies_restrictive != NIL || hook_policies_permissive != NIL)) { rowsec_expr = NULL; rowsec_with_check_expr = NULL; } So if the only policies that exist are restrictive policies from an extension, the default-deny policy is not applied and the restrictive policies are applied instead, which is effectively a default-allow situation with each restrictive policy further limiting access to the table. Right, the idea there being that applying the default-deny would render the restrictive policies pointless, since nothing would ever be visible, however.. I think this is potentially very dangerous when both kinds of policy exist. Consider for example a situation where initially multiple permissive policies are set up for different users allowing them access to different subsets of the table, and users not covered by any of those permissive policies have no access. Then suppose that later a restrictive policy is added, applying to all users -- the intention being to prevent any user from accessing some particularly sensitive data. The result of adding that restrictive policy would be that all the users who previously had no access at all would suddenly have access to everything not prohibited by the restrictive policy. That goes against everything I would expect from a restrictive policy -- adding more restrictive policies should only ever reduce the number of rows visible, not ever open up more. So I think the test above should only disable the default-deny policy if there are any permissive policies from the extension. Of course that will make life a little harder for people who only want to use restrictive policies, since they will be forced to first add a permissive policy, possibly just one that's always true, but I think it's the safest approach. Requiring a permissive policy which allows the records first, to avoid the default-deny, makes a ton of sense. Possibly if/when we add proper SQL support for defining restrictive policies, we might also add an option to ALTER TABLE ... ENABLE ROW LEVEL SECURITY to allow a table to have RLS enabled in default-allow mode, for users who know they only want to add restrictive policies. Perhaps... I'm not sure that's really better than simply requiring a 'create policy p1 on t1 using (true);' to be done. However, I think that should be something the user has to explicitly ask for, not an automatic decision based on the existence of restrictive policies. Agreed. I'm happy to commit that change and back-patch it to 9.5, barring objections. Given that the only way to have restrictive policies currently is using hooks, I don't believe there's any documentation update required either; of course, I'll update the comment to reflect this discussion/decision and make any necessary changes to test_rls_hooks. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] RLS restrictive hook policies
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: On 3 August 2015 at 16:09, Stephen Frost sfr...@snowman.net wrote: * Stephen Frost (sfr...@snowman.net) wrote: Agreed. I'm happy to commit that change and back-patch it to 9.5, barring objections. Given that the only way to have restrictive policies currently is using hooks, I don't believe there's any documentation update required either; of course, I'll update the comment to reflect this discussion/decision and make any necessary changes to test_rls_hooks. Patch attached to make this change, with appropriate comment updates and fixes to the test_rls_hooks modules. Comments? Looks good to me. Thanks! Pushed to master and 9.5. Stephen signature.asc Description: Digital signature
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Fri, Jul 31, 2015 at 12:59 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Oh, ok, I was confused on how the heap works. You could still abstract this as in-memory tails of the tapes, but it's more complicated than I thought at first: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. Since you're talking about the case where we must drain all tuples within tuplesort_performsort(), I think you're talking about a distinct idea here (since surely you're not suggesting giving up on my idea of avoiding dumping most tuples, which is a key strength of the patch). That's fine, but I just want to be clear on that. Importantly, my optimization does not care about the fact that the array may have two runs in it, because it quicksorts the array and forgets about it being a heap. It only cares whether or not more than one run made it out to tape, which makes it more widely applicable than it would otherwise be. Also, the fact that much I/O can be avoided is clearly something that can only happen when work_mem is at least ~50% of a work_mem setting that would have resulted in an (entirely) internal sort. You're talking about a new thing here, that happens when it is necessary to dump everything and do a conventional merging of on-tape runs. IOW, we cannot fit a significant fraction of overall tuples in memory, and we need much of the memtuples array for the next run (usually this ends as a TSS_FINALMERGE). That being the case (...right?), I'm confused that you're talking about doing something clever within tuplesort_performsort(). In the case you're targeting, won't the vast majority of tuple dumping (through calls to dumptuples()) occur within puttuple_common()? I think that my optimization should probably retain it's own state.status even if we do this (i.e. TSS_MEMTAPEMERGE should stay). Hmm, I can see another possible optimization here, in the way the heap is managed in TSS_BUILDRUNS state. Instead of keeping a single heap, with tupindex as the leading key, it would be more cache efficient to keep one heap for the currentRun, and an unsorted array of tuples belonging to currentRun + 1. When the heap becomes empty, and currentRun is incemented, quicksort the unsorted array to become the new heap. That's a completely separate idea from your patch, although if you did it that way, you wouldn't need the extra step to divide the large array into two, as you'd maintain that division all the time. This sounds to me like a refinement of your first idea (the idea that I just wrote about). I think the biggest problem with tuplesort after this patch of mine is committed is that it is still too attached to the idea of incrementally spilling and sifting. It makes sense to some degree where it makes my patch possible...if we hang on to the idea of incrementally spilling tuples on to tape in sorted order for a while, then maybe we can hang on for long enough to quicksort most tuples, *and* to avoid actually dumping most tuples (incremental spills make the latter possible). 
But when work_mem is only, say, 10% of the setting required for a fully internal sort, then the incremental spilling and sifting starts to look dubious, at least to me, because the TSS_MEMTAPEMERGE optimization added by my patch could not possibly apply, and dumping and merging many times is inevitable. What I think you're getting at here is that we still have a heap, but we don't use the heap to distinguish between tuples within a run. In other words, HEAPCOMPARE() often/always only cares about run number. We quicksort after a deferred period of time, probably just before dumping everything. Perhaps I've misunderstood, but I don't see much point in quicksorting a run before being sure that you're sorting as opposed to heapifying at that point (you're not clear on what we've decided on once we quicksort). I think it could make sense to make HEAPCOMPARE() not care about tuples within a run that is not currentRun, though. I think that anything that gives up on replacement selection's ability to generate large runs, particularly for already sorted inputs will be too hard a sell (not that I think that's what you proposed). That's way, way less of an advantage than it was in the past (back when external sorts took place using actual magnetic tape, it was a huge), but the fact remains that it is an advantage. And so, I've been prototyping an approach where we don't heapify once it is established that this TSS_MEMTAPEMERGE optimization of mine cannot possibly apply. We quicksort in batches rather
Re: [HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Sat, Aug 1, 2015 at 9:49 AM, Jeremy Harris j...@wizmail.org wrote: On 31/07/15 18:31, Robert Haas wrote: On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris j...@wizmail.org wrote: Heapification is O(n) already, whether siftup (existing) or down. That's not my impression, or what Wikipedia says. Source? Measurements done last year: http://www.postgresql.org/message-id/52f35462.3030...@wizmail.org (spreadsheet attachment) http://www.postgresql.org/message-id/52f40ce9.1070...@wizmail.org (measurement procedure and spreadsheet explanation) I don't think that running benchmarks is the right way to establish the asymptotic runtime of an algorithm. I mean, if you test quicksort, it will run in much less than O(n^2) time on almost any input. But that does not mean that the worst-case run time is anything other than O(n^2). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On 8/3/15 12:04 PM, Simon Riggs wrote: Yes, that's what I was thinking, I just didn't say actually it. I'd been thinking about having VACUUM do just Phase 1 for some time, since its so much faster to do that. Will code. I'd like to see that exposed as an option as well. There are certain situations where you'd really like to just freeze things as fast as possible, without waiting for a full vacuum. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
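As a reference point for what exists today versus what is being asked for here (the second statement is purely hypothetical syntax, not anything implemented):

    -- Existing: force aggressive freezing now, but this still scans
    -- indexes and makes the second heap pass when dead tuples exist.
    VACUUM (FREEZE, VERBOSE) big_table;

    -- Requested: a first-pass-only mode -- prune, freeze, set visibility
    -- bits, and skip the index/second-pass work.  Hypothetical:
    -- VACUUM (FREEZE, PRUNE_ONLY) big_table;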
Re: [HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Sat, Aug 1, 2015 at 9:56 AM, Andres Freund and...@anarazel.de wrote: On 2015-07-31 13:31:54 -0400, Robert Haas wrote: On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris j...@wizmail.org wrote: Heapification is O(n) already, whether siftup (existing) or down. That's not my impression, or what Wikipedia says. Source? Building a binary heap via successive insertions is O(n log n), but building it directly is O(n). Looks like wikipedia agrees too https://en.wikipedia.org/wiki/Binary_heap#Building_a_heap That doesn't really address the sift-up vs. sift-down question. Maybe I'm just confused about the terminology. I think that Wikipedia article is saying that if you iterate from the middle element of an unsorted array towards the beginning, establishing the heap invariant for every item as you reach it, you will take only O(n) time. But that is not what inittapes() does. It instead starts at the beginning of the array and inserts each element one after the other. If this is any different from building the heap via successive insertions, I don't understand how. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
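For reference, the textbook derivation that reconciles the two readings (a standard result, not from the thread): a bottom-up build -- iterating from the middle of the array toward the front, sifting down -- has at most $\lceil n/2^{h+1} \rceil$ nodes at height $h$, each costing $O(h)$ to sift, so the total is

$$\sum_{h=0}^{\lfloor \log_2 n \rfloor} \left\lceil \frac{n}{2^{h+1}} \right\rceil O(h) = O\!\left(n \sum_{h \ge 0} \frac{h}{2^{h+1}}\right) = O(n), \qquad \text{since } \sum_{h \ge 0} \frac{h}{2^h} = 2.$$

Successive insertion, as inittapes() does it, instead costs $\sum_{k=1}^{n} O(\log_2 k) = O(n \log n)$ in the worst case -- consistent with Robert's reading that the $O(n)$ bound applies only to the bottom-up variant.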
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan p...@heroku.com wrote: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. You're talking about a new thing here, that happens when it is necessary to dump everything and do a conventional merging of on-tape runs. IOW, we cannot fit a significant fraction of overall tuples in memory, and we need much of the memtuples array for the next run (usually this ends as a TSS_FINALMERGE). That being the case (...right?), I don't think that's what Heikki is talking about. He can correct me if I'm wrong, but what I think he's saying is that we should try to exploit the fact that we've already determined which in-memory tuples can be part of the current run and which in-memory tuples must become part of the next run. Suppose half the tuples in memory can become part of the current run and the other half must become part of the next run. If we throw all of those tuples into a single bucket and quicksort it, we're losing the benefit of the comparisons we did to figure out which tuples go in which runs. Instead, we could quicksort the current-run tuples and, separately, quick-sort the next-run tuples. Ignore the current-run tuples completely until the tape is empty, and then merge them with any next-run tuples that remain. I'm not sure if there's any reason to believe that would be faster than your approach. In general, sorting is O(n lg n) so sorting two arrays that are each half as large figures to be slightly faster than sorting one big array. But the difference may not amount to much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
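The comparison-count arithmetic behind that last paragraph, for what it's worth:

$$2 \cdot \frac{n}{2} \log_2 \frac{n}{2} = n \log_2 n - n,$$

so two half-size sorts save on the order of $n$ comparisons out of roughly $n \log_2 n$ total -- a shrinking fraction as $n$ grows, which matches the "may not amount to much" caveat.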