Re: [HACKERS] [PATCH] postgres_fdw extension support
On Thu, Jul 16, 2015 at 12:43 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paqu...@gmail.com> writes:
>> On Thu, Jul 16, 2015 at 3:43 AM, Paul Ramsey <pram...@cleverelephant.ca> wrote:
>>> Attached is a patch that implements the extension support discussed at PgCon this year during the FDW unconference session. ...
>> Thinking a bit wider, why is this just limited to extensions? The basic issue here is "how can a user control which functions/operators can be sent for remote execution?".

Yep.

> While it's certainly true that sometimes you might want function-by-function control of that, Paul's point was that extension-level granularity would be extremely convenient for PostGIS, and probably for other extensions. I don't see anything wrong with that --- and I don't think that we should insist that Paul's patch implement both cases. Somebody else who really needs function-by-function control can do the dogwork of figuring out a reasonable API for that.

Well, you could for example pass a JSON string (that's fashionable these days) that sets up a list of authorized objects per category instead, like:

authorized_objects = '{"functions":[foo_oid,foo2_oid], "operators":[ope1_oid,ope2_oid]}'

> Disclaimer 1: Paul and I discussed this back at PGCon, and I encouraged him to send in his patch. Disclaimer 2: I haven't read the patch and don't mean to vouch for any implementation details. But the functional spec of "allow remote execution of functions belonging to named extensions" seems sane to me.

Well, I am just questioning the extensibility of the proposed interface, not saying that this is a bad thing :)
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [DESIGN] Incremental checksums
On Wed, Jul 15, 2015 at 9:13 PM, David Christensen <da...@endpoint.com> wrote:
>> On Jul 15, 2015, at 3:18 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>> - pg_disable_checksums(void) => turn checksums off for a cluster. Sets the state to "disabled", which means bg_worker will not do anything.
>>>> - pg_request_checksum_cycle(void) => if checksums are enabled, increment the data_checksum_cycle counter and set the state to "enabling".
>>> If the cluster is already enabled for checksums, then what is the need for any other action?
>> You are assuming this is a one-way action.

No, I was confused by the state ("enabling") this function will set.

> Requesting an explicit checksum cycle would be desirable in the case where you want to proactively verify there is no cluster corruption to be found.

Sure, but I think that is different from setting the state to "enabling". In your proposal above, in the "enabling" state the cluster needs to write checksums, whereas for such a feature you only need read validation.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hi

On Wed, Jul 15, 2015 at 9:27 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
> On Thu, Jul 16, 2015 at 5:18 AM, Rahila Syed <rahilasye...@gmail.com> wrote:
>> Hello, Please find attached updated patch with an interface to calculate command progress in pgstat.c.
>
> Thanks for updating the patch! I got the following compiler warning.
>
> guc.c:2316: warning: initialization makes pointer from integer without a cast
>
> "make check-world" caused lots of failures in my environment.

Yeah, I got a couple of warnings with a plain make.

> The following query caused a segmentation fault.
>
> SELECT name FROM (SELECT pg_catalog.lower(name) AS name FROM pg_catalog.pg_settings UNION ALL SELECT 'session authorization' UNION ALL SELECT 'all') ss WHERE substring(name,1,3)='tra';

It was a typo, I believe. I see the problem is with the GUC definition in guc.c. There should be a NULL between gettext_noop() and GUC_UNIT_MS.

Regards,
Dinesh
manojadinesh.blogspot.com
Re: [HACKERS] [PATCH] postgres_fdw extension support
Michael Paquier <michael.paqu...@gmail.com> writes:
> On Thu, Jul 16, 2015 at 3:43 AM, Paul Ramsey <pram...@cleverelephant.ca> wrote:
>> Attached is a patch that implements the extension support discussed at PgCon this year during the FDW unconference session. ...
> Thinking a bit wider, why is this just limited to extensions? The basic issue here is "how can a user control which functions/operators can be sent for remote execution?".

While it's certainly true that sometimes you might want function-by-function control of that, Paul's point was that extension-level granularity would be extremely convenient for PostGIS, and probably for other extensions. I don't see anything wrong with that --- and I don't think that we should insist that Paul's patch implement both cases. Somebody else who really needs function-by-function control can do the dogwork of figuring out a reasonable API for that.

Disclaimer 1: Paul and I discussed this back at PGCon, and I encouraged him to send in his patch. Disclaimer 2: I haven't read the patch and don't mean to vouch for any implementation details. But the functional spec of "allow remote execution of functions belonging to named extensions" seems sane to me.

regards, tom lane
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On Thu, Jul 16, 2015 at 5:18 AM, Rahila Syed <rahilasye...@gmail.com> wrote:
> Hello, Please find attached updated patch with an interface to calculate command progress in pgstat.c.

Thanks for updating the patch! I got the following compiler warning.

guc.c:2316: warning: initialization makes pointer from integer without a cast

"make check-world" caused lots of failures in my environment.

The following query caused a segmentation fault.

SELECT name FROM (SELECT pg_catalog.lower(name) AS name FROM pg_catalog.pg_settings UNION ALL SELECT 'session authorization' UNION ALL SELECT 'all') ss WHERE substring(name,1,3)='tra';

Regards,
--
Fujii Masao
Re: [HACKERS] optimizing vacuum truncation scans
On Wed, Jul 15, 2015 at 11:19 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Mon, Jun 29, 2015 at 11:24 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> Attached is a patch that implements the vm scan for truncation. It introduces a variable to hold the last blkno which was skipped during the forward portion. Any blocks after both this blkno and after the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum. Any other process rendering the page nonempty is required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime. So if the bit is still set, the page is still empty without needing to inspect it.
>
> One case where this patch can behave differently than the current code is: before taking the exclusive lock on the relation for truncation, if some backend clears the vm bit and then inserts-deletes-prunes that page, with the patch it will not consider truncating those pages, whereas the current code will allow truncating them, as it re-verifies each page. I think such a case would be rare and we might not want to bother about it, but it is still worth thinking about once.

Thanks for your review. I corrected the code so that, instead of returning the blkno after a visibility map check failure, it just continues to verify the contents, as in the earlier approach.

> Some minor points about the patch:
>
> count_nondeletable_pages()
> {
> ..
>     if (PageIsNew(page) || PageIsEmpty(page))
>     {
>         /* PageIsNew probably shouldn't happen... */
>         UnlockReleaseBuffer(buf);
>         continue;
>     }
> ..
> }
>
> Why is vmbuffer not freed in the above loop?

In the above check, we just continue to the next blkno; at the end of the loop the vmbuffer is freed.

> count_nondeletable_pages()
> {
> ..
> +    /*
> +     * Pages that were inspected and found to be empty by this vacuum run
> +     * must still be empty if the vm bit is still set. Only vacuum sets
> +     * vm bits, and only one vacuum can exist in a table at one time.
> +     */
> +    trust_vm = vacrelstats->nonempty_pages > vacrelstats->skipped_pages ?
> +               vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
> ..
> }
>
> I think it is better to have spaces in the above assignment statement (near '=' and near '>').

Corrected. Attached are the updated patches: 1) without the prefetch logic, and 2) with the combined vm and prefetch logic. I marked the patch as ready for committer.

Regards,
Hari Babu
Fujitsu Australia

vac_trunc_trust_vm_v2.patch Description: Binary data
vac_trunc_trust_vm_and_prefetch_v2.patch Description: Binary data
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Fri, Jul 3, 2015 at 8:45 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Thu, Jul 2, 2015 at 7:44 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
>>> Can we reduce that to a single one? Maybe the first one could be errdetail or something.
>
> I think it is better the other way (basically have the second one as errdetail). We already have one similar message in xlog.c that way.
>
> ereport(LOG,
>         (errmsg("online backup mode canceled"),
>          errdetail("\"%s\" was renamed to \"%s\".",
>                    BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));

Attached patch consolidates the errmsg into one message. This can be tracked either in 9.5 Open Items or in the next CF; any opinions? If nobody else has an opinion on this, I will add it to the 9.5 Open Items list.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

> Your current design completely misses the time taken to scan indexes, which is significant.

I tried to address this issue in the attached patch. The patch calculates index scan progress by measuring against the scanned-pages count in LVRelStats. It checks for a change in the current page being scanned and increments the progress counter. When the counter reaches the scanned-pages number in LVRelStats, progress is 100% complete. For now the progress is emitted as a warning (so no config changes are needed to see progress).

Thoughts?

regards
Sameer

IndexScanProgress.patch http://postgresql.nabble.com/file/n5858109/IndexScanProgress.patch
Re: [HACKERS] Parallel Seq Scan
On Wed, Jul 15, 2015 at 2:14 PM, Antonin Houska <a...@cybertec.at> wrote:
> Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Attached, find the rebased version of patch.
>
> [I haven't read this thread so far, sorry for a possibly redundant comment.]
>
> I noticed that false is passed for the required_outer argument of create_partialseqscan_path(), while NULL seems to be cleaner in terms of the C language. But in terms of semantics, I'm not sure this is correct anyway. Why does create_parallelscan_paths() not accept the actual rel->lateral_relids, just like create_seqscan_path() does? (See set_plain_rel_pathlist().) If there's a reason for your approach, I think it's worth a comment.

Right, I think this is left over from the initial version, where parallel seq scan was supported just for a single table scan. It should probably do the same as create_seqscan_path() and then pass that down to create_partialseqscan_path() and get_baserel_parampathinfo(). Thanks, I will fix this in the next version of the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
[HACKERS] option for psql - short list non template database
Hi

Terribly often I use the pattern:

psql -c "select datname from pg_database where not datistemplate and datallowconn" postgres

What about introducing a new long option that does it?

psql -At --list --names --without-templates

Regards
Pavel
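Until such an option exists, the query can be wrapped in a small shell function (illustrative only: the function name is made up, and actually running it needs a reachable server):

```shell
# List non-template, connectable databases; connects to "postgres"
# unless another maintenance database is given as the first argument.
pgdbs() {
  psql -AXqt -c "SELECT datname FROM pg_database
                 WHERE NOT datistemplate AND datallowconn
                 ORDER BY 1" "${1:-postgres}"
}
```

-A (unaligned), -X (skip psqlrc), -q (quiet), and -t (tuples only) are existing psql switches, so the output is one database name per line, suitable for scripting.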
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On Thu, Jul 16, 2015 at 1:30 PM, Sameer Thakur-2 wrote:
> Thoughts?
>
> IndexScanProgress.patch http://postgresql.nabble.com/file/n5858109/IndexScanProgress.patch

I am not really willing to show up as the picky guy here, but could it be possible to receive those patches as attachments to emails instead of having them referenced by URL? I imagine that you are directly using the nabble interface.
--
Michael
Re: [HACKERS] [PATCH] postgres_fdw extension support
On 2015-07-16 PM 12:43, Tom Lane wrote:
> The basic issue here is "how can a user control which functions/operators can be sent for remote execution?". While it's certainly true that sometimes you might want function-by-function control of that, Paul's point was that extension-level granularity would be extremely convenient for PostGIS, and probably for other extensions.

Perhaps I am just being paranoid, but is the extension version number at all significant? I guess that if a function name is matched locally and declared safe to send, but doesn't really exist on the remote end, or does exist but has different behavior, then that can't be expected to work, or to work correctly. But it seems difficult to incorporate the version number into the chosen approach of matching functions anyway.

Thanks,
Amit
Re: [HACKERS] [PATCH] postgres_fdw extension support
On Thu, Jul 16, 2015 at 3:43 AM, Paul Ramsey <pram...@cleverelephant.ca> wrote:
> Attached is a patch that implements the extension support discussed at PgCon this year during the FDW unconference session. Highlights:
>
> * Pass extension operators and functions to the foreign server
> * Only send ops/funcs if the foreign server is declared to support the relevant extension, for example:
>
> CREATE SERVER foreign_server
>   FOREIGN DATA WRAPPER postgres_fdw
>   OPTIONS (host '127.0.0.1', port '5432', dbname 'my_db', extensions 'cube, seg');
>
> Github branch is here: https://github.com/pramsey/postgres/tree/fdw-extension-suppport
> Synthetic pull request for easy browsing/commenting is here: https://github.com/pramsey/postgres/pull/1

+
 /*
  * Returns true if given expr is safe to evaluate on the foreign server.
  */
@@ -177,7 +185,7 @@ is_foreign_expr(PlannerInfo *root,
 {
 	foreign_glob_cxt glob_cxt;
 	foreign_loc_cxt loc_cxt;
-
+
 	/*
 	 * Check that the expression consists of nodes that are safe to execute
 	 * remotely.
@@ -207,6 +215,8 @@ is_foreign_expr(PlannerInfo *root,
 	return true;
 }
+
+

This patch includes some diff noise; it would be better to remove that.

-    if (!is_builtin(fe->funcid))
+    if (!is_builtin(fe->funcid) && (!is_in_extension(fe->funcid, fpinfo)))

The extra parentheses are not needed.

+    if ( (!is_builtin(oe->opno)) && (!is_in_extension(oe->opno, fpinfo)) )

... And this does not respect the project code format. See here for more details, for example: http://www.postgresql.org/docs/devel/static/source.html

+    /* PostGIS metadata */
+    List    *extensions;
+    bool     use_postgis;
+    Oid      postgis_oid;

This addition in PgFdwRelationInfo is surprising. What is the point of keeping use_postgis and postgis_oid when they are actually used nowhere? (And you could rely on the fact that postgis_oid is InvalidOid to determine whether it is defined, btw.) I have a hard time understanding why this refers to PostGIS in what is a generic postgres_fdw feature.

     appendStringInfo(buf, "::%s",
-                     format_type_with_typemod(node->consttype,
-                                              node->consttypmod));
+                     format_type_be_qualified(node->consttype));

What's the reason for this change?

Thinking a bit wider, why is this just limited to extensions? You may as well have other objects defined locally and remotely, like functions or operators that are not defined in extensions but as normal objects. Hence I think that what you are looking for is not this option, but a boolean option enforcing the behavior of the code paths using is_builtin() in foreign_expr_walker(), such that the sanity checks on existing objects are not limited to FirstBootstrapObjectId but extend to other objects in the catalogs. That's a risky option because it could lead to inconsistencies among objects, so obviously the default is false, but by correctly documenting the risks of using this option we may be able to get something integrated (i.e., be sure that each object is defined consistently across the servers queried, or you'll have surprising results!). In short, it seems to me that you are taking the wrong approach.
--
Michael
Re: [HACKERS] assessing parallel-safety
On Thu, Jul 16, 2015 at 2:02 AM, Robert Haas <robertmh...@gmail.com> wrote:
> exec_stmt_execsql is called by exec_stmt_open and exec_stmt_forc. Those are cursor operations and thus - I think - parallelism can't be used there.

Right, but it also gets called from exec_stmt(), where a parallel-safe statement could be passed to it. So it seems to me that we should enable parallelism for that path in the code.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

> I am not really willing to show up as the picky guy here, but could it be possible to receive those patches as attachments to emails instead of having them referenced by URL? I imagine that you are directly using the nabble interface.

Just configured a new mail client for nabble; I did not know how to use it within an existing conversation. Now I can send the patch attached!

Thanks
Sameer

IndexScanProgress.patch Description: IndexScanProgress.patch
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On 2015-07-16 AM 05:18, Rahila Syed wrote:
> GUC parameter 'pgstat_track_progress' is currently PGC_SUSET in line with the 'track_activities' GUC parameter.

Naming the GUC pgstat_* seems a little inconsistent. It could be called, say, track_maintenance_progress_interval/track_vacuum_progress_interval. That way, it will look similar to the existing track_* parameters:

#track_activities = on
#track_counts = on
#track_io_timing = off
#track_functions = none			# none, pl, all
#track_activity_query_size = 1024	# (change requires restart)

Also, adding the new GUC to src/backend/utils/misc/postgresql.conf.sample might be helpful.

Thanks,
Amit
Re: [HACKERS] security labels on databases are bad for dump restore
On 2015-07-14 13:09:26 -0400, Adam Brightwell wrote:
> All,
>> I won't have time to do anything about this anytime soon, but I think we should fix that at some point. Shall I put this on the todo? Or do we want to create an 'open items' page that's not major version specific?
> I think adding it to the TODO would be great.

Done. It's rather telling that it took me a fair while to find a spot in the todo list where it fits...

> I'd be willing to look/dive into this one further.

Cool. One thing worth mentioning is that arguably the problem is caused by the fact that we're here emitting database-level information in pg_dump, normally only done for dumpall.

Greetings,
Andres Freund
Re: [HACKERS] [DESIGN] Incremental checksums
On 2015-07-15 12:48:40 +0530, Amit Kapila wrote: If during scan of a relation, after doing checksum for half of the blocks in relation, system crashes, then in the above scheme a restart would need to again read all the blocks even though some of the blocks are already checksummed in previous cycle, this is okay if it happens for few small or medium size relations, but assume it happens when multiple large size relations are at same state (half blocks are checksummed) when the crash occurs, then it could lead to much more IO than required. I don't think this is worth worrying about. If you crash frequently enough for this to be a problem you should fix that. Adding complexity for such an uncommon case spreads the cost to many more people. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] security labels on databases are bad for dump restore
Andres Freund wrote:
> One thing worth mentioning is that arguably the problem is caused by the fact that we're here emitting database-level information in pg_dump, normally only done for dumpall.

... the reason for which is probably the lack of CURRENT_DATABASE as a database specifier. It might make sense to add the rest of the database-level information to pg_dump output, when we get that.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] security labels on databases are bad for dump restore
On 2015-07-15 12:04:40 +0300, Alvaro Herrera wrote: Andres Freund wrote: One thing worth mentioning is that arguably the problem is caused by the fact that we're here emitting database level information in pg_dump, normally only done for dumpall. ... the reason for which is probably the lack of CURRENT_DATABASE as a database specifier. It might make sense to add the rest of database-level information to pg_dump output, when we get that. I'm not sure. I mean, it's not that an odd idea to assign a label to a database and then restore data into it, and expect the explicitly assigned label to survive. Not sure if there actually is a good behaviour either way here :/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Wed, Jul 15, 2015 at 3:53 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> pg_replslot has persistent state. We are discussing permanent configuration data, for which I don't see the need to create an additional parallel infrastructure just to store a string, given the stated objection that the string is fairly long. AFAICS it's not even that long. ...
> JSON seems the most sensible format for the string. Inventing a new one doesn't make sense. Most important for me is the ability to programmatically manipulate/edit the config string, which would be harder with a new custom format. ...
> Group labels are essential.

OK, so this is leading us to the following points:
- Use a JSON object to define the quorum/priority groups for the sync state.
- Store it as a GUC, and use the check hook to validate its format, which is what we have now with s_s_names.
- Rely on SIGHUP to maintain an in-memory image of the quorum/priority sync state.
- Have the possibility to define group labels in this JSON blob, and be able to use those labels in a quorum or priority sync definition.
- For backward-compatibility, use for example s_s_names = 'json' to switch to the new system.

Also, as a first step of the implementation, do we actually need a set of functions to manipulate the JSON blob? I mean, we could perhaps have them in contrib/, but they do not seem mandatory as long as we document correctly how to define a label group and a quorum or priority group, no?
--
Michael
Re: [HACKERS] Could be improved point of UPSERT
For most of the cases I mentioned, we don't require a strictly gapless sequence for the invoice ID; the essential requirement is uniqueness. We just hope that there is no obvious gap in most situations. From testing UPSERT, there are quite a few chances to generate a big gap when UPSERTing multiple records. However, the result of UPSERT is acceptable, and I do love this function. So, it's a suggestion only. Anyway, thanks a lot for the detailed explanation.

Regards,
Daojing Zhou.

On Wed, Jul 15, 2015 at 3:23 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Wed, Jul 15, 2015 at 12:01 AM, Yourfriend <doudou...@gmail.com> wrote:
>> for example, SO201507_1001, PO201503_1280, etc. As these IDs would be the most important attribute to the business, we hope there is no gap in the IDs.
>
> That's a requirement I've heard a number of times before. If you're relying on a sequence for this purpose, your application is already broken [1]. UPSERT need not be involved at all.
>
> [1] http://www.varlena.com/GeneralBits/130.php
> --
> Peter Geoghegan
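For readers with the same requirement: when gaplessness genuinely matters, the usual workaround is not a sequence but a transactional counter table, which serializes writers on the counter row. A sketch with assumed table and column names:

```sql
-- One row per ID prefix; UPDATE ... RETURNING hands out the next value.
-- Concurrent transactions block on the row lock, and an aborted
-- transaction rolls the counter back, so no gap is left -- at the cost
-- of serializing inserts per prefix.
CREATE TABLE invoice_counter (
    prefix   text   PRIMARY KEY,   -- e.g. 'SO201507'
    last_val bigint NOT NULL
);

INSERT INTO invoice_counter VALUES ('SO201507', 1000);

BEGIN;
UPDATE invoice_counter
   SET last_val = last_val + 1
 WHERE prefix = 'SO201507'
RETURNING prefix || '_' || last_val AS invoice_id;
-- insert the invoice row using the returned invoice_id, then:
COMMIT;
```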
Re: [HACKERS] Parallel Seq Scan
Amit Kapila <amit.kapil...@gmail.com> wrote:
> Attached, find the rebased version of patch.

[I haven't read this thread so far, sorry for a possibly redundant comment.]

I noticed that false is passed for the required_outer argument of create_partialseqscan_path(), while NULL seems to be cleaner in terms of the C language. But in terms of semantics, I'm not sure this is correct anyway. Why does create_parallelscan_paths() not accept the actual rel->lateral_relids, just like create_seqscan_path() does? (See set_plain_rel_pathlist().) If there's a reason for your approach, I think it's worth a comment.

BTW, emacs shows whitespace on the otherwise empty line parallelpath.c:57.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 7 July 2015 at 07:03, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Jul 7, 2015 at 2:19 PM, Josh Berkus j...@agliodbs.com wrote: On 07/06/2015 09:56 PM, Michael Paquier wrote: On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus j...@agliodbs.com wrote: On 07/06/2015 06:40 PM, Michael Paquier wrote: On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus j...@agliodbs.com wrote: pro-JSON: * standard syntax which is recognizable to sysadmins and devops. * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make additions/deletions from the synch rep config. * can add group labels (see below) If we go this way, I think that managing a JSON blob with a GUC parameter is crazy, this is way longer in character size than a simple formula because of the key names. Hence, this JSON blob should be in a separate place than postgresql.conf not within the catalog tables, manageable using an SQL interface, and reloaded in backends using SIGHUP. I'm not following this at all. What are you saying here? A JSON string is longer in terms of number of characters than a formula because it contains key names, and those key names are usually repeated several times, making it harder to read in a configuration file. So what I am saying that that we do not save it as a GUC, but as a separate metadata that can be accessed with a set of SQL functions to manipulate it. Where, though? Someone already pointed out the issues with storing it in a system catalog, and adding an additional .conf file with a different format is too horrible to contemplate. Something like pg_syncinfo/ coupled with a LW lock, we already do something similar for replication slots with pg_replslot/. -1 to pg_syncinfo/ pg_replslot has persistent state. We are discussing permanent configuration data for which I don't see the need to create an additional parallel infrastructure just to store a string given stated objection that the string is fairly long. AFAICS its not even that long. ... 
JSON seems the most sensible format for the string. Inventing a new one doesn't make sense. Most important for me is the ability to programmatically manipulate/edit the config string, which would be harder with a new custom format. ...

Group labels are essential.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Could be improved point of UPSERT
In my example, I just give each record a different ID to access it efficiently. In our business cases, sometimes we also use a prefix like 'SO' or 'PO', combined with the current year and month and then a sequence, to make an invoice ID, for example, SO201507_1001, PO201503_1280, etc. As these IDs would be the most important attribute to the business, we hope there is no gap in the IDs.

Regards,
Daojing Zhou.

On Wed, Jul 15, 2015 at 2:33 AM, Peter Geoghegan <p...@heroku.com> wrote:
> On Sun, Jul 12, 2015 at 4:09 AM, Yourfriend <doudou...@gmail.com> wrote:
>> Suggestion: When a conflict was found for UPSERT, don't access the sequence, so users can have a reasonable list of IDs.
>
> This is not technically feasible. What if the arbiter index is a serial PK? The same thing can happen when a transaction is aborted. SERIAL is not guaranteed to be gapless.
> --
> Peter Geoghegan
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 29 June 2015 at 18:40, Josh Berkus <j...@agliodbs.com> wrote:
> I'm in favor of a more robust and sophisticated synch rep. But not if nobody not on this mailing list can configure it, and not if even we don't know what it will do in an actual failure situation.

That's the key point. Editing the config after a failure is a Failure of Best Practice in an HA system.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Fri, Jun 26, 2015 at 11:16 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote: Let's start with a complex, fully described use case then work out how to specify what we want. Well, one of the most simple cases where quorum commit and this feature would be useful for is that, with 2 data centers: - on center 1, master A and standby B - on center 2, standby C and standby D With the current synchronous_standby_names, what we can do now is ensuring that one node has acknowledged the commit of master. For example synchronous_standby_names = 'B,C,D'. But you know that :) What this feature would allow use to do is for example being able to ensure that a node on the data center 2 has acknowledged the commit of master, meaning that even if data center 1 completely lost for a reason or another we have at least one node on center 2 that has lost no data at transaction commit. I think the way to address this could be via SQL Syntax as that will make users life easier. Create Replication Setup Master A Sync_Priority_Standby B Sync_Group_Any_Standby C,D Sync_Group_Fixed_Standby 2,E,F,G where Sync_Priority_Standby - means same as current setting in synchronous_standby_names Sync_Group_Any_Standby - means if any one in the group has acknowledged commit master can proceed Sync_Group_Fixed_Standby - means fixed number (that will be first parameter following this option) of standby's from this group should commit before master can proceed. The above syntax is just to explain the idea, but I think we can invent better syntax if required. We can define these as options in syntax like we do in some other syntaxes to avoid creating more keywords. We need to ensure that all these option values needs to be persisted. Now, regarding the way to express that, we need to use a concept of node group for each element of synchronous_standby_names. A group contains a set of elements, each element being a group or a single node. 
And for each group we need to know three things when a commit needs to be acknowledged: - Does my group need to acknowledge the commit? - If yes, how many elements in my group need to acknowledge it? - Does the order of my elements matter? I think with the above kind of syntax we can address all these points, and even if something is missing it is easily extendable. That's where the micro-language idea makes sense to use. The micro-language idea is good, but I think if we can provide some syntax, or SQL functions, then it can be convenient for users to specify the replication topology. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
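The three questions above map naturally onto a small recursive check. Here is a hedged sketch in Python; the group shape (`required`/`members` dicts) is invented purely for illustration, since no concrete syntax has been agreed on:

```python
# Illustrative model of nested sync groups: an element is either a standby
# name (str) or a group dict {"required": k, "members": [...]} meaning at
# least k of its members must acknowledge. This shape is hypothetical, not
# a proposed GUC or on-disk format.
def is_acknowledged(element, acked):
    if isinstance(element, str):          # a single node
        return element in acked
    hits = sum(1 for m in element["members"] if is_acknowledged(m, acked))
    return hits >= element["required"]

# The 2-data-center example: the master needs B plus any one of {C, D}.
spec = {"required": 2,
        "members": ["B", {"required": 1, "members": ["C", "D"]}]}
```

With this shape, order-sensitivity (priority groups) would be one more flag on the group dict; the recursion itself does not change.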
Re: ctidscan as an example of custom-scan (Re: [HACKERS] [v9.5] Custom Plan API)
-Original Message- From: Michael Paquier [mailto:michael.paqu...@gmail.com] Sent: Wednesday, July 15, 2015 3:29 PM To: Kaigai Kouhei(海外 浩平) Cc: Tom Lane; Robert Haas; Alvaro Herrera; hlinnaka; Jim Nasby; Kohei KaiGai; PgHacker; Simon Riggs Subject: Re: ctidscan as an example of custom-scan (Re: [HACKERS] [v9.5] Custom Plan API) On Wed, Jul 15, 2015 at 10:12 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: We never guarantee interface compatibility across major version upgrades. If we add or modify an interface in v9.6, it is the duty of extension developers to follow the new version; that is not specific to custom-scan providers. If v9.6 adds support for fine-grained function cost estimation, I also have to follow the feature, but that is natural. Maintaining compatibility across major versions is best-effort, and even if we sometimes break things across major versions, sometimes even silently (take the example of 9.3 background workers that do not start on 9.4 unless bgw_notify_pid is set to 0), the approach usually taken is to keep APIs stable and convenient enough to cover a maximum set of cases for a given set of plugins, and this serves well in the long term. Now it is true that we cannot assume that the first version of a plugin API will be perfect forever and able to cover all the imagined use cases at first shot; still, I'd like to think that things are broken only when necessary and with good reasons. A set of APIs changed at each major release tends to be proof that research was lacking in the first version, and would surely discourage its adoption by extension developers. Yep, I also don't want to break interface compatibility without a good reason, and I also think that following up new core features is a good reason to enhance the interface in a future version. 
Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Accounting v11
On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote: tuplesort.c does its own accounting, and TBH that seems like the right thing to do here, too. The difficulty is, I think, that some transition functions use an internal data type for the transition state, which might not be a single palloc'd chunk. But since we can't spill those aggregates to disk *anyway*, that doesn't really matter. So would it be acceptable to just ignore the memory consumed by internal, or come up with some heuristic? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: ctidscan as an example of custom-scan (Re: [HACKERS] [v9.5] Custom Plan API)
On Wed, Jul 15, 2015 at 10:12 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: We never guarantee interface compatibility across major version upgrades. If we add or modify an interface in v9.6, it is the duty of extension developers to follow the new version; that is not specific to custom-scan providers. If v9.6 adds support for fine-grained function cost estimation, I also have to follow the feature, but that is natural. Maintaining compatibility across major versions is best-effort, and even if we sometimes break things across major versions, sometimes even silently (take the example of 9.3 background workers that do not start on 9.4 unless bgw_notify_pid is set to 0), the approach usually taken is to keep APIs stable and convenient enough to cover a maximum set of cases for a given set of plugins, and this serves well in the long term. Now it is true that we cannot assume that the first version of a plugin API will be perfect forever and able to cover all the imagined use cases at first shot; still, I'd like to think that things are broken only when necessary and with good reasons. A set of APIs changed at each major release tends to be proof that research was lacking in the first version, and would surely discourage its adoption by extension developers. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 2 July 2015 at 19:50, Josh Berkus j...@agliodbs.com wrote: So there's two parts to this: 1. I need to ensure that data is replicated to X places. 2. I need to *know* which places data was synchronously replicated to when the master goes down. My entire point is that (1) alone is useless unless you also have (2). And do note that I'm talking about information on the replica, not on the master, since in any failure situation we don't have the old master around to check. You might *think* you know, but given we are in this situation because of an unexpected failure, it seems strange to specifically avoid checking before you proceed. Bacon not Aristotle. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] [DESIGN] Incremental checksums
On Tue, Jul 14, 2015 at 1:56 AM, David Christensen da...@endpoint.com wrote: For any relation that it finds in the database which is not checksummed, it starts an actual worker to handle the checksum process for this table. Since the state of the cluster is already either enforcing or revalidating, any block writes will get checksums added automatically, so the only thing the bgworker needs to do is load each block in the relation and explicitly mark it as dirty (unless that's not required for FlushBuffer() to do its thing). After every block in the relation is visited this way and checksummed, its pg_class record will have rellastchecksum updated. If the system crashes partway through a relation scan, after checksumming half of its blocks, then under the above scheme a restart would need to read all the blocks again, even though some were already checksummed in the previous cycle. This is okay if it happens for a few small or medium-size relations, but suppose it happens when multiple large relations are in the same state (half their blocks checksummed) at the time of the crash; then it could lead to much more I/O than required. ** Function API: Interface to the functionality will be via the following Utility functions: - pg_enable_checksums(void) = turn checksums on for a cluster. Will error if the state is anything but disabled. If this is the first time this cluster has run this, this will initialize ControlFile->data_checksum_version to the preferred built-in algorithm (since there's only one currently, we just set it to 1). This increments the ControlFile->data_checksum_cycle variable, then sets the state to enabling, which means that the next time the bgworker checks if there is anything to do it will see that state, scan all the databases' datlastchecksum fields, and start kicking off the bgworker processes to handle the checksumming of the actual relation files. - pg_disable_checksums(void) = turn checksums off for a cluster. 
Sets the state to disabled, which means bg_worker will not do anything. - pg_request_checksum_cycle(void) = if checksums are enabled, increment the data_checksum_cycle counter and set the state to enabling. If the cluster is already enabled for checksums, then what is the need for any other action? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
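The I/O concern about crash restarts could be addressed by persisting a per-relation high-water mark, so a restarted worker resumes mid-relation instead of rescanning from block zero. A minimal sketch under invented names (nothing here reflects the actual patch; `progress` stands in for durable state such as a catalog column):

```python
def checksum_relation(rel, nblocks, progress, flush_every=1000):
    """Checksum blocks of rel, resuming from a persisted high-water mark.
    'progress' is a dict standing in for crash-safe durable state."""
    visited = []
    start = progress.get(rel, 0)          # resume point after a crash
    for blk in range(start, nblocks):
        visited.append(blk)               # stand-in for read + dirty + flush
        if (blk + 1) % flush_every == 0:
            progress[rel] = blk + 1       # persist progress periodically
    progress[rel] = nblocks               # relation fully checksummed
    return visited
```

The trade-off is one extra durable write every `flush_every` blocks against re-reading at most `flush_every` blocks after a crash, instead of the whole relation.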
Re: [HACKERS] Could be improved point of UPSERT
On Wed, Jul 15, 2015 at 12:01 AM, Yourfriend doudou...@gmail.com wrote: for example, SO201507_1001, PO201503_1280, etc. As these IDs are the most important attribute for the business, we hope there are no gaps in the IDs. That's a requirement I've heard a number of times before. If you're relying on a sequence for this purpose, your application is already broken [1]. UPSERT need not be involved at all. [1] http://www.varlena.com/GeneralBits/130.php -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
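The point in the reference boils down to this: sequence increments are non-transactional, so any aborted transaction leaves a permanent hole. A tiny Python illustration of the effect (the class is a made-up stand-in for `nextval()`, not real backend behaviour code):

```python
class FakeSequence:
    """Mimics nextval(): the counter advances even if the calling
    transaction later aborts, because sequence updates are never
    rolled back."""
    def __init__(self):
        self.value = 0

    def nextval(self):
        self.value += 1
        return self.value

seq = FakeSequence()
issued = []
for attempt in range(5):
    n = seq.nextval()
    if attempt == 2:
        continue          # transaction aborted: n is consumed, never used
    issued.append(n)
# issued now has a gap where the aborted transaction's value went
```

A truly gapless scheme needs a serialized counter (e.g. a single-row table updated in the same transaction as the insert), which is exactly the throughput cost the article describes.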
Re: [HACKERS] Memory Accounting v11
On Wed, Jul 15, 2015 at 12:57 PM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote: tuplesort.c does its own accounting, and TBH that seems like the right thing to do here, too. The difficulty is, I think, that some transition functions use an internal data type for the transition state, which might not be a single palloc'd chunk. But since we can't spill those aggregates to disk *anyway*, that doesn't really matter. So would it be acceptable to just ignore the memory consumed by internal, or come up with some heuristic? Regards, Jeff Davis I think a heuristic would be better suited here; ignoring memory consumption for internal types means that we are not making the memory accounting useful for a set of use cases. -- Regards, Atri *l'apprenant*
Re: [HACKERS] [PATCH] Generalized JSON output functions
On 07/15/2015 10:52 AM, Robert Haas wrote: On Mon, Jul 13, 2015 at 10:46 AM, Ryan Pedela rped...@datalanche.com wrote: As far as large numbers in JSON, I think Postgres is doing the right thing and should not be changed. It is Javascript that is stupid here, and I don't think it is wise to add something to core just because one client does stupid things with large numbers. In addition, ES7 is introducing value types which will hopefully solve the large number problem in Javascript. FWIW, I don't agree. If it's not easy to read the JSON that PostgreSQL generates using JavaScript, then a lot of people are just going to give up on doing it, and IMO that would be sad. Telling people that they have to parse the JSON using some parser other than the one built into their JavaScript engine, whack it around, and then render it as text and parse it again is not really an acceptable answer. The reason why the logical decoding stuff allows multiple output formats is because Andres, quite correctly, foresaw that different people would need different output formats. He could have designed that system to output only one output format and just said, everybody's got to read and parse this, but that would have been slow. Instead, he tried to set things up so that you could get the output in the format that was most convenient for your client, whatever that is. On this thread, we're back-pedaling from that idea: sorry, you can get JSON output, but if you want JSON output that will be properly interpreted by your JSON parser, you can't have that. Regardless of the details of this particular patch, I can't endorse that approach. If we want people to use our software, we need to meet them where they are at, especially when we are only (IIUC) talking about inserting a few extra quotation marks. The question for me is where is the best place to transform the data. The approach taken was both invasive and broken. 
The approach I suggested, reparsing and transforming it in the logical decoder, would be both fairly simple and completely noninvasive. If someone gives me a test for what is an acceptable number for JS processors, I bet I could write a transformation function in an hour or two, and in a hundred lines or so. I admit that I probably have more experience doing this than anyone else, but the parser API was designed to be fairly simple, and I believe it is. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
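For what it's worth, the transformation being described here is small: walk the parsed document and re-emit any integer outside the double-safe range as a string. A hedged Python sketch of the idea (the real thing would use the C parser API as Andrew says, and the 2^53 cutoff is an assumption about what counts as "acceptable for JS processors"):

```python
import json

JS_MAX_SAFE = 2**53 - 1   # largest integer an IEEE-754 double holds exactly

def quote_big_numbers(node):
    """Recursively convert integers a JavaScript double cannot represent
    exactly into strings, leaving everything else untouched."""
    if isinstance(node, dict):
        return {k: quote_big_numbers(v) for k, v in node.items()}
    if isinstance(node, list):
        return [quote_big_numbers(v) for v in node]
    if isinstance(node, int) and not isinstance(node, bool) \
            and abs(node) > JS_MAX_SAFE:
        return str(node)
    return node
```

The whole transform is a parse, one recursive pass, and a re-serialize, which matches the "an hour or two, a hundred lines or so" estimate.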
Re: [HACKERS] [PATCH] Generalized JSON output functions
On Wed, Jul 15, 2015 at 11:10 AM, Ryan Pedela rped...@datalanche.com wrote: On Wed, Jul 15, 2015 at 8:52 AM, Robert Haas robertmh...@gmail.com wrote: FWIW, I don't agree. If it's not easy to read the JSON that PostgreSQL generates using JavaScript, then a lot of people are just going to give up on doing it, and IMO that would be sad. Telling people that they have to parse the JSON using some parser other than the one built into their JavaScript engine, whack it around, and then render it as text and parse it again is not really an acceptable answer. The vast majority of Javascript users are going to be using Node.js when they connect to Postgres if only for security reasons. If they use Node, they will be using node-postgres [1] or something that builds on top of it. For int64 and numerics in a row, the default is to return a string, and there is a flag you can set to round returned numbers if you prefer. There is also a way to override the default parsing of each Postgres type [2]. So in the case of JSON using my json-bignum module [3], the code looks like this: var pgTypes = require('pg').types; var bignumJSON = require('json-bignum'); pgTypes.setTypeParser(JSON_TYPE_OID, function (value) { return bignumJSON.parse(value); }); pgTypes.setTypeParser(JSONB_TYPE_OID, function (value) { return bignumJSON.parse(value); }); To me that code is super simple, and not a pain in the ass. In other words, it is not Telling people that they have to parse the JSON using some parser other than the one built into their JavaScript engine, whack it around, and then render it as text and parse it again. Like I said previously, the situation with Javascript will hopefully be remedied in a few years with ES7 anyway. 1. https://github.com/brianc/node-postgres 2. https://github.com/brianc/node-pg-types 3. 
https://github.com/datalanche/json-bignum On Wed, Jul 15, 2015 at 8:52 AM, Robert Haas robertmh...@gmail.com wrote: The reason why the logical decoding stuff allows multiple output formats is because Andres, quite correctly, foresaw that different people would need different output formats. He could have designed that system to output only one output format and just said, everybody's got to read and parse this, but that would have been slow. Instead, he tried to set things up so that you could get the output in the format that was most convenient for your client, whatever that is. On this thread, we're back-pedaling from that idea: sorry, you can get JSON output, but if you want JSON output that will be properly interpreted by your JSON parser, you can't have that. Regardless of the details of this particular patch, I can't endorse that approach. If we want people to use our software, we need to meet them where they are at, especially when we are only (IIUC) talking about inserting a few extra quotation marks. I would be okay with a generic way to specify output formats if there are many use cases beyond Javascript and JSON. I vaguely remember someone suggesting a FORMAT clause on CREATE TABLE which would specify how a particular column would output from a SELECT. For example, returning a date with a non-ISO format. I liked that idea. However if the only reason for different output formats is Javascript, that is silly. I have a very long list of feature requests that would probably only be beneficial to me or a handful of users. Should we implement them? No, of course not! If we did that Postgres would cease to be the best open-source database. You can't have the best product and say yes to everything. Feature creep is the enemy of quality. If Javascript is the sole reason for supporting multiple output formats, then that is the definition of feature creep in my opinion. 
If there are many use cases beyond Javascript and JSON, then that is different and a conversation worth having. Bottom line: Large numbers are a pain to deal with in Javascript regardless of where they come from or what format they are in. Adding code to Postgres core will never change that.
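Ryan's point translates directly to other client stacks: most JSON parsers let you override number handling. In Python, for instance, the standard json module's hooks play the role json-bignum plays for Node (`parse_float` is a real `json.loads` parameter; the document values below are made up):

```python
import json
from decimal import Decimal

raw = '{"id": 9007199254740993, "amount": 0.12345678901234567890123}'

# Default float parsing would round 'amount' to a binary double;
# routing float tokens through Decimal keeps every digit. Python ints
# are arbitrary precision, so 'id' survives either way.
lossless = json.loads(raw, parse_float=Decimal)

# What a double-based parser (e.g. a stock JS engine) does to the id:
rounded_id = int(float(lossless["id"]))
```

So the data PostgreSQL emits is fine; the lossiness, where it exists, lives in the client's number representation and is fixable there.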
[HACKERS] Retrieve the snapshot's LSN
Hello everyone, I need to start a repeatable read transaction and retrieve the corresponding LSN. I'm looking for pointers or ideas on how to achieve this. Andres F. suggested that I extend pg_export_snapshot() [1] and call GetLatestSnapshot() [2] while reliably retrieving the current LSN. Should I call GetXLogWriteRecPtr() [3] for that? What lock(s) could I take to synchronize the two calls? Any other idea? A snapshot is exported when creating a logical replication slot [4] and the corresponding LSN is also returned [5]. This is what I need, except that I'd prefer not to create a replication slot each time I need the snapshot. During slot creation, the snapshot building and exporting code seems highly coupled with the logical decoding stuff. It doesn't seem very reusable for retrieving the snapshot's LSN outside of logical decoding. Thank you for your help, References: [1] pg_export_snapshot() https://github.com/postgres/postgres/blob/aa9eac45ea868e6ddabc4eb076d18be10ce84c6a/src/backend/utils/time/snapmgr.c#L [2] GetLatestSnapshot() https://github.com/postgres/postgres/blob/aa9eac45ea868e6ddabc4eb076d18be10ce84c6a/src/backend/utils/time/snapmgr.c#L259 [3] GetXLogWriteRecPtr() https://github.com/postgres/postgres/blob/7b156c1e0746a46d083d7dbcd28afb303b3484ef/src/backend/access/transam/xlog.c#L10616 [4] Exported snapshot in logical replication slot creation https://github.com/postgres/postgres/blob/aa9eac45ea868e6ddabc4eb076d18be10ce84c6a/src/backend/replication/walsender.c#L815 /* build initial snapshot, might take a while */ DecodingContextFindStartpoint(ctx); /* * Export a plain (not of the snapbuild.c type) snapshot to the user * that can be imported into another session. 
*/ snapshot_name = SnapBuildExportSnapshot(ctx->snapshot_builder); [5] Consistent point LSN in logical replication slot creation: https://github.com/postgres/postgres/blob/aa9eac45ea868e6ddabc4eb076d18be10ce84c6a/src/backend/replication/walsender.c#L831 snprintf(xpos, sizeof(xpos), "%X/%X", (uint32) (MyReplicationSlot->data.confirmed_flush >> 32), (uint32) MyReplicationSlot->data.confirmed_flush); ...cut... /* second field: LSN at which we became consistent */ pq_sendstring(&buf, "consistent_point"); /* col name */ ...cut /* consistent wal location */ pq_sendint(&buf, strlen(xpos), 4); /* col2 len */ pq_sendbytes(&buf, xpos, strlen(xpos)); -- Florent -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
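As an aside, the %X/%X form in that snippet just splits the 64-bit LSN into its high and low 32-bit halves in hex. A small sketch of the same formatting outside C, useful when comparing LSNs from pg_stat_replication or slot output:

```python
def format_lsn(lsn):
    """Render a 64-bit LSN the way PostgreSQL prints it: high32/low32, hex."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)

def parse_lsn(text):
    """Inverse: turn an 'X/Y' string back into the 64-bit integer,
    so LSNs can be compared numerically."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)
```

Comparing LSNs as the underlying integers avoids the trap of comparing the text forms lexically.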
[HACKERS] [PATCH] postgres_fdw extension support
Hi all, Attached is a patch that implements the extension support discussed at PgCon this year during the FDW unconference session. Highlights: * Pass extension operators and functions to the foreign server * Only send ops/funcs if the foreign server is declared to support the relevant extension, for example: CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', port '5432', dbname 'my_db', extensions 'cube, seg'); Github branch is here: https://github.com/pramsey/postgres/tree/fdw-extension-suppport Synthetic pull request for easy browsing/commenting is here: https://github.com/pramsey/postgres/pull/1 Thanks! Paul diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index 81cb2b4..bbe3c9d 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -34,11 +34,15 @@ #include "postgres_fdw.h" +#include "access/genam.h" #include "access/heapam.h" #include "access/htup_details.h" #include "access/sysattr.h" #include "access/transam.h" +#include "catalog/dependency.h" +#include "catalog/indexing.h" #include "catalog/pg_collation.h" +#include "catalog/pg_depend.h" #include "catalog/pg_namespace.h" #include "catalog/pg_operator.h" #include "catalog/pg_proc.h" @@ -49,8 +53,10 @@ #include "optimizer/var.h" #include "parser/parsetree.h" #include "utils/builtins.h" +#include "utils/fmgroids.h" #include "utils/lsyscache.h" #include "utils/rel.h" +#include "utils/snapmgr.h" #include "utils/syscache.h" @@ -136,6 +142,7 @@ static void printRemoteParam(int paramindex, Oid paramtype, int32 paramtypmod, deparse_expr_cxt *context); static void printRemotePlaceholder(Oid paramtype, int32 paramtypmod, deparse_expr_cxt *context); +static bool is_in_extension(Oid procid, PgFdwRelationInfo *fpinfo); /* @@ -167,6 +174,7 @@ classifyConditions(PlannerInfo *root, } } + /* * Returns true if given expr is safe to evaluate on the foreign server. 
*/ @@ -177,7 +185,7 @@ is_foreign_expr(PlannerInfo *root, { foreign_glob_cxt glob_cxt; foreign_loc_cxt loc_cxt; - + /* * Check that the expression consists of nodes that are safe to execute * remotely. @@ -207,6 +215,8 @@ is_foreign_expr(PlannerInfo *root, return true; } + + /* * Check if expression is safe to execute remotely, and return true if so. * @@ -229,6 +239,9 @@ foreign_expr_walker(Node *node, Oid collation; FDWCollateState state; + /* Access extension metadata from fpinfo on baserel */ + PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *)(glob_cxt->foreignrel->fdw_private); + /* Need do nothing for empty subexpressions */ if (node == NULL) return true; @@ -361,7 +374,7 @@ foreign_expr_walker(Node *node, * can't be sent to remote because it might have incompatible * semantics on remote side. */ - if (!is_builtin(fe->funcid)) + if (!is_builtin(fe->funcid) && (!is_in_extension(fe->funcid, fpinfo))) return false; /* @@ -407,7 +420,7 @@ foreign_expr_walker(Node *node, * (If the operator is, surely its underlying function is * too.) */ - if (!is_builtin(oe->opno)) + if ( (!is_builtin(oe->opno)) && (!is_in_extension(oe->opno, fpinfo)) ) return false; /* @@ -445,7 +458,7 @@ foreign_expr_walker(Node *node, /* * Again, only built-in operators can be sent to remote. */ - if (!is_builtin(oe->opno)) + if (!is_builtin(oe->opno) && (!is_in_extension(oe->opno, fpinfo))) return false; /* @@ -591,7 +604,7 @@ foreign_expr_walker(Node *node, * If result type of given expression is not built-in, it can't be sent to * remote because it might have incompatible semantics on remote side. */ - if (check_type && !is_builtin(exprType(node))) + if (check_type && !is_builtin(exprType(node)) && (!is_in_extension(exprType(node), fpinfo)) ) return false; /* @@ -643,6 +656,8 @@ foreign_expr_walker(Node *node, return true; } + + /* * Return true if given object is one of PostgreSQL's built-in objects. 
* @@ -669,6 +684,67 @@ is_builtin(Oid oid) /* + * Returns true if given operator/function is part of an extension declared in the
Re: [HACKERS] Memory Accounting v11
On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote: I think a heuristic would be more suited here and ignoring memory consumption for internal types means that we are not making the memory accounting useful for a set of usecases. OK, I will drop this patch and proceed with the HashAgg patch, with a heuristic for internal types. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Implementation of global temporary tables?
On 07/15/2015 07:58 AM, Simon Riggs wrote: For me the design summary is this * CREATE GLOBAL TEMP TABLE creates catalog entries like a normal table but with different relkind * When we see a request to INSERT, DEL, UPD, SEL from the temp table, if it does not exist we create it as a TEMP table of the same name, using the Global's pg_class entry as a template That meets the SQL Standard and doesn't contain any visibility problems or need for new internals. The purpose of this feature is to automatically create a temp table with the same definition whenever needed. The discussion of bloat is just wrong. We create exactly the same amount of bloat as if we had typed CREATE TEMP TABLE. Optimising temp table entries in the catalog is another, separate patch, if we care. Sounds fine in general. I'm a bit curious to know what are the locking implications of vivifying the table on access. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
Simon Riggs wrote: JSON seems the most sensible format for the string. Inventing a new one doesn't make sense. Most important for me is the ability to programmatically manipulate/edit the config string, which would be harder with a new custom format. Do we need to keep the value consistent across all the servers in the flock? If not, is the behavior halfway sane upon failover? If we need the DBA to keep the value in sync manually, that's going to be a recipe for trouble. Which is going to bite particularly hard during those stressing moments when disaster strikes and things have to be done in emergency mode. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Implementation of global temporary tables?
On 14 July 2015 at 23:20, Jim Nasby jim.na...@bluetreble.com wrote: On 7/9/15 12:45 AM, Pavel Stehule wrote: 2015-07-09 7:32 GMT+02:00 Zhaomo Yang zhy...@cs.ucsd.edu mailto:zhy...@cs.ucsd.edu: I am not sure, if it is not useless work. I don't understand why an implementation taking approach 2.a would be useless. As I said, its performance will be no worse than current temp tables and it will provide a lot of convenience to users who need to create temp tables in every session. Surely it should be a step forward. But you will have to solve a lot of problems with duplicated tables in the system catalogue, and still it doesn't solve the main problem with temporary tables - the bloating of the catalogue - and the related performance degradation. That being the main problem is strictly a matter of opinion based on your experience. Many people don't have a performance problem today, but do have to deal with all the pain of handling this manually (as well as all the limitations that go with that). If it's easy to fix the bloat problem at the same time as adding GLOBAL TEMP then great! But there's no reason to reject this just because it doesn't fix that issue. Agreed There are some good arguments for why we need this feature. Pavel's original description of how to do this seems valid, and from the link Tom agreed in 2009. For me the design summary is this * CREATE GLOBAL TEMP TABLE creates catalog entries like a normal table but with different relkind * When we see a request to INSERT, DEL, UPD, SEL from the temp table, if it does not exist we create it as a TEMP table of the same name, using the Global's pg_class entry as a template That meets the SQL Standard and doesn't contain any visibility problems or need for new internals. The purpose of this feature is to automatically create a temp table with the same definition whenever needed. The discussion of bloat is just wrong. We create exactly the same amount of bloat as if we had typed CREATE TEMP TABLE. 
Optimising temp table entries in the catalog is another, separate patch, if we care. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
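The two bullets in Simon's summary amount to a template-plus-lazy-instantiation scheme. A toy model in Python (all names are invented and catalog entries are just dicts here; this is only meant to show the vivify-on-first-access rule, not backend code):

```python
# The GLOBAL TEMP catalog entry is only a template; each session gets its
# own private instance the first time it touches the table by name.
GLOBAL_TEMPLATES = {"work_queue": {"columns": ["id int", "payload text"]}}

def resolve_relation(session_tables, name):
    if name in session_tables:            # already vivified this session
        return session_tables[name]
    if name in GLOBAL_TEMPLATES:          # vivify from the shared template
        inst = dict(GLOBAL_TEMPLATES[name], session_local=True)
        session_tables[name] = inst
        return inst
    raise LookupError("relation %r does not exist" % name)
```

Each session's copy is independent, which is the SQL-standard GLOBAL TEMPORARY behaviour; the locking question Andrew raises is about what happens in the real catalog at that first-touch moment.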
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 15 July 2015 at 10:03, Michael Paquier michael.paqu...@gmail.com wrote: OK, so this is leading us to the following points: - Use a JSON object to define the quorum/priority groups for the sync state. - Store it as a GUC, and use the check hook to validate its format, which is what we have now with s_s_names - Rely on SIGHUP to maintain an in-memory image of the quorum/priority sync state - Have the possibility to define group labels in this JSON blob, and be able to use those labels in a quorum or priority sync definition. +1 - For backward-compatibility, use for example s_s_names = 'json' to switch to the new system. Seems easy enough to check to see if it has a leading { and then treat it as an attempt to use JSON (which may fail), otherwise use the old syntax. Also, as a first step of the implementation, do we actually need a set of functions to manipulate the JSON blob? I mean, we could perhaps have them in contrib/ but they do not seem mandatory as long as we document correctly how to define a label group and a quorum or priority group, no? Agreed, no specific functions needed to manipulate this field. If we lack the means to manipulate JSON in SQL that can be solved outside of the scope of this patch, because it's just JSON. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
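The leading-brace rule Simon suggests is trivial to implement in the GUC check hook. A Python sketch of the dispatch (the JSON keys shown in the test are placeholders, since the blob's schema is still under discussion):

```python
import json

def parse_s_s_names(value):
    """Dispatch on a leading '{': the new JSON quorum syntax versus the
    old comma-separated priority list. A malformed JSON value raises,
    which is what a GUC check hook would turn into a validation error."""
    stripped = value.lstrip()
    if stripped.startswith("{"):
        return ("json", json.loads(stripped))
    return ("legacy", [n.strip() for n in value.split(",") if n.strip()])
```

Since no legal entry in the old comma-list syntax starts with '{', the two formats never collide, which keeps existing configurations working untouched.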
Re: [HACKERS] Implementation of global temporary tables?
2015-07-15 13:58 GMT+02:00 Simon Riggs si...@2ndquadrant.com: On 14 July 2015 at 23:20, Jim Nasby jim.na...@bluetreble.com wrote: On 7/9/15 12:45 AM, Pavel Stehule wrote: 2015-07-09 7:32 GMT+02:00 Zhaomo Yang zhy...@cs.ucsd.edu mailto:zhy...@cs.ucsd.edu: I am not sure, if it is not useless work. I don't understand why an implementation taking approach 2.a would be useless. As I said, its performance will be no worse than current temp tables and it will provide a lot of convenience to users who need to create temp tables in every session. Surely it should be step forward. But you will to have to solve lot of problems with duplicated tables in system catalogue, and still it doesn't solve the main problem with temporary tables - the bloating catalogue - and related performance degradation. That being the main problem is strictly a matter of opinion based on your experience. Many people don't have a performance problem today, but do have to deal with all the pain of handling this manually (as well as all the limitations that go with that). If it's easy to fix the bloat problem at the same time as adding GLOBAL TEMP then great! But there's no reason to reject this just because it doesn't fix that issue. Agreed There are some good arguments for why we need this feature. Pavel's original description of how to do this seem valid, and from the link Tom agreed in 2009. For me the design summary is this * CREATE GLOBAL TEMP TABLE creates catalog entries like a normal table but with different relkind * When we see a request to INSERT, DEL, UPD, SEL from the temp table, if it does not exist we create it as a TEMP table of the same name, using the Global's pg_class entry as a template That meets the SQL Standard and doesn't contain any visibility problems or need for new internals. The purpose of this feature is to automatically create a temp table with the same definition whenever needed. The discussion of bloat is just wrong. 
We create exactly the same amount of bloat as if we had typed CREATE TEMP TABLE. Optimising temp table entries in the catalog is another, separate patch, if we care. The optimization of local temp tables is a little bit harder - you cannot share pg_class and pg_attribute - although some memory entries can be used too. Regards Pavel -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Memory Accounting v11
On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Firstly, do we really have good benchmarks and measurements? I really doubt that. We do have some numbers for REINDEX, where we observed 0.5-1% regression on noisy results from a Power machine (and we've been unable to reproduce that on x86). I don't think that's a representative benchmark, and I'd like to see more thorough measurements. And I agreed to do this, once Jeff comes up with a new version of the patch. Secondly, the question is whether the performance is impacted more by the additional instructions, or by other things - say, random padding, as was explained by Andrew Gierth in [1]. I don't know whether that's happening in this patch, but if it is, it seems rather strange to use this against this patch and not the others (because there surely will be other patches causing similar issues). It matters, at least to me, an awful lot where we're adding lines of code. If you want to add modest amounts of additional code to CREATE TABLE or CHECKPOINT or something like that, I really don't care, because that stuff doesn't execute frequently enough to really matter to performance even if we add a significant chunk of overhead, and we're doing other expensive stuff at the same time, like fsync. The same can't be said about functions like LWLockAcquire and AllocSetAlloc that routinely show up at the top of CPU profiles. I agree that there is room to question whether the benchmarks I did are sufficient reason to think that the abstract concern that putting more code into a function might make it slower translates into a measurable drop in performance in practice. But I think when it comes to these very hot code paths, extreme conservatism is warranted. We can agree to disagree about that. tuplesort.c does its own accounting, and TBH that seems like the right thing to do here, too. 
The difficulty is, I think, that some transition functions use an internal data type for the transition state, which might not be a single palloc'd chunk. But since we can't spill those aggregates to disk *anyway*, that doesn't really matter. If the transition is a varlena or a fixed-length type, we can know how much space it's consuming without hooking into the memory context framework. I respectfully disagree. Our current inability to dump/load the state has little to do with how we measure consumed memory, IMHO. It's true that we do have two kinds of aggregates, depending on the nature of the aggregate state: (a) fixed-size state (data types passed by value, variable length types that do not grow once allocated, ...) (b) continuously growing state (as in string_agg/array_agg) Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a solution for dump/load of the aggregate stats - which we need to implement anyway for parallel aggregate. I know there was a proposal to force all aggregates to use regular data types as aggregate stats, but I can't see how that could work without a significant performance penalty. For example array_agg() is using internal to pass ArrayBuildState - how do you turn that to a regular data type without effectively serializing/deserializing the whole array on every transition? That is a good point. Tom suggested that his new expanded-format stuff might be able to be adapted to the purpose, but I have no idea how possible that really is. And even if we come up with a solution for array_agg, do we really believe it's possible to do for all custom aggregates? Maybe I'm missing something but I doubt that. ISTM designing ephemeral data structure allows tweaks that are impossible otherwise. What might be possible is extending the aggregate API with another custom function returning size of the aggregate state. So when defining an aggregate using 'internal' for aggregate state, you'd specify transfunc, finalfunc and sizefunc. 
That seems doable, I guess. And infunc and outfunc. If we don't use the expanded-format stuff for aggregates with a type-internal transition state, we won't be able to use input and output functions to serialize and deserialize them, either. I find the memory accounting as a way more elegant solution, though. I think we're just going to have to agree to disagree on that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
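The transfunc/finalfunc/sizefunc idea can be sketched as follows, using array_agg-style state as the example. All names here are illustrative; a real sizefunc would be a C support function reporting the bytes held by the opaque internal transition state:

```python
import sys

class ArrayBuildState:
    """Stand-in for array_agg's 'internal' transition state."""
    def __init__(self):
        self.elements = []

def transfunc(state, value):
    if state is None:
        state = ArrayBuildState()
    state.elements.append(value)
    return state

def finalfunc(state):
    return state.elements if state else []

def sizefunc(state):
    # The proposed extra aggregate support function: report the memory
    # consumed by the internal state so the executor can decide when a
    # hashed aggregate must spill, without hooking into the
    # memory-context machinery.
    if state is None:
        return 0
    return sys.getsizeof(state.elements) + \
        sum(sys.getsizeof(e) for e in state.elements)

state = None
for v in (1, 2, 3):
    state = transfunc(state, v)
```

With infunc/outfunc added alongside, the same state could also be serialized for spilling or parallel aggregation, which is the gap discussed above.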
Re: [HACKERS] Memory Accounting v11
On Wed, Jul 15, 2015 at 3:27 AM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote: tuplesort.c does its own accounting, and TBH that seems like the right thing to do here, too. The difficulty is, I think, that some transition functions use an internal data type for the transition state, which might not be a single palloc'd chunk. But since we can't spill those aggregates to disk *anyway*, that doesn't really matter. So would it be acceptable to just ignore the memory consumed by internal, or come up with some heuristic? Either one would be OK with me. I'd probably ignore that for the first version of the patch and just let the hash table grow without bound in that case. Then in a later patch I'd introduce additional infrastructure to deal with that part of the problem. But if something else works out well while coding it up, I'd be OK with that, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace
On Tue, Jul 14, 2015 at 5:57 PM, Jim Nasby jim.na...@bluetreble.com wrote: On 7/7/15 7:07 AM, Andres Freund wrote: On 2015-07-03 18:03:58 -0400, Tom Lane wrote: I have just looked through this thread, and TBH I think we should reject this patch altogether --- not RWF, but no we don't want this. The use-case remains hypothetical: no performance numbers showing a real-world benefit have been exhibited AFAICS. It's not that hard to imagine a performance benefit though? If the toasted column is accessed infrequently/just after filtering on other columns (not uncommon) it could very well be beneficial to put the main table on fast storage (SSD) and the toast table on slow storage (spinning rust). I've seen humongous toast tables that are barely ever read for tables that are constantly a couple times in the field. +1. I know of one case where the heap was ~8GB and TOAST was over 400GB. Yeah, I think that's a good argument for this. I have to admit that I'm a bit fatigued by this thread - it started out as a "learn PostgreSQL" project, and we iterated through a few design problems that made me kind of worried about what sort of state the patch was in - and now this patch is more than 4 years old. But if some committer still has the energy to go through it in detail and make sure that all of the problems have been fixed and that the patch is, as Andres says, in good shape, then I don't see why we shouldn't take it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue
On Thu, Jun 25, 2015 at 8:43 PM, Brendan Jurd dire...@gmail.com wrote: On Fri, 26 Jun 2015 at 06:03 Gurjeet Singh gurj...@singh.im wrote: s/proportion/fraction/ I think of these as synonymous -- do you have any particular reason to prefer fraction? I don't feel strongly about it either way, so I'm quite happy to go with fraction if folks find that more expressive. It just feels better to me in this context. If the number of times used in Postgres code is any measure, 'fraction' wins hands down: proportion: 33, fraction: 620. I don't feel strongly about it, either. I can leave it up to the committer to decide. + * The caller must hold (at least) shared AsyncQueueLock. A possibly better wording: The caller must hold AsyncQueueLock in (at least) shared mode. Yes, that is more accurate. OK. Unnecessary whitespace changes in pg_proc.h for existing functions. I did group the asynchronous notification functions together, which seemed reasonable as there are now three of them, and changed the tabbing between the function name and namespace ID to match, as is done elsewhere in pg_proc.h. I think those changes improve readability, but again I don't feel strongly about it. Fair enough. +DESCR(get the current usage of the asynchronous notification queue); A possibly better wording: get the fraction of the asynchronous notification queue currently in use. I have no objections to your wording. OK. Please send a new patch with the changes you agree to, and I can mark it ready for committer. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] Grouping Sets: Fix unrecognized node type bug
Jeevan == Jeevan Chalke jeevan.cha...@enterprisedb.com writes: Jeevan Hi, Jeevan It looks like we do support nested GROUPING SETS, I mean Sets Jeevan withing Sets, not other types. However this nesting is broken. Good catch, but I'm not yet sure your fix is correct; I'll need to look into that. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH] Comment fix for miscinit.c
Quick comment fix for edit issue. 0001-Fix-comment.patch Description: Binary data -- David Christensen PostgreSQL Team Manager End Point Corporation da...@endpoint.com 785-727-1171 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello, Please find attached an updated patch with an interface to calculate command progress in pgstat.c. This interface currently implements VACUUM progress tracking. A column named percent_complete has been added to pg_stat_activity to report progress. VACUUM calls the progress calculation interface periodically, at an interval specified by the pgstat_track_progress GUC in ms. Progress calculation can be disabled by setting pgstat_track_progress to -1. Remaining_time for VACUUM is not included in the current patch, to avoid cluttering pg_stat_activity with too many columns. But the estimate as seen in the previous implementation seems reasonable enough to be included in the progress information, maybe as an exclusive view for vacuum progress information. The GUC parameter 'pgstat_track_progress' is currently PGC_SUSET, in line with the 'track_activities' GUC parameter. Although IMO, pgstat_track_progress could be made PGC_USERSET in order to give any user the flexibility to enable/disable progress calculation, provided progress is tracked only if the track_activities GUC parameter is enabled. In this patch, index scans are not taken into account for progress calculation as of now. Thank you, Rahila Syed. Vacuum_progress_checker_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
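A simplified model of the reporting loop described above (names such as PGSTAT_TRACK_PROGRESS_MS and vacuum_heap are invented for illustration; the real patch reports through pgstat into pg_stat_activity and must also handle index scans):

```python
import time

PGSTAT_TRACK_PROGRESS_MS = 500  # hypothetical GUC; -1 disables reporting

def vacuum_heap(total_blocks, process_block, report):
    """Scan heap blocks, reporting percent_complete at most every
    PGSTAT_TRACK_PROGRESS_MS milliseconds."""
    if PGSTAT_TRACK_PROGRESS_MS < 0:
        report = lambda pct: None   # progress tracking disabled
    last_report = None
    for blkno in range(total_blocks):
        process_block(blkno)
        now = time.monotonic()
        if last_report is None or \
                (now - last_report) * 1000 >= PGSTAT_TRACK_PROGRESS_MS:
            report(100.0 * (blkno + 1) / total_blocks)
            last_report = now
    report(100.0)  # final report on completion

seen = []
vacuum_heap(1000, lambda b: None, seen.append)
print(seen[-1])  # 100.0
```

The interval check keeps the overhead of reporting off the per-block hot path, which is presumably why the patch makes it a GUC rather than reporting on every block.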
Re: [HACKERS] Memory Accounting v11
Hi, On 07/15/2015 09:21 PM, Robert Haas wrote: On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Firstly, do we really have good benchmarks and measurements? I really doubt that. We do have some numbers for REINDEX, where we observed 0.5-1% regression on noisy results from a Power machine (and we've been unable to reproduce that on x86). I don't think that's a representative benchmark, and I'd like to see more thorough measurements. And I agreed to do this, once Jeff comes up with a new version of the patch. Secondly, the question is whether the performance is impacted more by the additional instructions, or by other things - say, random padding, as was explained by Andrew Gierth in [1]. I don't know whether that's happening in this patch, but if it is, it seems rather strange to use this against this patch and not the others (because there surely will be other patches causing similar issues). It matters, at least to me, an awful lot where we're adding lines of code. If you want to add modest amounts of additional code to CREATE TABLE or CHECKPOINT or something like that, I really don't care, because that stuff doesn't execute frequently enough to really matter to performance even if we add a significant chunk of overhead, and we're doing other expensive stuff at the same time, like fsync. The same can't be said about functions like LWLockAcquire and AllocSetAlloc that routinely show up at the top of CPU profiles. I agree that there is room to question whether the benchmarks I did are sufficient reason to think that the abstract concern that putting more code into a function might make it slower translates into a measurable drop in performance in practice. But I think when it comes to these very hot code paths, extreme conservatism is warranted. We can agree to disagree about that. No, that is not what I tried to say. I certainly agree that we need to pay attention when adding stuff hot paths - there's no disagreement about this. 
The problem with random padding is that adding code somewhere may introduce padding that affects other pieces of code. That is essentially the point of Andrew's explanation that I linked in my previous message. And the question is - are the differences we've measured really due to code added to the hot path, or something introduced by random padding due to some other changes (possibly in a code that was not even executed during the test)? I don't know how significant impact this may have in this case, or how serious this is in general, but we're talking about 0.5-1% difference on a noisy benchmark. And if such cases of random padding really are a problem in practice, we've certainly missed plenty of other patches with the same impact. Because effectively what Jeff's last patch did was adding a single int64 counter to MemoryContextData structure, and incrementing it for each allocated block (not chunk). I can't really imagine a solution making it cheaper, because there are very few faster operations. Even opt-in won't make this much faster, because you'll have to check a flag and you'll need two fields (flag + counter). Of course, this assumes local counter (i.e. not updating counters the parent contexts), but I claim that's OK. I've been previously pushing for eagerly updating all the parent contexts, so that finding out the allocated memory is fast even when there are many child contexts - a prime example was array_agg() before I fixed it. But I changed my mind on this and now say screw them. I claim that aggregates using a separate memory context for each group are a lost case - they already suffer a significant overhead, and should be fixed just like we fixed array_agg(). That was effectively the outcome of pgcon discussions - to use the simple int64 counter, do the accounting for all contexts, and update only the local counter. 
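That agreed design - one local int64 counter per context, bumped once per allocated block on the hot path, with parent totals computed only on demand - can be sketched like this (illustrative only, not the actual allocator change):

```python
class MemoryContext:
    """Sketch of the agreed accounting design: each context keeps only a
    local byte counter, incremented once per allocated *block* (not per
    chunk); totals over child contexts are computed lazily by walking."""
    def __init__(self, parent=None):
        self.mem_allocated = 0          # the single int64 counter
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def alloc_block(self, size):
        # Hot path: a single add, nothing propagated to parents.
        self.mem_allocated += size

    def total_allocated(self):
        # Cold path: only callers that need the total (e.g. HashAgg
        # deciding whether to spill) pay for the recursion.
        return self.mem_allocated + \
            sum(c.total_allocated() for c in self.children)

top = MemoryContext()
agg = MemoryContext(parent=top)
top.alloc_block(8192)
agg.alloc_block(8192)
agg.alloc_block(16384)
print(top.total_allocated())  # 32768
```

This shows why aggregates that create one context per group pay for it at query time: the walk is proportional to the number of child contexts, which is exactly the case dismissed above as already lost.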
For cases with many child contexts finding out the amount of allocated memory won't be cheap, but well - there's not much we can do about that. I know there was a proposal to force all aggregates to use regular data types as aggregate stats, but I can't see how that could work without a significant performance penalty. For example array_agg() is using internal to pass ArrayBuildState - how do you turn that to a regular data type without effectively serializing/deserializing the whole array on every transition? That is a good point. Tom suggested that his new expanded-format stuff might be able to be adapted to the purpose, but I have no idea how possible that really is. Me neither, and maybe there's a good solution for that, making my concerns unfounded. What might be possible is extending the aggregate API with another custom function returning size of the aggregate state. So when defining an aggregate using 'internal' for aggregate state, you'd specify transfunc, finalfunc and sizefunc. That seems doable, I guess. And infunc and outfunc. If we don't use
Re: [HACKERS] Memory Accounting v11
Hi, On 07/15/2015 07:07 PM, Jeff Davis wrote: On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote: I think a heuristic would be more suited here and ignoring memory consumption for internal types means that we are not making the memory accounting useful for a set of usecases. OK, I will drop this patch and proceed with the HashAgg patch, with a heuristic for internal types. Could someone briefly explain what heuristics are we talking about? -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 15 July 2015 at 12:25, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Simon Riggs wrote: JSON seems the most sensible format for the string. Inventing a new one doesn't make sense. Most important for me is the ability to programmatically manipulate/edit the config string, which would be harder with a new custom format. Do we need to keep the value consistent across all the servers in the flock? If not, is the behavior halfway sane upon failover? Mostly, yes. Which means it doesn't change much, so config data is OK. If we need the DBA to keep the value in sync manually, that's going to be a recipe for trouble. Which is going to bite particularly hard during those stressing moments when disaster strikes and things have to be done in emergency mode. Manual config itself is the recipe for trouble, not this particular setting. There are already many other settings that need to be the same on all nodes for example. Nothing here changes that. This is just an enhancement of the current technology. For the future, a richer mechanism for defining nodes and their associated metadata is needed for logical replication and clustering. That is not what is being discussed here though, nor should we begin! -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Making ParameterStatus available for more parameter types?
On Sun, Jul 12, 2015 at 5:30 AM, Shay Rojansky r...@roji.org wrote: The ParameterStatus message is currently sent for a hard-wired set of parameters (http://www.postgresql.org/docs/current/static/protocol-flow.html#PROTOCOL-ASYNC). Just wanted to let you know that making this more flexible would be a great help in driver implementation. Npgsql maintains its own view of the current statement_timeout parameter for various reasons; as long as the standard ADO.NET API is used to change the timeout everything is OK, but if the user ever manually sets statement_timeout things can become quite messy. Being able to subscribe to more parameter changes would help in this case. In the very short term simply adding statement_timeout to the hard-wired set would help me as well. Requests to add more stuff to ParameterStatus messages are becoming a somewhat regular thing around here. Your idea of giving the client the ability to subscribe to the stuff it cares about might be a good way forward, because sending more and more things all the time adds overhead for all clients, even those that don't care. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
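The subscription idea might look roughly like this. This is purely a sketch of the concept - none of these names exist in the wire protocol, which currently hard-wires the list of reported parameters:

```python
# Sketch: the server sends a ParameterStatus message for a GUC change
# only to clients that subscribed to that GUC.  "subscribe"/"set_guc"
# are invented names for illustration.
class Backend:
    def __init__(self):
        # Stand-in for the current hard-wired set of reported parameters.
        self.subscriptions = {"server_version", "client_encoding"}
        self.sent = []  # messages that would go out on the wire

    def subscribe(self, guc):
        self.subscriptions.add(guc)

    def set_guc(self, name, value):
        if name in self.subscriptions:
            self.sent.append(("ParameterStatus", name, value))

conn = Backend()
conn.set_guc("statement_timeout", "5s")   # not subscribed: no message
conn.subscribe("statement_timeout")
conn.set_guc("statement_timeout", "10s")  # now reported
print(conn.sent)  # [('ParameterStatus', 'statement_timeout', '10s')]
```

The design point is the one made above: a driver like Npgsql pays only for the parameters it cares about, instead of every client paying for an ever-growing hard-wired list.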
Re: [HACKERS] Implementation of global temporary tables?
On 15 July 2015 at 13:26, Andrew Dunstan and...@dunslane.net wrote: On 07/15/2015 07:58 AM, Simon Riggs wrote: For me the design summary is this * CREATE GLOBAL TEMP TABLE creates catalog entries like a normal table but with different relkind * When we see a request to INSERT, DEL, UPD, SEL from the temp table, if it does not exist we create it as a TEMP table of the same name, using the Global's pg_class entry as a template That meets the SQL Standard and doesn't contain any visibility problems or need for new internals. The purpose of this feature is to automatically create a temp table with the same definition whenever needed. The discussion of bloat is just wrong. We create exactly the same amount of bloat as if we had typed CREATE TEMP TABLE. Optimising temp table entries in the catalog is another, separate patch, if we care. Sounds fine in general. I'm a bit curious to know what are the locking implications of vivifying the table on access. We would lock the Global Temp Table at the same lock level for the activity, just as we do for INSERT, SELECT etc. That prevents concurrent DDL like DROP or ALTER on the Global Temp Table. The Local temp table created is a copy of the Global Temp Table. The Local temp table is only locally locked, so no worries. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Implementation of global temporary tables?
2015-07-15 15:53 GMT+02:00 Zhaomo Yang zhy...@cs.ucsd.edu: there is other question - what is effect of ALTER TABLE of global temp table on instances of this table in active sessions? As I said, we need to first agree on the behaviors of the existing commands. I can think of two options now for ALTER TABLE: 1) only allow ALTER TABLE when there is no other active sessions (this is how Oracle deals with it.) 2) handle it as if session copies inherit from the global copy and ALTER TABLE executes on the global copy. There are two possible kinds of GLOBAL TEMP tables - session related and transaction related. Transaction-related tables have a very short life - and @1 needs an outage, while @2 requires stronger locks and can be slower and less effective, because a) some changes can be invisible in other transactions (depending on isolation levels), b) the structure can be changed but the function code not (regardless of isolation levels), so it can be inconsistent, and c) why change a table that will be dropped in the next millisecond? For this case, behavior like that of PL functions could be very practical ~ a third option for ALTER TABLE. Regards Pavel Thanks, Zhaomo
Re: [HACKERS] Implementation of global temporary tables?
On 15 July 2015 at 16:44, Andres Freund and...@anarazel.de wrote: On 2015-07-15 16:36:12 +0100, Simon Riggs wrote: On 15 July 2015 at 16:28, Andres Freund and...@anarazel.de wrote: I think that's generally a fair point. But here we're discussing to add a fair amount of wrinkles with the copy approach. The fact alone that the oid is different will have some ugly consequences. Why? We are creating a local temp table LIKE the global temp table. That is already a supported operation. So there is no different oid. Then your locking against ALTER, DROP etc. isn't going to work. There would be two objects, both locked. The temp table is just nice and simple. No problem. Your optimization may work; I hope it does. My approach definitely will. So we could choose either. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Implementation of global temporary tables?
2015-07-15 15:21 GMT+02:00 Zhaomo Yang zhy...@cs.ucsd.edu: Sounds fine in general. I'm a bit curious to know what are the locking implications of vivifying the table on access. The locking implications depend on how we interpret the existing commands in the context of global temp tables and I think we should discuss and agree on the behaviors of the commands with global temp tables, but I think in general we can follow these rules: If the command executes on the global temp table's metadata, for example an ALTER TABLE command, then we lock the global copy at the same level as we do a regular table. there is other question - what is effect of ALTER TABLE of global temp table on instances of this table in active sessions? If the command executes on the global temp table's data (which is actually stored in the session copy), for example an DML command, then the global copy is locked at the AccessShareLock level to prevent concurrent modifications to the global temp table's definition from other sessions. Thanks, Zhaomo On Wed, Jul 15, 2015 at 4:26 AM, Andrew Dunstan and...@dunslane.net wrote: On 07/15/2015 07:58 AM, Simon Riggs wrote: For me the design summary is this * CREATE GLOBAL TEMP TABLE creates catalog entries like a normal table but with different relkind * When we see a request to INSERT, DEL, UPD, SEL from the temp table, if it does not exist we create it as a TEMP table of the same name, using the Global's pg_class entry as a template That meets the SQL Standard and doesn't contain any visibility problems or need for new internals. The purpose of this feature is to automatically create a temp table with the same definition whenever needed. The discussion of bloat is just wrong. We create exactly the same amount of bloat as if we had typed CREATE TEMP TABLE. Optimising temp table entries in the catalog is another, separate patch, if we care. Sounds fine in general. I'm a bit curious to know what are the locking implications of vivifying the table on access. 
cheers andrew
[HACKERS] Grouping Sets: Fix unrecognized node type bug
Hi, It looks like we do support nested GROUPING SETS, I mean Sets within Sets, not other types. However this nesting is broken. Here is a simple example where I would expect three rows in the result. But unfortunately it gives an unrecognized node type error, which is weird and needs a fix. postgres=# create table gstest2 (a integer, b integer, c integer); postgres=# insert into gstest2 values (1,1,1), (1,1,1), (1,1,1), (1,1,1), (1,1,1), (1,1,1), (1,1,2), (1,2,2), (2,2,2); postgres=# select sum(c) from gstest2 group by grouping sets((), grouping sets((), grouping sets(()))) order by 1 desc; ERROR: unrecognized node type: 926 I spent much time understanding the cause and was looking into the transformGroupingSet() and transformGroupClauseList() functions. I tried fixing the unrecognized node type: 926 error there, but then it failed with unrecognized node type: 656. Later I realized that we actually have an issue while flattening grouping sets. If we have nested grouping sets like the above, then we get a GroupingSet node inside the list, and transformGroupClauseList() does not expect that and ends up with this error. I have fixed this issue in flatten_grouping_sets(): after flattening a nested grouping sets node, we need to concat the result with the existing list and should not append it. This alone does not solve the issue, as we need a list when we have a ROW expression. Thus there, if not at top level, I am creating a list now. Attached is a patch with a few testcases too. Please have a look.
Thanks -- Jeevan B Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c index e90e1d6..31d4331 100644 --- a/src/backend/parser/parse_clause.c +++ b/src/backend/parser/parse_clause.c @@ -1779,8 +1779,19 @@ flatten_grouping_sets(Node *expr, bool toplevel, bool *hasGroupingSets) RowExpr *r = (RowExpr *) expr; if (r->row_format == COERCE_IMPLICIT_CAST) - return flatten_grouping_sets((Node *) r->args, - false, NULL); +{ + Node *n1 = flatten_grouping_sets((Node *) r->args, + false, NULL); + + /* + * Make a list for row expression if toplevel is false, + * return flatten list otherwise + */ + if (toplevel) + return (Node *) n1; + else + return (Node *) list_make1(n1); +} } break; case T_GroupingSet: @@ -1805,7 +1816,10 @@ flatten_grouping_sets(Node *expr, bool toplevel, bool *hasGroupingSets) { Node *n2 = flatten_grouping_sets(lfirst(l2), false, NULL); - result_set = lappend(result_set, n2); + if (IsA(n2, List)) + result_set = list_concat(result_set, (List *) n2); + else + result_set = lappend(result_set, n2); } /* diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out index adb39b3..5c47717 100644 --- a/src/test/regress/expected/groupingsets.out +++ b/src/test/regress/expected/groupingsets.out @@ -145,6 +145,112 @@ select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum | | 12 | 36 (6 rows) +-- nesting with grouping sets +select sum(c) from gstest2 + group by grouping sets((), grouping sets((), grouping sets(()))) + order by 1 desc; + sum +- + 12 + 12 + 12 +(3 rows) + +select sum(c) from gstest2 + group by grouping sets((), grouping sets((), grouping sets(((a, b))))) + order by 1 desc; + sum +- + 12 + 12 + 8 + 2 + 2 +(5 rows) + +select sum(c) from gstest2 + group by grouping sets(grouping sets(rollup(c), grouping sets(cube(c)))) + order by 1 desc; + sum +- + 12 + 12 + 6 + 6 + 6 + 6 +(6 rows) + 
+select sum(c) from gstest2 + group by grouping sets(a, grouping sets(a, cube(b))) + order by 1 desc; + sum +- + 12 + 10 + 10 + 8 + 4 + 2 + 2 +(7 rows) + +select sum(c) from gstest2 + group by grouping sets(grouping sets((a, (b + order by 1 desc; + sum +- + 8 + 2 + 2 +(3 rows) + +select sum(c) from gstest2 + group by grouping sets(grouping sets((a, b))) + order by 1 desc; + sum +- + 8 + 2 + 2 +(3 rows) + +select sum(c) from gstest2 + group by grouping sets(grouping sets(a, grouping sets(a), a)) + order by 1 desc; + sum +- + 10 + 10 + 10 + 2 + 2 + 2 +(6 rows) + +select sum(c) from gstest2 + group by grouping sets(grouping sets(a, grouping sets(a, grouping sets(a), ((a)), a, grouping sets(a), (a)), a)) + order by 1 desc; + sum +- + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 10 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 +(16 rows) + -- empty input: first is 0 rows, second 1, third 3 etc. select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),a); a | b | sum | count diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql index 0883afd..e478d34 100644 ---
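The essence of the fix - splicing a flattened nested grouping set into the result with concat instead of appending it as a single node - can be modelled outside the parser (illustrative Python, not the C code in the patch):

```python
class GroupingSet:
    """Stand-in for the parser's T_GroupingSet node."""
    def __init__(self, content):
        self.content = content  # mix of column tuples and nested GroupingSets

def flatten(gs):
    """Return a flat list of grouping sets (each a tuple of columns)."""
    result = []
    for item in gs.content:
        if isinstance(item, GroupingSet):
            # The fix: splice the flattened sub-list in with concat.
            # Appending it as one element is what left a GroupingSet
            # node in the list and produced "unrecognized node type".
            result.extend(flatten(item))
        else:
            result.append(item)
    return result

# group by grouping sets((), grouping sets((), grouping sets(())))
nested = GroupingSet([(), GroupingSet([(), GroupingSet([()])])])
print(flatten(nested))  # [(), (), ()] -> three grouping sets, three rows
```

This matches the first regression test above: three empty grouping sets, hence three result rows.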
Re: [HACKERS] optimizing vacuum truncation scans
On Mon, Jun 29, 2015 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote: Attached is a patch that implements the vm scan for truncation. It introduces a variable to hold the last blkno which was skipped during the forward portion. Any blocks after both this blkno and after the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum. Any other process rendering the page nonempty is required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime. So if the bit is still set, the page is still empty without needing to inspect it.

One case where this patch can behave differently than the current code is: before taking the exclusive lock on the relation for truncation, if some backend clears the vm bit and then inserts-deletes-prunes that page, with the patch that page will not be considered for truncation, whereas the current code will allow truncating it since it re-verifies each page. I think such a case would be rare and we might not want to bother about it, but it is still worth thinking about once.

Some minor points about the patch:

count_nondeletable_pages()
{
..
	if (PageIsNew(page) || PageIsEmpty(page))
	{
		/* PageIsNew probably shouldn't happen... */
		UnlockReleaseBuffer(buf);
		continue;
	}
..
}

Why is vmbuffer not freed in the above loop?

count_nondeletable_pages()
{
..
+	/*
+	 * Pages that were inspected and found to be empty by this vacuum run
+	 * must still be empty if the vm bit is still set. Only vacuum sets
+	 * vm bits, and only one vacuum can exist in a table at one time.
+	 */
+	trust_vm=vacrelstats->nonempty_pages>vacrelstats->skipped_pages ?
+		vacrelstats->nonempty_pages : vacrelstats->skipped_pages;
..
}

I think it is better to have spaces in the above assignment statement (near '=' and near '>').

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Implementation of global temporary tables?
there is another question - what is the effect of ALTER TABLE on a global temp table for instances of this table in active sessions?

As I said, we first need to agree on the behaviors of the existing commands. I can think of two options for ALTER TABLE now:

1) only allow ALTER TABLE when there are no other active sessions (this is how Oracle deals with it);
2) handle it as if session copies inherit from the global copy, and ALTER TABLE executes on the global copy.

Thanks, Zhaomo
Re: [HACKERS] Implementation of global temporary tables?
On 15 July 2015 at 15:57, Andres Freund and...@anarazel.de wrote: On 2015-07-15 16:52:49 +0200, Andres Freund wrote: Why do we need to create that copy? We can just use the relfilenode in all backends by having the backendid in the filename? Yes, there's some amount of additional code needed, but it's not that much? I actually think it might end up being less additional code than having a copy, because with the copy you'll have two different oids for the global entry and the local copy.

Hm, yes. Brainfart. Transaction table rewrites/truncations need to change the relfilenode. To fix, we could add a backend-local mapping table from global temp table id to the backend-local relfilenode. The code to look up the relfilenode is already mostly isolated.

It may be possible to do this, though I'm sure there's a wrinkle somewhere. But there doesn't seem to be a need to overload the main feature request with additional requirements. Doing that is just scope creep that prevents us getting features out. Nice, simple patches from newer developers; later tuning and tweaking from more expert community members.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Implementation of global temporary tables?
Sounds fine in general. I'm a bit curious to know what are the locking implications of vivifying the table on access.

The locking implications depend on how we interpret the existing commands in the context of global temp tables, and I think we should discuss and agree on the behaviors of the commands with global temp tables, but in general I think we can follow these rules: if the command operates on the global temp table's metadata, for example an ALTER TABLE command, then we lock the global copy at the same level as we would a regular table. If the command operates on the global temp table's data (which is actually stored in the session copy), for example a DML command, then the global copy is locked at the AccessShareLock level to prevent concurrent modifications to the global temp table's definition from other sessions.

Thanks, Zhaomo

On Wed, Jul 15, 2015 at 4:26 AM, Andrew Dunstan and...@dunslane.net wrote: On 07/15/2015 07:58 AM, Simon Riggs wrote: For me the design summary is this: * CREATE GLOBAL TEMP TABLE creates catalog entries like a normal table but with a different relkind * When we see a request to INSERT, DEL, UPD, SEL from the temp table, if it does not exist we create it as a TEMP table of the same name, using the Global's pg_class entry as a template. That meets the SQL Standard and doesn't contain any visibility problems or need for new internals. The purpose of this feature is to automatically create a temp table with the same definition whenever needed. The discussion of bloat is just wrong. We create exactly the same amount of bloat as if we had typed CREATE TEMP TABLE. Optimising temp table entries in the catalog is another, separate patch, if we care.

Sounds fine in general. I'm a bit curious to know what are the locking implications of vivifying the table on access.

cheers
andrew
Re: [HACKERS] Implementation of global temporary tables?
On 2015-07-15 16:24:52 +0100, Simon Riggs wrote: It may be possible to do this, though I'm sure there's a wrinkle somewhere. But there doesn't seem to be a need to overload the main feature request with additional requirements. Doing that is just scope creep that prevents us getting features out. Nice, simple patches from newer developers. Later tuning and tweaking from more expert community members. I think that's generally a fair point. But here we're discussing to add a fair amount of wrinkles with the copy approach. The fact alone that the oid is different will have some ugly consequences. So we add complexity, just to shift it into different places later? I'm not sure that's a good idea. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Implementation of global temporary tables?
On 15 July 2015 at 16:28, Andres Freund and...@anarazel.de wrote: On 2015-07-15 16:24:52 +0100, Simon Riggs wrote: It may be possible to do this, though I'm sure there's a wrinkle somewhere. But there doesn't seem to be a need to overload the main feature request with additional requirements. Doing that is just scope creep that prevents us getting features out. Nice, simple patches from newer developers. Later tuning and tweaking from more expert community members.

I think that's generally a fair point. But here we're discussing to add a fair amount of wrinkles with the copy approach. The fact alone that the oid is different will have some ugly consequences.

Why? We are creating a local temp table LIKE the global temp table. That is already a supported operation. So there is no different oid.

So we add complexity, just to shift it into different places later? I'm not sure that's a good idea.

There's no complexity in a simple temp table LIKE. We can do this now with triggers.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Implementation of global temporary tables?
Andres Freund and...@anarazel.de writes: On 2015-07-15 16:24:52 +0100, Simon Riggs wrote: It may be possible to do this, though I'm sure there's a wrinkle somewhere. But there doesn't seem to be a need to overload the main feature request with additional requirements. Doing that is just scope creep that prevents us getting features out. Nice, simple patches from newer developers. Later tuning and tweaking from more expert community members. I think that's generally a fair point. But here we're discussing to add a fair amount of wrinkles with the copy approach. The fact alone that the oid is different will have some ugly consequences. So we add complexity, just to shift it into different places later? I'm not sure that's a good idea.

With all due respect, there are features that are beyond the abilities of some newer developers, and reducing the scope isn't a good way to fix that. It just leaves a bigger mess to be cleaned up later. I think Andres' idea of a per-backend filenode mapping table might work. The existing relfilenode mapper solves a somewhat related problem, namely how do you replace the filenode for shared system catalogs whose pg_class entries can't be changed.

regards, tom lane
Re: [HACKERS] Implementation of global temporary tables?
Simon Riggs si...@2ndquadrant.com writes: On 15 July 2015 at 16:28, Andres Freund and...@anarazel.de wrote: I think that's generally a fair point. But here we're discussing to add a fair amount of wrinkles with the copy approach. The fact alone that the oid is different will have some ugly consequences. Why? We are creating a local temp table LIKE the global temp table. That is already a supported operation. So there is no different oid.

You're presuming a specific implementation decision, one that has not been made yet, and isn't all that attractive because of the catalog bloat issues.

regards, tom lane
Re: [HACKERS] [DESIGN] Incremental checksums
On Jul 15, 2015, at 3:18 AM, Amit Kapila amit.kapil...@gmail.com wrote: - pg_disable_checksums(void) => turn checksums off for a cluster. Sets the state to disabled, which means bg_worker will not do anything. - pg_request_checksum_cycle(void) => if checksums are enabled, increment the data_checksum_cycle counter and set the state to enabling. If the cluster is already enabled for checksums, then what is the need for any other action?

You are assuming this is a one-way action. Some people may decide that checksums end up taking too much overhead or similar; we should support disabling of this feature, and with this proposed patch the disable action is fairly trivial to handle. Requesting an explicit checksum cycle would be desirable in the case where you want to proactively verify there is no cluster corruption to be found.

David
--
David Christensen
PostgreSQL Team Manager
End Point Corporation
da...@endpoint.com
785-727-1171
Re: [HACKERS] LWLock deadlock and gdb advice
On 06/30/2015 11:24 PM, Andres Freund wrote: On 2015-06-30 22:19:02 +0300, Heikki Linnakangas wrote: Hm. Right. A recheck of the value after the queuing should be sufficient to fix? That's how we deal with the exact same scenarios for the normal lock state, so that shouldn't be very hard to add. Yeah. It's probably more efficient to not release the spinlock between checking the value and LWLockQueueSelf() though. Right now LWLockQueueSelf() takes spinlocks etc itself, and is used that way in a bunch of callsites... So that'd be harder. Additionally I'm planning to get rid of the spinlocks around queuing (they show up as bottlenecks in contended workloads), so it seems more future proof not to assume that either way. On top of that I think we should, when available (or using the same type of fallback as for 32bit?), use 64 bit atomics for the var anyway. I'll take a stab at fixing this tomorrow. Thanks! Do you mean both or just the second issue? Both. Here's the patch. Previously, LWLockAcquireWithVar set the variable associated with the lock atomically with acquiring it. Before the lwlock-scalability changes, that was straightforward because you held the spinlock anyway, but it's a lot harder/expensive now. So I changed the way acquiring a lock with a variable works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that the current lock holder has updated the variable. The LWLockAcquireWithVar function is gone - you now just use LWLockAcquire(), which always clears the LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if you want to set the variable immediately. LWLockWaitForVar() always waits if the flag is not set, i.e. it will not return regardless of the variable's value, if the current lock-holder has not updated it yet. This passes make check, but I haven't done any testing beyond that. Does this look sane to you? 
- Heikki

From ce90bd9a1e2c8b5783ebeea83594da7d3a1c63de Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 15 Jul 2015 18:33:28 +0300
Subject: [PATCH 1/1] Fix LWLock variable support, broken by the lwlock scalability patch

---
 src/backend/access/transam/xlog.c |  15 ++---
 src/backend/storage/lmgr/lwlock.c | 135 ++
 src/include/storage/lwlock.h      |   1 -
 3 files changed, 83 insertions(+), 68 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1dd31b3..2f34346 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1408,9 +1408,7 @@ WALInsertLockAcquire(void)
 	 * The insertingAt value is initially set to 0, as we don't know our
 	 * insert location yet.
 	 */
-	immed = LWLockAcquireWithVar(&WALInsertLocks[MyLockNo].l.lock,
-								 &WALInsertLocks[MyLockNo].l.insertingAt,
-								 0);
+	immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
 	if (!immed)
 	{
 		/*
@@ -1442,13 +1440,12 @@ WALInsertLockAcquireExclusive(void)
 	 */
 	for (i = 0; i < NUM_XLOGINSERT_LOCKS - 1; i++)
 	{
-		LWLockAcquireWithVar(&WALInsertLocks[i].l.lock,
-							 &WALInsertLocks[i].l.insertingAt,
-							 PG_UINT64_MAX);
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		LWLockUpdateVar(&WALInsertLocks[i].l.lock,
+						&WALInsertLocks[i].l.insertingAt,
+						PG_UINT64_MAX);
 	}
-	LWLockAcquireWithVar(&WALInsertLocks[i].l.lock,
-						 &WALInsertLocks[i].l.insertingAt,
-						 0);
+	LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);

 	holdingAllLocks = true;
 }
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46cab49..92a1aef 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -11,12 +11,12 @@
  * LWLocks to protect its shared state.
  *
  * In addition to exclusive and shared modes, lightweight locks can be used
- * to wait until a variable changes value. The variable is initially set
- * when the lock is acquired with LWLockAcquireWithVar, and can be updated
- * without releasing the lock by calling LWLockUpdateVar. LWLockWaitForVar
- * waits for the variable to be updated, or until the lock is free. The
- * meaning of the variable is up to the caller, the lightweight lock code
- * just assigns and compares it.
+ * to wait until a variable changes value. The variable is initially put
+ * in a not-set state when the lock is acquired with LWLockAcquire, and
+ * can be updated without releasing the lock by calling LWLockUpdateVar.
+ * LWLockWaitForVar waits for the variable to be updated, or until the lock
+ * is free. The meaning of the variable is up to the caller, the lightweight
+ * lock code just assigns and compares it.
  *
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -99,6 +99,7 @@ extern slock_t *ShmemLock;
 #define LW_FLAG_HAS_WAITERS			((uint32) 1 << 30)
 #define
Re: [HACKERS] Implementation of global temporary tables?
On 2015-07-15 16:36:12 +0100, Simon Riggs wrote: On 15 July 2015 at 16:28, Andres Freund and...@anarazel.de wrote: I think that's generally a fair point. But here we're discussing to add a fair amount of wrinkles with the copy approach. The fact alone that the oid is different will have some ugly consequences. Why? We are creating a local temp table LIKE the global temp table. That is already a supported operation. So there is no different oid.

Then your locking against ALTER, DROP etc. isn't going to work.
Re: [HACKERS] [PATCH] Generalized JSON output functions
On Mon, Jul 13, 2015 at 10:46 AM, Ryan Pedela rped...@datalanche.com wrote: As far as large numbers in JSON, I think Postgres is doing the right thing and should not be changed. It is Javascript that is stupid here, and I don't think it is wise to add something to core just because one client does stupid things with large numbers. In addition, ES7 is introducing value types which will hopefully solve the large number problem in Javascript. FWIW, I don't agree. If it's not easy to read the JSON that PostgreSQL generates using JavaScript, then a lot of people are just going to give up on doing it, and IMO that would be sad. Telling people that they have to parse the JSON using some parser other than the one built into their JavaScript engine, whack it around, and then render it as text and parse it again is not really an acceptable answer. The reason why the logical decoding stuff allows multiple output formats is because Andres, quite correctly, foresaw that different people would need different output formats. He could have designed that system to output only one output format and just said, everybody's got to read and parse this, but that would have been slow. Instead, he tried to set things up so that you could get the output in the format that was most convenient for your client, whatever that is. On this thread, we're back-pedaling from that idea: sorry, you can get JSON output, but if you want JSON output that will be properly interpreted by your JSON parser, you can't have that. Regardless of the details of this particular patch, I can't endorse that approach. If we want people to use our software, we need to meet them where they are at, especially when we are only (IIUC) talking about inserting a few extra quotation marks. 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Implementation of global temporary tables?
On 2015-07-15 12:58:51 +0100, Simon Riggs wrote: On 14 July 2015 at 23:20, Jim Nasby jim.na...@bluetreble.com wrote: Pavel's original description of how to do this seems valid, and from the link Tom agreed in 2009. For me the design summary is this: * CREATE GLOBAL TEMP TABLE creates catalog entries like a normal table but with a different relkind * When we see a request to INSERT, DEL, UPD, SEL from the temp table, if it does not exist we create it as a TEMP table of the same name, using the Global's pg_class entry as a template.

Why do we need to create that copy? We can just use the relfilenode in all backends by having the backendid in the filename? Yes, there's some amount of additional code needed, but it's not that much? I actually think it might end up being less additional code than having a copy, because with the copy you'll have two different oids for the global entry and the local copy.

Regards, Andres
Re: [HACKERS] Implementation of global temporary tables?
On 2015-07-15 16:52:49 +0200, Andres Freund wrote: Why do we need to create that copy? We can just use the relfilenode in all backends by having the backendid in the filename? Yes, there's some amount of additional code needed, but it's not that much? I actually think it might end up being less additional code than having a copy, because with the copy you'll have two different oids for the global entry and the local copy.

Hm, yes. Brainfart. Transaction table rewrites/truncations need to change the relfilenode. To fix, we could add a backend-local mapping table from global temp table id to the backend-local relfilenode. The code to look up the relfilenode is already mostly isolated.
Re: [HACKERS] First Aggregate Funtion?
On Tue, Jul 14, 2015 at 9:23 AM, sudalai sudala...@gmail.com wrote:

The above implementation of first aggregate returns the first non-NULL item value. To get *first row item value* for a column use the below implementation.

-- create a function that pushes at most two elements onto the given array
-- (the first row's value lands at the second index of the array)
CREATE OR REPLACE FUNCTION two_value_holder(anyarray, anyelement)
returns anyarray as $$
    select case when array_length($1,1) < 2 then array_append($1,$2) else $1 end;
$$ language sql immutable;

-- create a function that returns the second element of an array
CREATE OR REPLACE FUNCTION second_element(ANYARRAY) RETURNS ANYELEMENT AS $$
    SELECT $1[2];
$$ LANGUAGE SQL;

-- create a "first" aggregate function that returns the first row's item value
CREATE AGGREGATE first(anyelement) (
    SFUNC=two_value_holder,
    STYPE=ANYARRAY,
    INITCOND='{NULL}',
    FINALFUNC=second_element
);

I hope this works.

I don't think so, because arrays can contain duplicates.

rhaas=# select coalesce(first(x.column1), 'wrong')
        from (values (null), ('correct')) x;
 coalesce
----------
 wrong
(1 row)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: ctidscan as an example of custom-scan (Re: [HACKERS] [v9.5] Custom Plan API)
On Tue, Jul 14, 2015 at 6:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Both you and Andres have articulated the concern that CustomScan isn't actually useful, but I still don't really understand why not. I'm curious, for example, whether CustomScan would have been sufficient to build TABLESAMPLE, and if not, why not. Obviously the syntax has to be in core, ... so you just made the point ... If you're going to set the bar that high, there's no point in anyone ever trying to add extensibility anywhere, because there will always be some other place where there isn't enough extensibility someplace else to do everything that someone might want. The fact that the parser isn't extensible doesn't make custom scans useless any more than the fact that the non-extensibility of WAL-logging makes pg_am useless. but why couldn't the syntax just call an extension-provided callback that returns a custom scan, instead of having a node just for TABLESAMPLE? Because that only works for small values of work. As far as TABLESAMPLE goes, I intend to hold Simon's feet to the fire until there's a less cheesy answer to the problem of scan reproducibility. Assuming we're going to allow sample methods that can't meet the reproducibility requirement, we need something to prevent them from producing visibly broken query results. Ideally, the planner would avoid putting such a scan on the inside of a nestloop. A CustomScan-based implementation could not possibly arrange such a thing; we'd have to teach the core planner about the concern. Well, I think it would be better to change the tablesample methods as you previously proposed, so that they are actually stable. But if that's not possible, then when you (or someone) makes the core planner able to consider such matters, you could add a CUSTOM_PATH_UNSTABLE flag that a custom path can use to notify the core system about the problem. 
Then tablesample methods that are unstable can set the flag and those that are stable are not penalized. Presumably we'll end up with such a flag in the tablesample-path anyway, and a separate flag in every kind of future scan that needs similar consideration. If we put it in some piece of infrastructure (like the custom-path stuff) that is reusable, we could avoid reinventing the wheel every time. Or, taking the example of a GpuScan node, it's essentially impossible to persuade the planner to delegate any expensive function calculations, aggregates, etc to such a node; much less teach it that that way is cheaper than doing such things the usual way. So yeah, KaiGai-san may have a module that does a few things with a GPU, but it's far from doing all or even very much of what one would want. It's true that there are things he can't do this way, but without the custom-scan stuff, he'd be able to do even less. Now, as part of the upper-planner-rewrite business that I keep hoping to get to when I'm not riding herd on bad patches, it's likely that we might have enough new infrastructure soon that that particular problem could be solved. But there would just be another problem after that; a likely example is not having adequate statistics, or sufficiently fine-grained function cost estimates, to be able to make valid choices once there's more than one way to do such calculations. (I'm not really impressed by the GPU is *always* faster approaches.) Significant improvements of that sort are going to take core-code changes. Probably, but giving people the ability to experiment outside core, even if it's limited, often leads to good things. And when it doesn't, it doesn't really cost us anything. Even worse, if there do get to be any popular custom-scan extensions, we'll break them anytime we make any nontrivial planner changes, because there is no arms-length API there. 
A trivial example is that even adding or changing any fields in struct Path will necessarily break custom scan providers, because they build Paths for themselves with no interposed API. I agree that will happen, but I don't care. This happens all the time, and every person other than yourself who has commented on this issue has said that they'd rather have an API exposed that will later get whacked around without warning than have no API exposed at all, not just with respect to custom-paths, but with a whole variety of other things as well, like sticking PGDLLIMPORT on all of the variables that back GUCs. It is really far more tedious to try to work around the lack of a PGDLLIMPORT marking (which causes a huge annoyance now) than to adjust the code if and when somebody changes something about the GUC (which causes a small annoyance later, and only if there actually are changes). In large part this is the same as my core concern about the TABLESAMPLE patch: exposing dubiously-designed APIs is soon going to force us to make choices between breaking those APIs or not being able to make
Re: ctidscan as an example of custom-scan (Re: [HACKERS] [v9.5] Custom Plan API)
On Wed, Jul 15, 2015 at 2:28 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jul 15, 2015 at 10:12 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: We never guarantee the interface compatibility between major version up. If we add/modify interface on v9.6, it is duty for developer of extensions to follow the new version, even not specific to custom-scan provider. If v9.6 adds support fine-grained function cost estimation, I also have to follow the feature, but it is natural. Maintaining compatibility across major versions is a best-effort and even if we sometimes break things across major versions, and sometimes even silently (take the example of 9.3's background worker that do not start with 9.4 as long as bgw_notify_pid is not set to 0), the approach is usually taken to have APIs stable and convenient able to cover a maximum set of cases for a given set of plugins, and this serves well in the long term. Now it is true that we cannot assume either that the version of a plugin API will be perfect forever and will be able to cover all the imagined test cases at first shot, still I'd like to think that things are broken only when necessary and with good reasons. A set of APIs changed at each major release tends to be proof that research lacked in the first version, and would surely demotivate its adoption by extension developers. Maybe. If those changes are demanded by extension developers who are trying to do useful stuff with the APIs and finding problems, then changing things will probably make those extension developers happy. If the changes are necessitated by core improvements, then we might get some complaints if those core improvements are perceived to have small value compared to the API break. But I think complaints of that kind are very rare. The only thing I can think of that falls into approximately that category is a gripe about the replacement of relistemp by relpersistence. 
But I think that was more motivated by the way it broke SQL monitoring queries than by the way it broke C code. There may be other more recent examples; but that's the only one I can think of offhand. I just don't see it as a big issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company