Re: [HACKERS] preprocess_targetlist and inheritance
On 2015/04/22 0:43, Stephen Frost wrote: On Tuesday, April 21, 2015, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: Stephen Frost wrote: > Tom, all, > > Looks like preprocess_targetlist() should have been adjusted with the > changes to ExecBuildAuxRowMark() to support foreign tables being part > of inheritance trees (cb1ca4d800621dcae67ca6c799006de99fa4f0a5) to > also include the tableoid regardless of the rowMark type, if the > relation is the parent of an inheritance tree. Uh, this patch was already submitted separately: https://www.postgresql.org/message-id/552cf0b6.8010...@lab.ntt.co.jp Oh, excellent. Looks like the discussion agrees that it makes sense to do, and the RLS case didn't involve any foreign tables yet still ran into the issue. IIUC, I think the issue would arise when performing SELECT FOR UPDATE/SHARE on an inheritance set whose parent and children are all foreign tables, or when performing UPDATE/DELETE involving such an inheritance set as a source table. I think your new RLS-with-inheritance regression tests probably include such a query. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
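To make the case described above concrete, here is a minimal sketch, with made-up object names and server options (not taken from the thread), of an inheritance set whose members are all foreign tables, together with the query shapes that exercise the row-mark handling in question; the remote tables named in the OPTIONS clauses are assumed to exist:

CREATE EXTENSION postgres_fdw;
CREATE SERVER remote_srv FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'remotedb');
CREATE USER MAPPING FOR CURRENT_USER SERVER remote_srv;

-- Parent and child are both foreign tables.
CREATE FOREIGN TABLE parent_ft (id int, val text)
    SERVER remote_srv OPTIONS (table_name 'parent_src');
CREATE FOREIGN TABLE child_ft ()
    INHERITS (parent_ft)
    SERVER remote_srv OPTIONS (table_name 'child_src');

CREATE TABLE local_t (id int);

-- Row marking across the whole inheritance set ...
SELECT * FROM parent_ft FOR UPDATE;
-- ... and UPDATE/DELETE using the inheritance set as a source relation.
DELETE FROM local_t USING parent_ft p WHERE local_t.id = p.id;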
Re: [HACKERS] Streaming replication and WAL archive interactions
On Wed, Apr 22, 2015 at 3:38 PM, Heikki Linnakangas wrote: > On 04/22/2015 03:30 AM, Michael Paquier wrote: >> >> This is going to change a behavior that people are used to for a >> couple of releases. I would not mind having this patch do >> "archive_mode = on during recovery" => archive only segments generated >> by this node + the last partial segment on the old timeline at >> promotion, without renaming to preserve backward compatible behavior. >> If master and standby point to separate archive locations, this way >> the operator can sort out later the one he would want to use. If they >> point to the same location, archive_command scripts already do >> internally such renaming, at least that's what I suspect an >> experienced user would do. Now, it is true that not many people are >> experienced in this area; I see mistakes regarding such things on a >> weekly basis... This .partial segment renaming is something that we >> should let the archive_command manage with its internal logic. > > > Currently, the archive command doesn't know if the segment it's archiving is > partial or not, so you can't put any logic there to manage it. But if we > archive it with the .partial suffix, then you can put logic in the > restore_command to check for .partial files, if you really want to. Well, now you can check as well if there is a file with the same name already archived and append a suffix to the new file copied, keep the two files, and then let restore_command sort things out as it wants with the two segment files it finds. > I feel that the best approach is to archive the last, partial segment, but > with the .partial suffix. I don't see any plausible real-world setup where > the current behavior would be better. I don't really see much need to > archive the partial segment at all, but there's also no harm in doing it, as > long as it's clearly marked with the .partial suffix. Well, as long as it is clearly archived at promotion, even with a suffix, I guess that I am fine... This will need some tweaking on restore_command for existing applications, but as long as it is clearly documented I am fine. Shouldn't this be a different patch though? > BTW, pg_receivexlog also uses a ".partial" file, while it's streaming WAL > from the server. The .partial suffix is removed when the segment is > complete. So there's some precedent for this. pg_receivexlog adds just > ".partial" to the filename, it doesn't add any information of what portion > of the file is valid like I suggested here, though. Perhaps we should follow > pg_receivexlog's example at promotion too, for consistency. Consistency here sounds good to me. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Update docs in fdwhandler.sgml
On 2015/04/22 6:50, Robert Haas wrote: On Tue, Apr 21, 2015 at 5:45 AM, Etsuro Fujita wrote: Since we now allow CHECK constraints to be placed on foreign tables, not only NOT NULL, I think it'd be better to update docs on considerations about constraints on foreign tables in fdwhandler.sgml, so as to provide more general considerations. Please find attached a patch. Looks good to me, so committed. Thanks! Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
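For context on the documentation change, here is a small sketch of what is now accepted on foreign tables; the server and table names are illustrative only, and the constraints are declarative, i.e. the core server assumes they hold rather than enforcing them against the remote data:

CREATE FOREIGN TABLE orders_ft (
    id       int NOT NULL,
    quantity int CHECK (quantity > 0)
) SERVER some_server OPTIONS (table_name 'orders');

ALTER FOREIGN TABLE orders_ft
    ADD CONSTRAINT orders_ft_id_positive CHECK (id > 0);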
Re: [HACKERS] moving from contrib to bin
On Sun, Apr 12, 2015 at 12:36 PM, Peter Eisentraut wrote: > On 3/11/15 8:21 PM, Michael Paquier wrote: >> Attached is a series of patch rebased on current HEAD, there were some >> conflicts after perl-tidying the refactoring patch for MSVC. Note that >> this series still uses PGXS in the Makefiles, I am fine to update them >> if necessary once this matter is set (already did this stuff upthread >> with a previous version). > > OK, here we go. I have committed the pg_archivecleanup move, with a > complete makefile and your Windows help. Let's see how it builds. All the patches have been committed, finishing the work on this thread. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] moving from contrib to bin
Peter, Michael, On 2015-04-22 16:13:15 +0900, Michael Paquier wrote: > All the patches have been committed, finishing the work on this thread. Many thanks for that effort! Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests
On 04/16/2015 06:51 AM, Alvaro Herrera wrote: Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use? I thought the -w would wait until promotion has taken effect, so there's no need to sleep additional time. -w is not supported with pg_ctl promote. Only start, stop and restart. It's accepted, but it doesn't do anything. Which isn't very nice... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 21/04/15 22:36, Andres Freund wrote: On 2015-04-21 16:26:08 -0400, Robert Haas wrote: On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund wrote: I've now named the functions:
* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per-transaction details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is an origin configured for the session?
* pg_replication_progress_advance : "manually" set the replication progress to a value. Primarily useful for copying values from other systems and such.
* pg_replication_progress_get : How far did replay progress for a certain origin?
* pg_get_replication_progress : SRF returning the replay progress for all origins.
Any comments? Why are we using functions for this rather than DDL? Unless I miss something, the only two we really could use DDL for are pg_replication_origin_create/pg_replication_origin_drop. We could use DDL for them if we really want, but I'm not really seeing the advantage. I think the only value of having DDL for this would be consistency (catalog objects are created via DDL), as it looks like something that will be called only by extensions and not users during normal operation. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
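As a rough, hypothetical usage sketch of the function names listed above (the thread does not show the exact signatures, so the argument lists below are assumptions for illustration, and the names may still change before commit), an apply process might do something like:

SELECT pg_replication_origin_create('node_a_to_node_b');

-- In the apply session: attribute subsequent changes to that origin and
-- record the remote commit LSN/timestamp before committing locally.
SELECT pg_replication_progress_setup_origin('node_a_to_node_b');
SELECT pg_replication_progress_setup_tx_details('0/1634F10', now());
-- ... apply the remote transaction, then COMMIT ...

-- Check how far replay has progressed, per origin or for all origins.
SELECT pg_replication_progress_get('node_a_to_node_b');
SELECT * FROM pg_get_replication_progress();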
Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests
On 04/20/2015 05:21 AM, Michael Paquier wrote: I have just made a run of the TAP tests of pg_rewind on my raspberry PI 1 (hamster), where the tests are very slow, and I noticed that it takes up to 10s to get confirmation from standby that it has caught up with the changes from master, and 5s to get confirmation that standby has been promoted. So the tests in HEAD would be broken without this patch, and 30s gives visibly enough room. Ok, applied, with some refactoring: I put the sleep-retry loop into a separate function. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests
On Wed, Apr 22, 2015 at 8:40 PM, Heikki Linnakangas wrote: > On 04/20/2015 05:21 AM, Michael Paquier wrote: >> >> I have just made a run of the TAP tests of pg_rewind on my raspberry >> PI 1 (hamster), where the tests are very slow, and I noticed that it >> takes up to 10s to get confirmation from standby that it has caught up >> with the changes from master, and 5s to get confirmation that standby >> has been promoted. So the tests in HEAD would be broken without this >> patch, and 30s gives visibly enough room. > > > Ok, applied, with some refactoring: I put the sleep-retry loop into a > separate function. Thanks. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rounding to even for numeric data type
On Wed, Apr 22, 2015 at 9:30 PM, Pedro Gimeno wrote: > Dean Rasheed wrote, On 2015-03-28 10:01: >> On 28 March 2015 at 05:16, Andrew Gierth wrote: "Tom" == Tom Lane writes: >>> >>> Tom> I think the concern over backwards compatibility here is probably >>> Tom> overblown; but if we're sufficiently worried about it, a possible >>> Tom> compromise is to invent a numeric_rounding_mode GUC, so that >>> Tom> people could get back the old behavior if they really care. >>> >>> I only see one issue with this, but it's a nasty one: do we really want >>> to make all numeric operations that might do rounding stable rather than >>> immutable? >>> >> >> Yeah, making all numeric functions non-immutable seems like a really bad >> idea. > > Would it be possible to make it an unchangeable per-cluster or > per-database setting, kinda like how encoding behaves? Wouldn't that > allow to keep the functions immutable? Rounding is not something that can be enforced at the database or server level but at data type level, see for example the differences already present for double precision and numeric as mentioned upthread. In short, you could keep rounding functions immutable by having one data type with a different rounding method. At least that's an idea. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
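For reference, the data-type-level difference already visible today, which is what makes a separate type (or a per-type rounding rule) a coherent way to keep the functions immutable: numeric rounds ties away from zero, while double precision goes through the C library and, on most platforms, rounds ties to even.

SELECT x,
       round(x::numeric) AS round_numeric,
       round(x::float8)  AS round_float8
FROM (VALUES (0.5), (1.5), (2.5), (3.5)) AS t(x);
--   x  | round_numeric | round_float8
--  0.5 |             1 |            0
--  1.5 |             2 |            2
--  2.5 |             3 |            2
--  3.5 |             4 |            4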
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
At 2015-04-18 12:28:36 +0530, amit.kapil...@gmail.com wrote: > > I think you have missed to address the below point: > > >> 4) prefetch Updated patch attached, as well as the diff against the earlier version just to make it easier to see the exact change I made (which is to copy the skip logic from lazy_scan_heap, as you suggested). I'm not entirely convinced that this change is worthwhile, but it's easy to make and difficult to argue against, so here it is. (I did test, and it seems to work OK after the change.) Amit, Tomas, thanks again for your review comments. -- Abhijit >From 3edb5426292d6097cb66339b865e99bf4f766646 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen Date: Fri, 26 Dec 2014 12:37:13 +0530 Subject: Add pgstatbloat to pgstattuple --- contrib/pgstattuple/Makefile | 4 +- contrib/pgstattuple/pgstatbloat.c | 387 + contrib/pgstattuple/pgstattuple--1.2--1.3.sql | 18 + .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} | 18 +- contrib/pgstattuple/pgstattuple.control| 2 +- doc/src/sgml/pgstattuple.sgml | 135 +++ 6 files changed, 560 insertions(+), 4 deletions(-) create mode 100644 contrib/pgstattuple/pgstatbloat.c create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (73%) diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index 862585c..d7d27a5 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -1,10 +1,10 @@ # contrib/pgstattuple/Makefile MODULE_big = pgstattuple -OBJS = pgstattuple.o pgstatindex.o $(WIN32RES) +OBJS = pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES) EXTENSION = pgstattuple -DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql PGFILEDESC = "pgstattuple - tuple-level statistics" REGRESS = pgstattuple diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c new file mode 100644 index 000..b19459a --- /dev/null +++ b/contrib/pgstattuple/pgstatbloat.c @@ -0,0 +1,387 @@ +/* + * contrib/pgstattuple/pgstatbloat.c + * + * Abhijit Menon-Sen + * Portions Copyright (c) 2001,2002 Tatsuo Ishii (from pgstattuple) + * + * Permission to use, copy, modify, and distribute this software and + * its documentation for any purpose, without fee, and without a + * written agreement is hereby granted, provided that the above + * copyright notice and this paragraph and the following two + * paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT, + * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED + * OF THE POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS + * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, + * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. 
+ */ + +#include "postgres.h" + +#include "access/visibilitymap.h" +#include "access/transam.h" +#include "access/xact.h" +#include "access/multixact.h" +#include "access/htup_details.h" +#include "catalog/namespace.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "storage/bufmgr.h" +#include "storage/freespace.h" +#include "storage/procarray.h" +#include "storage/lmgr.h" +#include "utils/builtins.h" +#include "utils/tqual.h" +#include "commands/vacuum.h" + +PG_FUNCTION_INFO_V1(pgstatbloat); + +/* + * tuple_percent, dead_tuple_percent and free_percent are computable, + * so not defined here. + */ +typedef struct pgstatbloat_output_type +{ + uint64 table_len; + uint64 tuple_count; + uint64 misc_count; + uint64 tuple_len; + uint64 dead_tuple_count; + uint64 dead_tuple_len; + uint64 free_space; + uint64 total_pages; + uint64 scanned_pages; +} pgstatbloat_output_type; + +static Datum build_output_type(pgstatbloat_output_type *stat, + FunctionCallInfo fcinfo); +static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo); + +/* + * build a pgstatbloat_output_type tuple + */ +static Datum +build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo) +{ +#define NCOLUMNS 10 +#define NCHARS 32 + + HeapTuple tuple; + char *values[NCOLUMNS]; + char values_buf[NCOLUMNS][NCHARS]; + int i; + double tuple_percent; + double dead_tuple_percent; + double free_percent; /* free/reusable space in % */ + double scanned_percent; + TupleDesc tupdesc; + AttInMetadata *attinmeta; + + /* Build a
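A hedged usage sketch for the function the (truncated) patch above adds: the call below assumes the SQL-level signature takes a relation name, which is not shown in the quoted portion of the patch, and the table name is illustrative.

CREATE EXTENSION pgstattuple;   -- version 1.3 once the patch is applied
-- or, if already installed: ALTER EXTENSION pgstattuple UPDATE TO '1.3';

SELECT * FROM pgstatbloat('pgbench_accounts');
-- Per pgstatbloat_output_type above, the row should report table_len,
-- tuple_count/tuple_len, dead_tuple_count/dead_tuple_len, free_space,
-- and scanned_pages out of total_pages, from which bloat percentages
-- can be derived without scanning every page.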
Re: [HACKERS] Idea: closing the loop for "pg_ctl reload"
Ah sorry, didn't realize I top posted. I'll test this new one. Payal. On Apr 21, 2015 10:23 PM, "Jan de Visser" wrote: > On April 21, 2015 09:34:51 PM Jan de Visser wrote: > > On April 21, 2015 09:01:14 PM Jan de Visser wrote: > > > On April 21, 2015 07:32:05 PM Payal Singh wrote: > ... snip ... > > > > Urgh. It appears you are right. Will fix. > > > > jan > > Attached a new attempt. This was one from the category 'I have no idea how > that ever worked", but whatever. For reference, this is how it looks for me > (magic man-behind-the-curtain postgresql.conf editing omitted): > > jan@wolverine:~/Projects/postgresql$ initdb -D data > ... Bla bla bla ... > jan@wolverine:~/Projects/postgresql$ pg_ctl -D data -l logfile start > server starting > jan@wolverine:~/Projects/postgresql$ tail -5 logfile > LOG: database system was shut down at 2015-04-21 22:03:33 EDT > LOG: database system is ready to accept connections > LOG: autovacuum launcher started > jan@wolverine:~/Projects/postgresql$ pg_ctl -D data reload > server signaled > jan@wolverine:~/Projects/postgresql$ tail -5 logfile > LOG: database system was shut down at 2015-04-21 22:03:33 EDT > LOG: database system is ready to accept connections > LOG: autovacuum launcher started > LOG: received SIGHUP, reloading configuration files > jan@wolverine:~/Projects/postgresql$ pg_ctl -D data reload > server signaled > pg_ctl: Reload of server with PID 14656 FAILED > Consult the server log for details. > jan@wolverine:~/Projects/postgresql$ tail -5 logfile > LOG: autovacuum launcher started > LOG: received SIGHUP, reloading configuration files > LOG: received SIGHUP, reloading configuration files > LOG: syntax error in file > "/home/jan/Projects/postgresql/data/postgresql.conf" > line 1, near end of line > LOG: configuration file > "/home/jan/Projects/postgresql/data/postgresql.conf" > contains errors; no changes were applied > jan@wolverine:~/Projects/postgresql$
Re: [HACKERS] Row security violation error is misleading
On 21 April 2015 at 22:21, Dean Rasheed wrote: > On 21 April 2015 at 20:50, Stephen Frost wrote: >> Thanks a lot for this. Please take a look at the attached. > > I've given this a quick read-through, and it looks good to me. The > interaction of permissive and restrictive policies from hooks matches > my expectations, and it's a definite improvement having tests for RLS > hooks. > > The only thing I spotted was that the file comment for > test_rls_hooks.c needs updating. > So re-reading this, I spotted what looks like another (pre-existing) bug. In process_policies() there's a loop over all the policies, collecting quals and with_check_quals, then a test at the end to use the USING quals for the WITH CHECK quals if there were no with_check_quals. I think we want to instead do that test inside the loop -- i.e., for each policy, if there is no with_check_qual *for that policy*, use its USING qual instead. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
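A minimal illustration of the intended per-policy behaviour (object names are made up): a policy that specifies only USING should have that same expression applied as its WITH CHECK qual, and with the fix this defaulting happens for each policy individually rather than once across all policies.

CREATE TABLE docs (owner name, body text);
ALTER TABLE docs ENABLE ROW LEVEL SECURITY;

-- No explicit WITH CHECK, so the USING expression should be reused for it.
CREATE POLICY p1 ON docs USING (owner = current_user);

-- Executed as a role subject to RLS:
INSERT INTO docs VALUES (current_user, 'mine');        -- passes the implied check
INSERT INTO docs VALUES ('someone_else', 'not mine');  -- rejected by the implied check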
Re: [HACKERS] Row security violation error is misleading
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: > On 21 April 2015 at 22:21, Dean Rasheed wrote: > > On 21 April 2015 at 20:50, Stephen Frost wrote: > >> Thanks a lot for this. Please take a look at the attached. > > > > I've given this a quick read-through, and it looks good to me. The > > interaction of permissive and restrictive policies from hooks matches > > my expections, and it's a definite improvement having tests for RLS > > hooks. > > > > The only thing I spotted was that the file comment for > > test_rls_hooks.c needs updating. > > So re-reading this, I spotted what looks like another (pre-existing) > bug. In process_policies() there's a loop over all the policies, > collecting quals and with_check_quals, then a test at the end to use > the USING quals for the WITH CHECK quals if there were no > with_check_quals. I think we want to instead do that test inside the > loop -- i.e., for each policy, if there is no with_check_qual *for > that policy*, use it's USING qual instead. Agreed, the USING -> WITH CHECK copy should be happening for all policies independently, not wholesale at the end. I've updated my tree and am testing. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Freeze avoidance of very large table.
On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby wrote: > On 4/21/15 3:21 PM, Robert Haas wrote: >> It's possible that we could use this infrastructure to freeze more >> aggressively in other circumstances. For example, perhaps VACUUM >> should freeze any page it intends to mark all-visible. That's not a >> guaranteed win, because it might increase WAL volume: setting a page >> all-visible does not emit an FPI for that page, but freezing any tuple >> on it would, if the page hasn't otherwise been modified since the last >> checkpoint. Even if that were no issue, the freezing itself must be >> WAL-logged. But if we could somehow get to a place where all-visible >> => frozen, then autovacuum would never need to visit all-visible >> pages, a huge win. > > I don't know how bad the extra WAL traffic would be; we'd obviously need to > incur it eventually, so it's a question of how common it is for a page to go > all-visible but then go not-all-visible again before freezing. It would > presumably be far more traffic than some form of a FrozenMap though... Yeah, maybe. The freeze record contains details for each TID, while the freeze map bit would only need to be set once for the whole page. I wonder if the format of that record could be optimized somehow. >> We could also attack the problem from the other end. Instead of >> trying to set the bits on the individual tuples, we could decide that >> whenever a page is marked all-visible, we regard it as frozen >> regardless of the bits set or not set on the individual tuples. >> Anybody who wants to modify the page must freeze any unfrozen tuples >> "for real" before clearing the visibility map bit. This would have >> the same end result as the previous idea: all-visible would >> essentially imply frozen, and autovacuum could ignore those pages >> categorically. > > Pushing what's currently background work onto foreground processes doesn't > seem like a good idea... When you phrase it that way, no, but pushing work that otherwise would need to be done right now off to a future time that may never arrive sounds like a good idea. Today, we freeze the page -- rewriting it -- and then keep scanning those all-frozen pages every X number of transactions to make sure they are really all-frozen. In this system, we'd eliminate the repeated scanning and defer the freeze work until the page actually gets modified again. But that might never happen, in which case we never have to do the work at all. >> I'm not saying those ideas don't have problems, because they do. But >> I think they are worth further exploring. The main reason I gave up >> on that is because Heikki was working on the XID-to-LSN mapping stuff. >> That seemed like a better approach than either of the above, so as >> long as Heikki was working on that, there wasn't much reason to pursue >> more lowbrow approaches. Clearly, though, we need to do something >> about this. Freezing is a big problem for lots of users. > > Did XID-LSN die? I see at the bottom of the thread it was returned with > feedback; I guess Heikki just hasn't had time and there's no major blockers? > From what I remember this is probably a better solution, but if it's not > going to make it into 9.6 then we should probably at least look further into > a FM. Heikki said he'd lost enthusiasm for it, but he wasn't too specific about his reasons, IIRC. I guess maybe just that it got complicated, and he wasn't sure it was correct. >> All that having been said, I don't think adding a new fork is a good >> approach. 
We already have problems pretty commonly where our >> customers complain about running out of inodes. Adding another fork >> for every table would exacerbate that problem considerably. > > Andres' idea of adding this to the VM may work well to handle that. It would > double the size of the VM, but it would still be a ratio of 32,000:1 > compared to heap size, or 2MB for a 64GB table. Yes, that's got some potential. It would mean pg_upgrade would have to remove all existing visibility maps when upgrading to the new version, or rewrite them into the new format. But it otherwise seems promising. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
On Wed, Mar 25, 2015 at 9:51 PM, Kouhei Kaigai wrote: > The attached patch adds GetForeignJoinPaths call on make_join_rel() only when > 'joinrel' is actually built and both of child relations are managed by same > FDW driver, prior to any other built-in join paths. > I adjusted the hook definition a little bit, because jointype can be > reproduced > using SpecialJoinInfo. Right? > > Probably, it will solve the original concern towards multiple calls of FDW > handler in case when it tries to replace an entire join subtree with a > foreign- > scan on the result of remote join query. > > How about your opinion? A few random cosmetic problems: - The hunk in allpaths.c is useless. - The first hunk in fdwapi.h contains an extra space before the closing parenthesis. And then: + else if (scan->scanrelid == 0 && +(IsA(scan, ForeignScan) || IsA(scan, CustomScan))) + varno = INDEX_VAR; Suppose scan->scanrelid == 0 but the scan type is something else? Is that legal? Is varno == 0 the correct outcome in that case? More later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
Robert Haas wrote: > It's possible that we could use this infrastructure to freeze > more aggressively in other circumstances. For example, perhaps > VACUUM should freeze any page it intends to mark all-visible. > That's not a guaranteed win, because it might increase WAL > volume: setting a page all-visible does not emit an FPI for that > page, but freezing any tuple on it would, if the page hasn't > otherwise been modified since the last checkpoint. Even if that > were no issue, the freezing itself must be WAL-logged. But if we > could somehow get to a place where all-visible => frozen, then > autovacuum would never need to visit all-visible pages, a huge > win. That would eliminate full-table scan vacuums, right? It would do that by adding incremental effort and WAL to the "normal" autovacuum run to eliminate the full table scan and the associated mass freeze WAL-logging? It's hard to see how that would not be an overall win. > We could also attack the problem from the other end. Instead of > trying to set the bits on the individual tuples, we could decide > that whenever a page is marked all-visible, we regard it as > frozen regardless of the bits set or not set on the individual > tuples. Anybody who wants to modify the page must freeze any > unfrozen tuples "for real" before clearing the visibility map > bit. This would have the same end result as the previous idea: > all-visible would essentially imply frozen, and autovacuum could > ignore those pages categorically. Besides putting work into the foreground that could be done in the background, that sounds more complicated. Also, there is no ability to "pace" the freeze load or use scheduled jobs to shift the work to off-peak hours. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow SQL/plpgsql functions to accept record
On 4/20/15 2:04 PM, David G. Johnston wrote: SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v); ERROR: record type has not been registered While it may not be necessary to solve both problems I suspect they have the same underlying root cause - specifically the separation of concerns between the planner and the executor. I don't think they're related at all. C functions have the ability to accept a record, so the executor must be able to support it. It's just that SQL and plpgsql functions don't have that support. I suspect that's just because no one has gotten around to it. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
Etsuro, * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote: > postgres=# select * from ft1 for update; > ERROR: could not find junk tableoid1 column > > I think this is a bug. Attached is a patch fixing this issue. Pushed, thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian wrote: > >> But if the entire table is very hot, I think that that is just another way >> of saying that autovacuum is horribly misconfigured. I think the purpose of > > Well, we have to assume there are many misconfigured configurations --- > autovacuum isn't super-easy to configure, so we can't just blame the > user if this makes things worse. In fact, page pruning was designed > specifically for cases where autovacuum wasn't running or couldn't keep > up. Well, autovacuum isn't currently considering HOT pruning part of its job at all. It's hard to call it "misconfigured" when there's literally *no* way to configure it "correctly". If you update less than autovacuum_vacuum_scale_factor fraction of the table and then never update another row, autovacuum will never run. Ever. Every select will forevermore need to follow hot chains on that table. Until eventually transaction wraparound forces a vacuum on that table, if that ever happens. Possibly autovacuum could be adjusted to count how many selects are happening on the table and decide to vacuum it when the cost of the selects following the dead tuples is balanced by the cost of doing a vacuum. But that's not something included in the design of autovacuum today. The original design of tuple storage was aimed at optimizing the steady state where most tuples were not recently updated. It guaranteed that, except for tuples that were in the process of being updated or were recently updated, a tuple read didn't have to read the CLOG, didn't have to follow any chains, didn't have to do any I/O or other work other than to read the bits on the tuple itself. When a tuple is updated, it's put into a state where everyone who comes along has to do extra work, but as soon as practical the hint bits get set and that extra work stops. We had similar discussions about setting hint bits in the past. I'm not sure why HOT pruning is the focus now because I actually think hint bit setting is a larger source of I/O in innocent-looking selects even today. And it's a major headache: people are always being surprised that their selects cause lots of I/O and slow down dramatically after a big update or data load has finished. It's characterized as "why is the database writing everything twice" (and saying it's actually writing everything three times doesn't make people feel better). In the new age of checksums with hint bit logging I wonder if it's an even bigger issue. It occurs to me that generating these dirty pages isn't really that expensive individually. It's only that there's a sudden influx of a large number of dirty pages that causes them to get translated immediately into filesystem I/O. Perhaps we should dirty pages on hint bit updates and do HOT pruning only to the extent it can be done without causing I/O. Of course it's hard to tell that in advance, but maybe something like "if the current buffer had to be fetched and caused a dirty buffer to be evicted then skip hot pruning and don't dirty it for any hint bit updates" would at least mean that once the select fills up its share of buffers with dirty buffers it stops dirtying more. It would dirty pages only as fast as bgwriter or checkpoints manage to write them out. That sounds a bit weird, but I think the right solution should have that combination of properties. It should guarantee that hint bits get set and hot chains pruned within some length of time, but that no one select causes a storm of dirty buffers that then need to be flushed to disk.
-- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
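The "selects cause lots of I/O after a big data load" effect described above is easy to observe; the following is a sketch (table name and row count are arbitrary) using EXPLAIN (ANALYZE, BUFFERS) to show a read-only query dirtying pages purely by setting hint bits:

CREATE TABLE bulk_t AS
    SELECT g AS id, md5(g::text) AS payload
    FROM generate_series(1, 1000000) AS g;
CHECKPOINT;    -- make sure the freshly loaded pages are clean on disk

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM bulk_t;
-- The Buffers line typically shows many pages "dirtied" (and often
-- "written", because the sequential scan's ring buffer has to evict its
-- own dirty pages) even though the query only reads. A second run shows
-- few or none, since the hint bits are set by then.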
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 4/21/15 4:07 PM, Peter Eisentraut wrote: On 4/21/15 4:45 PM, Jim Nasby wrote: In order for a background worker to keep up with some of the workloads that have been presented as counterexamples, you'd need multiple background workers operating in parallel and preferring to work on certain parts of a table. That would require a lot more sophisticated job management than we currently have for, say, autovacuum. My thought was that the foreground queries would send page IDs to the bgworker via a shmq. If the queries have to do much waiting at all on IO then I'd expect the bgworker to be able to keep pace with a bunch of them since it's just grabbing buffers that are already in the pool (and only those in the pool; it wouldn't make sense for it to pull it back from the kernel, let alone disk). We'd need to code this so that if a queue fills up the query doesn't block; we just skip that opportunity to prune. I think that'd be fine. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Authenticating from SSL certificates
Keenan, * kee...@thebrocks.net (kee...@thebrocks.net) wrote: > I'm looking into connecting to postgres using authentication from client > certificates. [1] Nice! Glad to hear of more users of that capability. :) > The documentation states that the common name (aka CN) is read from the > certificate and used as the user's login (aka auth_user). > The problem is the common name is typically the user's full name. A field > like email address would contain a more computer-friendly identifier. This is why we have the pg_ident mapping capability. I realize that file has to be generated, but at that point it's really just a string, no? That said, I'm not against this capability in general, but we'd need to make sure it doesn't lock us into OpenSSL. Heikki's been working on changing the SSL code to allow other libraries to be used, which is great, and I'm slightly worried this might make that more difficult. The other issue is that we'd need to be very clear in the documentation that any users of this capability have to verify with their CA that they aren't going to end up with the same value in whichever field is used for distinct individuals; otherwise, the CA might unknowingly issue two certs with the same value and you would then be unable to distinguish between those two certs, and both certs would have access to the account. That's already an issue in the SSL world when using "real" CAs (that is, ones outside of your own organization) and, really, we would do better to support including *more* fields than just the CN to address that issue. As such, perhaps we should support having a *list* of fields to use and then we combine them in some way in the mapping file. That would allow users to, say, include the issuer and the CN, and perhaps the serial number if they want. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] moving from contrib to bin
On Wed, Apr 22, 2015 at 09:18:40AM +0200, Andres Freund wrote: > Peter, Michael, > > On 2015-04-22 16:13:15 +0900, Michael Paquier wrote: > > All the patches have been committed, finishing the work on this thread. > > Many thanks for that effort! And pg_upgrade thanks you. ;-) -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Row security violation error is misleading
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: > On 21 April 2015 at 22:21, Dean Rasheed wrote: > > On 21 April 2015 at 20:50, Stephen Frost wrote: > >> Thanks a lot for this. Please take a look at the attached. > > > > I've given this a quick read-through, and it looks good to me. The > > interaction of permissive and restrictive policies from hooks matches > > my expections, and it's a definite improvement having tests for RLS > > hooks. > > > > The only thing I spotted was that the file comment for > > test_rls_hooks.c needs updating. > > So re-reading this, I spotted what looks like another (pre-existing) > bug. In process_policies() there's a loop over all the policies, > collecting quals and with_check_quals, then a test at the end to use > the USING quals for the WITH CHECK quals if there were no > with_check_quals. I think we want to instead do that test inside the > loop -- i.e., for each policy, if there is no with_check_qual *for > that policy*, use it's USING qual instead. Pushed with those changes, please take a look and test! Thanks again for all of your help with this. I'm going to be looking over that second patch with an eye towards getting it in very soon, it's been kicking around for far longer than it should have been and that's my fault, apologies about that. Stephen signature.asc Description: Digital signature
Re: [HACKERS] Freeze avoidance of very large table.
On Wed, Apr 22, 2015 at 11:09 AM, Kevin Grittner wrote: > Robert Haas wrote: >> It's possible that we could use this infrastructure to freeze >> more aggressively in other circumstances. For example, perhaps >> VACUUM should freeze any page it intends to mark all-visible. >> That's not a guaranteed win, because it might increase WAL >> volume: setting a page all-visible does not emit an FPI for that >> page, but freezing any tuple on it would, if the page hasn't >> otherwise been modified since the last checkpoint. Even if that >> were no issue, the freezing itself must be WAL-logged. But if we >> could somehow get to a place where all-visible => frozen, then >> autovacuum would never need to visit all-visible pages, a huge >> win. > > That would eliminate full-table scan vacuums, right? It would do > that by adding incremental effort and WAL to the "normal" > autovacuum run to eliminate the full table scan and the associated > mass freeze WAL-logging? It's hard to see how that would not be an > overall win. Yes and yes. In terms of an overall win, this design loses when the tuples that have been recently marked all-visible are going to get updated again in the near future. In that case, the effort we spend to freeze them is wasted. I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or alternatively followed by "VACUUM FREEZE". The VACUUM generated 4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that is, 113 times more. So changing every VACUUM to act like VACUUM FREEZE would be quite expensive. We'll still come out ahead if those tuples are going to stick around long enough that they would have eventually gotten frozen anyway, but if they get deleted again the loss is pretty significant. Incidentally, the reason for the large difference is that when Heikki created the visibility map, it wasn't necessary for the WAL records that set the visibility map bits to bump the page LSN, because it was just a hint anyway. When I made the visibility-map crash-safe, I went to some pains to preserve that property. Therefore, a regular VACUUM does not emit full page images for the heap pages - it does for the visibility map pages themselves, but there aren't very many of those. In this example, the relation itself was 512MB, so you can see that adding freezing to the mix roughly doubles the I/O cost. Either way we have to write half a gig of dirty data pages, but in one case we also have to write an additional half a gig of WAL. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
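A sketch of how that comparison can be reproduced from psql, assuming the pgbench tables were just initialized with pgbench -i -s 40 -n; the functions below are the 9.x names (pg_current_wal_lsn() and pg_wal_lsn_diff() on later releases):

SELECT pg_current_xlog_location() AS before \gset
VACUUM pgbench_accounts;            -- compare against: VACUUM FREEZE pgbench_accounts;
SELECT pg_size_pretty(
           pg_xlog_location_diff(pg_current_xlog_location(), :'before')) AS wal_generated;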
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Tue, Apr 21, 2015 at 04:36:53PM -0400, Robert Haas wrote: > > Keep in mind there's a disconnect between dirtying a page and writing it > > to storage. A page could remain dirty for a long time in the buffer > > cache. This writing of sequential pages would occur at checkpoint time > > only, which seems the wrong thing to optimize. If some other process > > needs to evict pages to make room to read some other page in, surely > > it's going to try one page at a time, not write "many sequential dirty > > pages." > > Well, for a big sequential scan, we use a ring buffer, so we will > typically be evicting the pages that we ourselves read in moments > before. So in this case we would do a lot of sequential writes of > dirty pages. Ah, yes, this again supports the prune-then-skip approach, rather than doing the first X% pruneable pages seen. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow SQL/plpgsql functions to accept record
On 04/22/2015 11:29 AM, Jim Nasby wrote: On 4/20/15 2:04 PM, David G. Johnston wrote: SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v); ERROR: record type has not been registered While it may not be necessary to solve both problems I suspect they have the same underlying root cause - specifically the separation of concerns between the planner and the executor. I don't think they're related at all. C functions have the ability to accept a record, so the executor must be able to support it. It's just that SQL and plpgsql functions don't have that support. I suspect that's just because no one has gotten around to it. Well, that's assuming everyone else thinks it would be a good idea. Maybe they do, but I wouldn't assume it. The answer in the past has been to use more dynamically typed languages such as perl for which this problem is well suited. There are actually several problems: first, given an arbitrary record plpgsql has no easy and efficient way of finding out what field names it has. Second, even if it has such knowledge it has no way of using it - it's not like JavaScript where you can use a text value as a field name. And third, it has no way of creating variables of the right type to hold extracted values. All of these could possibly be overcome, but it would not be a small piece of work, I suspect. Given that plperl buys you all of that already (just try this, for example) people might think it not worth the extra trouble.
create function rkeys(record) returns text[] language plperl as $$
my $rec = shift;
return [ keys %$rec ];
$$;
select unnest(rkeys(r)) from (select * from pg_class limit 1) r;
cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
On Tue, Apr 21, 2015 at 10:33 PM, Kouhei Kaigai wrote: > [ new patch ] A little more nitpicking: ExecInitForeignScan() and ExecInitCustomScan() could declare currentRelation inside the if (scanrelid > 0) block instead of in the outer scope. I'm not too excited about the addition of GetFdwHandlerForRelation, which is a one-line function used in one place. It seems like we don't really need that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
On 04/22/2015 05:33 PM, Robert Haas wrote: On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby wrote: On 4/21/15 3:21 PM, Robert Haas wrote: I'm not saying those ideas don't have problems, because they do. But I think they are worth further exploring. The main reason I gave up on that is because Heikki was working on the XID-to-LSN mapping stuff. That seemed like a better approach than either of the above, so as long as Heikki was working on that, there wasn't much reason to pursue more lowbrow approaches. Clearly, though, we need to do something about this. Freezing is a big problem for lots of users. Did XID-LSN die? I see at the bottom of the thread it was returned with feedback; I guess Heikki just hasn't had time and there's no major blockers? From what I remember this is probably a better solution, but if it's not going to make it into 9.6 then we should probably at least look further into a FM. Heikki said he'd lost enthusiasm for it, but he wasn't too specific about his reasons, IIRC. I guess maybe just that it got complicated, and he wasn't sure it was correct. I'd like to continue working on that when I get around to it. Or even better if someone else continues it :-). The thing that made me nervous about that approach is that it made the LSN of each page critical information. If you somehow zeroed out the LSN, you could no longer tell which pages are frozen and which are not. I'm sure it could be made to work - and I got it working to some degree anyway - but it's a bit scary. It's similar to the multixid changes in 9.3: multixids also used to be data that you can just zap at restart, and when we changed the rules so that you lose data if you lose multixids, we got trouble. Now, LSNs are much simpler, and there wouldn't be anything like the multioffset/member SLRUs that you'd have to keep around forever or vacuum, but still.. I would feel safer if we added a completely new "epoch" counter to the page header, instead of reusing LSNs. But as we all know, changing the page format is a problem for in-place upgrade, and takes some space too. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Row security violation error is misleading
On 22 April 2015 at 17:02, Stephen Frost wrote: > Pushed with those changes, please take a look and test! > Excellent, thanks! Will test. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests
On April 22, 2015 11:14:08 AM Heikki Linnakangas wrote: > On 04/16/2015 06:51 AM, Alvaro Herrera wrote: > > Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use? I > > thought the -w would wait until promotion has taken effect, so there's > > no need to sleep additional time. > > -w is not supported with pg_ctl promote. Only start, stop and restart. > It's accepted, but it doesn't do anything. Which isn't very nice... I'm futzing with the pg_ctl cmd line parameters for my patch notifying pg_ctl that restart didn't work. I'll fix this as well. > > - Heikki jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 4/22/15 11:37 AM, Jim Nasby wrote: > On 4/21/15 4:07 PM, Peter Eisentraut wrote: >> On 4/21/15 4:45 PM, Jim Nasby wrote: >> In order for a background worker to keep up with some of the workloads >> that have been presented as counterexamples, you'd need multiple >> background workers operating in parallel and preferring to work on >> certain parts of a table. That would require a lot more sophisticated >> job management than we currently have for, say, autovacuum. > > My thought was that the foreground queries would send page IDs to the > bgworker via a shmq. If the queries have to do much waiting at all on IO > then I'd expect the bgworker to be able to keep pace with a bunch of > them since it's just grabbing buffers that are already in the pool (and > only those in the pool; it wouldn't make sense for it to pull it back > from the kernel, let alone disk). > > We'd need to code this so that if a queue fills up the query doesn't > block; we just skip that opportunity to prune. I think that'd be fine. I think a "to-clean-up map" would work better. But basically we need a way to remember where to clean up later if we're not going to do it in the foreground. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Tue, Apr 21, 2015 at 05:07:52PM -0400, Peter Eisentraut wrote: > On 4/21/15 4:45 PM, Jim Nasby wrote: > > This comment made me wonder... has anyone considered handing the pruning > > work off to a bgworker, at least for SELECTs? That means the selects > > themselves wouldn't be burdened by the actual prune work, only in > > notifying the bgworker. While that's not going to be free, presumably > > it's a lot cheaper... > > The nice thing about having foreground queries do the light cleanup is > that they can work in parallel and naturally hit the interesting parts > of the table first. > > In order for a background worker to keep up with some of the workloads > that have been presented as counterexamples, you'd need multiple > background workers operating in parallel and preferring to work on > certain parts of a table. That would require a lot more sophisticated > job management than we currently have for, say, autovacuum. Well, the visibility map tells us where _not_ to clean up, so using another map to tell us _where_ to clean up might make sense. However, the density of the map might be low enough that a list makes more sense, as you suggested. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Idea: closing the loop for "pg_ctl reload"
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature:      tested, failed
Spec compliant:          not tested
Documentation:           tested, failed

Error in postgresql.conf gives the expected result on pg_ctl reload, although errors in pg_hba.conf file don't. Like before, reload completes fine without any information that pg_hba failed to load and only information is present in the log file. I'm assuming pg_ctl reload should prompt user if file fails to load irrespective of which file it is - postgresql.conf or pg_hba.conf. There is no documentation change so far, but I guess that's not yet necessary. gmake check passed all tests.

The new status of this patch is: Waiting on Author -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
On Fri, Mar 13, 2015 at 8:02 PM, Tom Lane wrote: > Robert Haas writes: >> On Fri, Mar 13, 2015 at 2:31 PM, Tom Lane wrote: >>> I don't object to the concept, but I think that is a pretty bad place >>> to put the hook call: add_paths_to_joinrel is typically called multiple >>> (perhaps *many*) times per joinrel and thus this placement would force >>> any user of the hook to do a lot of repetitive work. > >> Interesting point. I guess the question is whether some or all >> callers are going to actually *want* a separate call for each >> invocation of add_paths_to_joinrel(), or whether they'll be happy to >> operate on the otherwise-complete path list. > > Hmm. You're right, it's certainly possible that some users would like to > operate on each possible pair of input relations, rather than considering > the joinrel "as a whole". Maybe we need two hooks, one like your patch > and one like I suggested. Let me attempt to summarize subsequent discussion on this thread by saying the hook location that you proposed (just before set_cheapest) has not elicited any enthusiasm from anyone else. In a nutshell, the problem is that a single callback for a large join problem is just fine if there are no special joins involved, but in any other scenario, nobody knows how to use a hook at that location for anything useful. To push down a join to the remote server, you've got to figure out how to emit an SQL query for it. To execute it with a custom join strategy, you've got to know which of those joins should have inner join semantics vs. left join semantics. A hook/callback in make_join_rel() or in add_paths_to_joinrel() makes that relatively straightforward. Otherwise, it's not clear what to do, short of copy-and-pasting join_search_one_level(). If you have a suggestion, I'd like to hear it. If not, I'm going to press forward with the idea of putting the relevant logic in either add_paths_to_joinrel(), as previously proposed, or perhaps up one level in make_one_rel(). Either way, if you don't need to be called multiple times per joinrel, you can stash a flag inside whatever you hang off of the joinrel's fdw_private and return immediately on every call after the first. I think that's cheap enough that we shouldn't get too stressed about it: for FDWs, we only call the hook at all if everything in the joinrel uses the same FDW, so it won't get called at all except for joinrels where it's likely to win big; for custom joins, multiple calls are quite likely to be useful and necessary, and if the hook burns too much CPU time for the query performance you get out of it, that's the custom-join provider's fault, not ours. The current patch takes this approach one step further and attempts FDW pushdown only once per joinrel. It does that because, while postgres_fdw DOES need the jointype and a valid innerrel/outerrel breakdown to figure out what query to generate, it does NOT need every possible breakdown; rather, the first one is as good as any other. But this might not be true for a non-PostgreSQL remote database. So I think it's better to call the hook every time and let the hook return without doing anything if it wants. I'm still not totally sure whether make_one_rel() is better than add_paths_to_joinrel(). The current patch attempts to split the difference by doing FDW pushdown from make_one_rel() and custom joins from add_paths_to_joinrel(). I dunno why; if possible, those two things should happen in the same place.
Doing it in make_one_rel() makes for fewer arguments and fewer repetitive calls, but that's not much good if you would have had a use for the extra arguments that aren't computed until we get down to add_paths_to_joinrel(). I'm not sure whether that's the case or not. The latest version of the postgres_fdw patch doesn't seem to mind not having extra_lateral_rels, but I'm wondering if that's busted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
Robert Haas wrote: > I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or > alternatively followed by "VACUUM FREEZE". The VACUUM generated > 4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that > is, 113 times more. Essentially a bulk load. OK, so if you bulk load data and then vacuum it before updating 100% of it, this approach will generate a lot more WAL than we currently do. Of course, if you don't VACUUM FREEZE after a bulk load and then are engaged in a fairly normal OLTP workload with peak and off-peak cycles, you are currently almost certain to hit a point during peak OLTP load where you begin to sequentially scan all tables, rewriting them in place, with WAL logging. Incidentally, this tends to flush a lot of your "hot" data out of cache, increasing disk reads. The first time I hit this "interesting" experience in production it was so devastating, and generated so many user complaints, that I never again considered a bulk load complete until I had run VACUUM FREEZE on it -- although I was sometimes able to defer that to an off-peak window of time. In other words, for the production environments I managed, the only value of that number is in demonstrating the importance of using unlogged COPY followed by VACUUM FREEZE for bulk-loading and capturing a fresh base backup upon completion. A better way to use pgbench to measure WAL size cost might be to initialize, VACUUM FREEZE to set a "long term baseline", and do a reasonable length run with crontab running VACUUM FREEZE periodically (including after the run was complete) versus doing the same with plain VACUUM (followed by a VACUUM FREEZE at the end?). Comparing the total WAL sizes generated following the initial load and VACUUM FREEZE would give a more accurate picture of the impact on an OLTP load, I think. > We'll still come out ahead if those tuples are going to stick > around long enough that they would have eventually gotten frozen > anyway, but if they get deleted again the loss is pretty > significant. Perhaps my perception is biased by having worked in an environment where the vast majority of tuples (both in terms of tuple count and byte count) were never updated and were only eligible for deletion after a period of years. Our current approach is pretty bad in such an environment, at least if you try to leave all vacuuming to autovacuum. I'll admit that we were able to work around the problems by running VACUUM FREEZE every night for most databases. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas wrote: > The thing that made me nervous about that approach is that it made the LSN > of each page critical information. If you somehow zeroed out the LSN, you > could no longer tell which pages are frozen and which are not. I'm sure it > could be made to work - and I got it working to some degree anyway - but > it's a bit scary. It's similar to the multixid changes in 9.3: multixids > also used to be data that you can just zap at restart, and when we changed > the rules so that you lose data if you lose multixids, we got trouble. Now, > LSNs are much simpler, and there wouldn't be anything like the > multioffset/member SLRUs that you'd have to keep around forever or vacuum, > but still.. LSNs are already pretty critical. If they're in the future, you can't flush those pages. Ever. And if they're wrong in either direction, crash recovery is broken. But it's still worth thinking about ways that we could make this more robust. I keep coming back to the idea of treating any page that is marked as all-visible as frozen, and deferring freezing until the page is again modified. The big downside of this is that if the page is set as all-visible and then immediately thereafter modified, it sucks to have to freeze when the XIDs in the page are still present in CLOG. But if we could determine from the LSN that the XIDs in the page are new enough to still be considered valid, then we could skip freezing in those cases and only do it when the page is "old". That way, if somebody zeroed out the LSN (why, oh why?) the worst that would happen is that we'd do some extra freezing when the page was next modified. > I would feel safer if we added a completely new "epoch" counter to the page > header, instead of reusing LSNs. But as we all know, changing the page > format is a problem for in-place upgrade, and takes some space too. Yeah. We have a serious need to reduce the size of our on-disk format. On a TPC-C-like workload Jan Wieck recently tested, our data set was 34% larger than another database at the beginning of the test, and 80% larger by the end of the test. And we did twice the disk writes. See "The Elephants in the Room.pdf" at https://sites.google.com/site/robertmhaas/presentations -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
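[A very rough sketch of the LSN-based check described above, only to make the shape of the test concrete. The threshold variable is entirely hypothetical and does not exist anywhere; it stands in for "the LSN before which CLOG can no longer be trusted for these XIDs".]

/* Hypothetical: clog_truncation_horizon_lsn is an invented name. */
if (PageIsAllVisible(page) &&
    PageGetLSN(page) >= clog_truncation_horizon_lsn)
{
    /* XIDs on this page are still resolvable via CLOG: defer freezing. */
}
else
{
    /* Page is "old": freeze its tuples before dirtying it again. */
}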
Re: [HACKERS] Freeze avoidance of very large table.
On Wed, Apr 22, 2015 at 2:23 PM, Kevin Grittner wrote: > Robert Haas wrote: >> I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or >> alternatively followed by "VACUUM FREEZE". The VACUUM generated >> 4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that >> is, 113 times more. > > Essentially a bulk load. OK, so if you bulk load data and then > vacuum it before updating 100% of it, this approach will generate a > lot more WAL than we currently do. Of course, if you don't VACUUM > FREEZE after a bulk load and then are engaged in a fairly normal > OLTP workload with peak and off-peak cycles, you are currently > almost certain to hit a point during peak OLTP load where you begin > to sequentially scan all tables, rewriting them in place, with WAL > logging. Incidentally, this tends to flush a lot of your "hot" > data out of cache, increasing disk reads. The first time I hit > this "interesting" experience in production it was so devastating, > and generated so many user complaints, that I never again > considered a bulk load complete until I had run VACUUM FREEZE on it > -- although I was sometimes able to defer that to an off-peak > window of time. > > In other words, for the production environments I managed, the only > value of that number is in demonstrating the importance of using > unlogged COPY followed by VACUUM FREEZE for bulk-loading and > capturing a fresh base backup upon completion. A better way to use > pgbench to measure WAL size cost might be to initialize, VACUUM > FREEZE to set a "long term baseline", and do a reasonable length > run with crontab running VACUUM FREEZE periodically (including > after the run was complete) versus doing the same with plain VACUUM > (followed by a VACUUM FREEZE at the end?). Comparing the total WAL > sizes generated following the initial load and VACUUM FREEZE would > give a more accurate picture of the impact on an OLTP load, I > think. Sure, that would be a better test. But I'm pretty sure the impact will still be fairly substantial. >> We'll still come out ahead if those tuples are going to stick >> around long enough that they would have eventually gotten frozen >> anyway, but if they get deleted again the loss is pretty >> significant. > > Perhaps my perception is biased by having worked in an environment > where the vast majority of tuples (both in terms of tuple count and > byte count) were never updated and were only eligible for deletion > after a period of years. Our current approach is pretty bad in > such an environment, at least if you try to leave all vacuuming to > autovacuum. I'll admit that we were able to work around the > problems by running VACUUM FREEZE every night for most databases. Yeah. And that breaks down when you have very big databases with a high XID consumption rate, because the mostly-no-op VACUUM FREEZE runs for longer than you can tolerate. I'm not saying we don't need to fix this problem; we clearly do. I'm just saying that we've got to be careful not to harm other scenarios in the process. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication and WAL archive interactions
On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas wrote: > Note that it's a bit complicated to set up that scenario today. Archiving is > never enabled in recovery mode, so you'll need to use a custom cron job or > something to maintain the archive that C uses. The files will not > automatically flow from B to the second archive. With the patch we're > discussing, however, it would be easy: just set archive_mode='always' in B. Hmm, I see. But if C never replays the last, partial segment from the old timeline, how does it follow the timeline switch? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CINE for ALTER TABLE ... ADD COLUMN
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Seeing this when trying to apply the patch:

Patching file src/backend/commands/tablecmds.c using Plan A...
Hunk #1 FAILED at 328.
Hunk #2 succeeded at 2294 (offset 11 lines).
Hunk #3 FAILED at 3399.
Hunk #4 FAILED at 3500.
Hunk #5 succeeded at 4658 with fuzz 1 (offset 65 lines).
Hunk #6 succeeded at 4753 (offset 66 lines).
Hunk #7 succeeded at 4989 with fuzz 2 (offset 66 lines).
Hunk #8 succeeded at 5003 (offset 69 lines).
Hunk #9 succeeded at 5017 (offset 69 lines).
Hunk #10 succeeded at 5033 (offset 69 lines).

The new status of this patch is: Waiting on Author -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Greg Stark wrote: > And it's a major headache, people are always being surprised that > their selects cause lots of I/O and slow down dramatically after > a big update or data load has finished. It's characterized as > "why is the database writing everything twice" (and saying it's > actually writing everything three times doesn't make people feel > better). When I looked at the life-cycle of a heap tuple in a database I was using, I found that (ignoring related index access and ignoring WAL-file copying, etc., for our backups), each tuple that existed long enough to freeze and be eventually deleted caused a lot of writes. (1) WAL log the insert. (2) Write the tuple. (3) Hint and rewrite the tuple. (4) WAL log the freeze of the tuple. (5) Rewrite the frozen tuple. (6) WAL-log the delete. (7) Rewrite the deleted tuple. (8) Prune and rewrite the page. (9) Free line pointers and rewrite the page. If I was lucky some of the writes could be combined in cache because they happened close enough together. Also, one could hope that not too much of the WAL-logging involved full page writes to the WAL -- again, keeping steps close together in time helps with that. If all of (1) through (5) are done in quick succession, you save two physical writes of the heap page and save one full page write to WAL. If steps (7) through (9) are done in quick succession, you save two more physical writes to the heap. This is part of what makes the aggressive incremental freezing being discussed on a nearby thread appealing -- at least for some workloads. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication and WAL archive interactions
On Tue, Apr 21, 2015 at 8:30 PM, Michael Paquier wrote: > This .partial segment renaming is something that we > should let the archive_command manage with its internal logic. This strikes me as equivalent to saying "we don't know how to make this work right, but maybe our users will know". That never works out. As things stand, we have a situation where the archive_command examples in our documentation are known to be flawed. They don't fsync the file, and they'll write a partial file and then, when rerun, fail to copy the full file because there's already something there. Efforts have been made to fix these problems (see the pg_copy thread), but they haven't been completed yet, nor have we even documented the issues with the commands recommended by the documentation. Let's please not throw anything else on the pile of things we're expecting users to somehow "get right". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication and WAL archive interactions
On 04/22/2015 09:30 PM, Robert Haas wrote: On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas wrote: Note that it's a bit complicated to set up that scenario today. Archiving is never enabled in recovery mode, so you'll need to use a custom cron job or something to maintain the archive that C uses. The files will not automatically flow from B to the second archive. With the patch we're discussing, however, it would be easy: just set archive_mode='always' in B. Hmm, I see. But if C never replays the last, partial segment from the old timeline, how does it follow the timeline switch? At timeline switch, we copy the old segment to the new timeline, and start writing where we left off. So the WAL from the old timeline is found in the segment nominally belonging to the new timeline. For example, imagine that we perform point-in-time recovery to WAL position 0/1237E568, on timeline 1. That falls within segment 00010012. Then we end recovery, and switch to timeline 2. After the switch, and some more WAL-logged actions, we'll have these files in pg_xlog: 00010011 00010012 00020012 00020013 00020014 Note that there are two segments ending in "12". They both have the same contents up to offset 0x37E568, corresponding to the switch point 0/1237E568. After that, the contents diverge: the segment on the new timeline contains a checkpoint/end-of-recovery record at that point, followed by new WAL belonging to the new timeline. Recovery knows about that, so that if you set recovery target to timeline 2, and it needs the WAL at the beginning of segment 12 (still belonging to timeline 1), it will try restoring both "00010012" and "00020012". - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow SQL/plpgsql functions to accept record
On Wed, Apr 22, 2015 at 11:20 AM, Andrew Dunstan wrote: > > On 04/22/2015 11:29 AM, Jim Nasby wrote: >> >> On 4/20/15 2:04 PM, David G. Johnston wrote: >>> >>> >>> SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v); >>> ERROR: record type has not been registered >>> >>> While it may not be necessary to solve both problems I suspect they have >>> the same underlying root cause - specifically the separation of concerns >>> between the planner and the executor. >> >> I don't think they're related at all. C functions have the ability to >> accept a record, so the executor must be able to support it. It's just that >> SQL and plpgsql functions don't have that support. I suspect that's just >> because no one has gotten around to it. > > Well, that's assuming everyone else thinks it would be a good idea. Maybe > they do, but I wouldn't assume it. > > The answer in the past has been to use more dynamically typed languages such > as perl for which this problem is well suited. I've never really been satisfied with this answer. The two languages with really good core support are perl and python, neither of which are my cup of tea. Also, there is no chance of inlining any of the dynamic languages which has serious performance ramifications. In a perfect world, pl/v8 would be a good choice for a general purpose database support language as javascript has a number of properties that make it attractive for integration. Even if we had that though (and it's unlikely), a large percentage of postgres devs, including myself, dislike coding in any language except sql plus extensions. That being said, I think json types with their associated API, given that they are core types, will ultimately handle these types of problems. That way, at least, we can avoid adding syntax and functionality that will basically do the same thing. This reminds me a little bit of the json_build() vs enhanced row() syntax we discussed some time back. I didn't say so at the time, but for posterity, I think you were right...json_build() is working fine for building arbitrary record types and moving a record to json and deconstructing it should work just as well. merlin merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
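[As a concrete illustration of the point above about the json API standing in for ad-hoc record handling, a small sketch follows; the table and values are invented for the example.]

-- Build an "anonymous record" as json; no registered row type is needed.
SELECT json_build_object('id', 42, 'name', 'fred') AS j;

-- Move a real row into json and deconstruct it again into a typed row.
CREATE TABLE widget (id int, name text);
INSERT INTO widget VALUES (1, 'alice');

SELECT row_to_json(w) FROM widget w;                          -- {"id":1,"name":"alice"}
SELECT * FROM json_populate_record(NULL::widget, '{"id":2,"name":"bob"}');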
Re: [HACKERS] Streaming replication and WAL archive interactions
On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas wrote: > On 04/22/2015 09:30 PM, Robert Haas wrote: >> On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas >> wrote: >>> >>> Note that it's a bit complicated to set up that scenario today. Archiving >>> is >>> never enabled in recovery mode, so you'll need to use a custom cron job >>> or >>> something to maintain the archive that C uses. The files will not >>> automatically flow from B to the second archive. With the patch we're >>> discussing, however, it would be easy: just set archive_mode='always' in >>> B. >> >> >> Hmm, I see. But if C never replays the last, partial segment from the >> old timeline, how does it follow the timeline switch? > > At timeline switch, we copy the old segment to the new timeline, and start > writing where we left off. So the WAL from the old timeline is found in the > segment nominally belonging to the new timeline. Check. > For example, imagine that perform point-in-time recovery to WAL position > 0/1237E568, on timeline 1. That falls within segment > 00010012. Then we end recovery, and switch to timeline 2. > After the switch, and some more WAL-logged actions, we'll have these files > in pg_xlog: > > 00010011 > 00010012 > 00020012 > 00020013 > 00020014 Is the 00010012 file a "partial" segment of the sort you're proposing to no longer achive? > Note that there are two segments ending in "12". They both have the same > point up to offset 0x37E568, corresponding to the switch point 0/1237E568. > After that, the contents diverge: the segment on the new timeline contains a > checkpoint/end-of-recovery record at that point, followed by new WAL > belonging to the new timeline. Check. > Recovery knows about that, so that if you set recovery target to timeline 2, > and it needs the WAL at the beginning of segment 12 (still belonging to > timeline 1), it will try to restoring both "00010012" and > "00020012". What if you set the recovery target to timeline 3? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cache lookup error for shell type creation with incompatible output function (DDL deparsing bug)
Michael Paquier wrote: > Hi all, > > I just bumped into the following problem in HEAD (1c41e2a): > =# create type my_array_float (INPUT = array_in, OUTPUT = array_out, > ELEMENT = float4, INTERNALLENGTH = 32); > ERROR: XX000: cache lookup failed for type 0 > LOCATION: format_type_internal, format_type.c:135 Argh. > The fix consists in being sure that typoid uses the OID of the type > shell created, aka the OID stored in adress.ObjectID. Attached is a > patch with a regression test checking for shell creation with > incompatible input/output functions (failure caused by output function > here though) able to check this code path. Thanks, pushed. I changed the line to be just below TypeShellMake, which seems slightly better aligned to the comment just below, and in fact it matches what DefineRange already uses. I also modified a couple of other places involving TypeCreate: one did not have the "typoid = addr.objectId" assignment at all; while the other did, it actually seems misplaced because at that spot we expect that typoid is already correct. So I added an assert to the other place and changed that assignment to an assert, too. I scanned the rest of the bogus commit and couldn't find any other place on which I made the same mistake. Thanks for reporting, -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 19/04/15 01:24, Michael Paquier wrote: On Sat, Apr 18, 2015 at 8:38 PM, Michael Paquier wrote: On Fri, Apr 17, 2015 at 10:54 PM, Petr Jelinek wrote: On 10/04/15 06:46, Michael Paquier wrote: 13) Some regression tests with pg_tablesample_method would be welcome. Not sure what you mean by that. I meant a sanity check on pg_tablesample_method to be sure that tsminit, tsmnextblock and tsmnexttuple are always defined as they are mandatory functions. So the idea is to add a query like and and to be sure that it returns no rows: SELECT tsmname FROM pg_tablesample_method WHERE tsminit IS NOT NULL OR tsmnextblock IS NOT NULL OR tsmnexttuple IS NOT NULL; Yesterday was a long day. I meant IS NULL and not IS NOT NULL, but I am sure you guessed it that way already.. Yes I guessed that and it's very reasonable request, I guess it should look like the attached (I don't want to send new version of everything just for this). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services >From d059c792a1e864e38f321cea51169ea0b3c5caab Mon Sep 17 00:00:00 2001 From: Petr Jelinek Date: Wed, 22 Apr 2015 21:28:29 +0200 Subject: [PATCH 7/7] tablesample: add catalog regression test --- src/test/regress/expected/tablesample.out | 15 +++ src/test/regress/sql/tablesample.sql | 13 + 2 files changed, 28 insertions(+) diff --git a/src/test/regress/expected/tablesample.out b/src/test/regress/expected/tablesample.out index 271638d..04e5eb8 100644 --- a/src/test/regress/expected/tablesample.out +++ b/src/test/regress/expected/tablesample.out @@ -209,6 +209,21 @@ SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5); ERROR: syntax error at or near "TABLESAMPLE" LINE 1: ...CT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPL... ^ +-- catalog sanity +SELECT * +FROM pg_tablesample_method +WHERE tsminit IS NULL + OR tsmseqscan IS NULL + OR tsmpagemode IS NULL + OR tsmnextblock IS NULL + OR tsmnexttuple IS NULL + OR tsmend IS NULL + OR tsmreset IS NULL + OR tsmcost IS NULL; + tsmname | tsmseqscan | tsmpagemode | tsminit | tsmnextblock | tsmnexttuple | tsmexaminetuple | tsmend | tsmreset | tsmcost +-++-+-+--+--+-++--+- +(0 rows) + -- done DROP TABLE test_tablesample CASCADE; NOTICE: drop cascades to 2 other objects diff --git a/src/test/regress/sql/tablesample.sql b/src/test/regress/sql/tablesample.sql index 2f4b7de..7b3eb9b 100644 --- a/src/test/regress/sql/tablesample.sql +++ b/src/test/regress/sql/tablesample.sql @@ -57,5 +57,18 @@ SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEATABLE (1); SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5); +-- catalog sanity + +SELECT * +FROM pg_tablesample_method +WHERE tsminit IS NULL + OR tsmseqscan IS NULL + OR tsmpagemode IS NULL + OR tsmnextblock IS NULL + OR tsmnexttuple IS NULL + OR tsmend IS NULL + OR tsmreset IS NULL + OR tsmcost IS NULL; + -- done DROP TABLE test_tablesample CASCADE; -- 1.9.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication and WAL archive interactions
On 04/22/2015 10:21 PM, Robert Haas wrote: On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas wrote: For example, imagine that perform point-in-time recovery to WAL position 0/1237E568, on timeline 1. That falls within segment 00010012. Then we end recovery, and switch to timeline 2. After the switch, and some more WAL-logged actions, we'll have these files in pg_xlog: 00010011 00010012 00020012 00020013 00020014 Is the 00010012 file a "partial" segment of the sort you're proposing to no longer achive? If you did pure archive recovery, with no streaming replication involved, then no. If it was created by streaming replication, and the replication had not filled the whole segment yet, then yes, it would be a partial segment. Note that there are two segments ending in "12". They both have the same point up to offset 0x37E568, corresponding to the switch point 0/1237E568. After that, the contents diverge: the segment on the new timeline contains a checkpoint/end-of-recovery record at that point, followed by new WAL belonging to the new timeline. Check. Recovery knows about that, so that if you set recovery target to timeline 2, and it needs the WAL at the beginning of segment 12 (still belonging to timeline 1), it will try to restoring both "00010012" and "00020012". What if you set the recovery target to timeline 3? It depends how timeline 3 was created. If timeline 3 was forked off from timeline 2, then recovery would find it. If it was forked off directly from timeline 1, then no. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cache lookup error for shell type creation with incompatible output function (DDL deparsing bug)
Alvaro Herrera wrote: > Michael Paquier wrote: > > Hi all, > > > > I just bumped into the following problem in HEAD (1c41e2a): > > =# create type my_array_float (INPUT = array_in, OUTPUT = array_out, > > ELEMENT = float4, INTERNALLENGTH = 32); > > ERROR: XX000: cache lookup failed for type 0 > > LOCATION: format_type_internal, format_type.c:135 I also wanted to point out, but forgot, that this command is not really creating a shell type -- it's creating a full-blown type, because there are args. Shell types are created when no args are given. This happens to fail due to the internal creation of the shell type because of the failure to look up the i/o functions, but as far as I can see the original code should fail in any type creation in pretty much the same way (didn't actually test that); not completely sure why this wasn't more visible in regression tests. I simply removed the word "shell" in the provided test case in the committed version, anyway. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
On 20/04/15 17:50, Heikki Linnakangas wrote: On 03/15/2015 09:07 PM, Petr Jelinek wrote: Slightly updated version of the patch. Mainly rebased against current master (there were several conflicts) and fixed some typos, no real functional change. I also attached initial version of the API sgml doc. I finally got around to have another round of review on this. I fixed a couple of little bugs, did some minor edition on comments etc. See attached. It is also available in my git repository at git://git.postgresql.org/git/users/heikki/postgres.git, branch "seqam", if you want to look at individual changes. It combines your patches 1 and 4, I think those need to be applied together. I haven't looked at the DDL changes yet. Thanks! I'm fairly happy with the alloc API now. I'm not sure it's a good idea for the AM to access the sequence tuple directly, though. I would seem cleaner if it only manipulated the amdata Datum. But it's not too bad as it is. Yeah, I was thinking about this myself I just liked sending 10 parameters to the function less than this. The division of labour between sequence.c and the AM, in the init and the get/set_state functions, is a bit more foggy: * Do we really need a separate amoptions() method and an init() method, when the only caller to amoptions() is just before the init() method? The changes in extractRelOptions suggest that it would call am_reloptions for sequences too, but that doesn't actually seem to be happening. Hmm yes it should and I am sure it did at some point, must have messed it during one of the rebases :( And it's the reason why we need separate API function. postgres=# create sequence fooseq using local with (garbage=option); CREATE SEQUENCE Invalid options probably should throw an error. * Currently, the AM's init function is responsible for basic sanity checking, like min < max. It also has to extract the RESTART value from the list of options. I think it would make more sense to move that to sequence.c. The AM should only be responsible for initializing the 'amdata' portion, and for handling any AM-specific options. If the AM doesn't support some options, like MIN and MAX value, it can check them and throw an error, but it shouldn't be responsible for just passing them through from the arguments to the sequence tuple. Well then we need to send restart as additional parameter to the init function as restart is normally not stored anywhere. The checking is actually if new value is withing min/max but yes that can be done inside sequence.c I guess. * It might be better to form the sequence tuple before calling the init function, and pass the ready-made tuple to it. All the other API functions deal with the tuple (after calling sequence_read_tuple), so that would be more consistent. The init function would need to deconstruct it to modify it, but we're not concerned about performance here. Right, this is actually more of a relic of the custom columns implementation where I didn't want to build the tuple with NULLs for columns that might have been specified as NOT NULL, but with the amdata approach it can create the tuple safely. * The transformations of the arrays in get_state() and set_state() functions are a bit complicated. The seqam_get_state() function returns two C arrays, and pg_sequence_get_state() turns them into a text[] array. Why not construct the text[] array directly in the AM? 
I guess it's a bit more convenient to the AM, if the pg_sequence_get_state() do that, but if that's an issue, you could create generic helper function to turn two C string arrays into text[], and call that from the AM. Yeah that was exactly the reasoning. Helper function works for me (will check what Álvaro's suggested, maybe it can be moved somewhere and reused). seq = (FormData_pg_sequence *) GETSTRUCT(sequence_read_tuple(seqh)); for (i = 0; i < count; i++) { if (pg_strcasecmp(keys[i], "last_value") == 0) seq->last_value = DatumGetInt64(DirectFunctionCall1(int8in, CStringGetDatum(values[i]))); else if (pg_strcasecmp(keys[i], "is_called") == 0) seq->is_called = DatumGetBool(DirectFunctionCall1(boolin, CStringGetDatum(values[i]))); else ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid state key \"%s\" for local sequence", keys[i]))); } sequence_save_tuple(seqh, NULL, true); If that error happens after having already processed a "last_value" or "is_called" entry, you have already modified the on-disk tuple. AFAICS that's the only instance of that bug, but sequence_read_tuple - modify tuple in place - sequence_save_tuple pattern is quite unsafe in general. If you modify a tuple directly in a Buffer, you should have a critical section around it. It would make sense to start a critical section in sequence_read_tuple(), except that not all callers want to modify the
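[A sketch of the safer ordering implied by the review comment above: parse and validate every key first, and only then touch the on-disk tuple. The helper names follow the snippet quoted in the message; WAL and critical-section handling are deliberately glossed over here.]

int64   last_value = 0;
bool    is_called = false;
bool    have_last_value = false;
bool    have_is_called = false;
int     i;

for (i = 0; i < count; i++)
{
    if (pg_strcasecmp(keys[i], "last_value") == 0)
    {
        last_value = DatumGetInt64(DirectFunctionCall1(int8in,
                                        CStringGetDatum(values[i])));
        have_last_value = true;
    }
    else if (pg_strcasecmp(keys[i], "is_called") == 0)
    {
        is_called = DatumGetBool(DirectFunctionCall1(boolin,
                                        CStringGetDatum(values[i])));
        have_is_called = true;
    }
    else
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("invalid state key \"%s\" for local sequence",
                        keys[i])));
}

/* Nothing has been written yet; only now is it safe to modify the tuple. */
seq = (FormData_pg_sequence *) GETSTRUCT(sequence_read_tuple(seqh));
if (have_last_value)
    seq->last_value = last_value;
if (have_is_called)
    seq->is_called = is_called;
sequence_save_tuple(seqh, NULL, true);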
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote: > On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian wrote: > > Well, we have to assume there are many misconfigured configurations --- > > autovacuum isn't super-easy to configure, so we can't just blame the > > user if this makes things worse. In fact, page pruning was designed > > spefically for cases where autovacuum wasn't running our couldn't keep > > up. > > Well autovacuum isn't currently considering HOT pruning part of its > job at all. It's hard to call it "misconfigured" when there's > literally *no* way to configure it "correctly". Good point, but doesn't vacuum remove the need for pruning as it removes all the old rows? > If you update less than autovacuum_vacuum_scale_factor fraction of the > table and then never update another row autovacuum will never run. > Ever. Every select will forevermore need to follow hot chains on that > table. Until eventually transaction wraparound forces a vacuum on that > table if that ever happens. Yes, that is a very good point, and it matches my concerns. Of course, Simon's concern is to avoid overly-aggressive pruning where the row is being pruned but will soon be modified, making the prune, and its WAL volume, undesirable. We have to consider both cases in any final solution. > Possibly autovacuum could be adjusted to count how many selects are > happening on the table and decide to vacuum it when the cost of the > selects following the dead tuples is balanced by the cost of doing a > vacuum. But that's not something included in the design of autovacuum > today. Well, autovacuum is also going to clean indexes, which seem like overkill for pruning HOT updates. > The original design of tuple storage was aimed at optimizing the > steady state where most tuples were not recently updated. It > guaranteed that except for tuples that were in the process of being > updated or were recently updated a tuple read didn't have to read the > CLOG, didn't have to follow any chains, didn't have to do any I/O or > other work other than to read the bits on the tuple itself. When a > tuple is updated it's put into a state where everyone who comes along > has to do extra work but as soon as practical the hint bits get set > and that extra work stops. Yes, Simon is right that doing everything as-soon-as-possible is not optimal. I think the trick is knowing when we should give up waiting for something else to dirty the page and prune it. > We had similar discussions about setting hint bits in the past. I'm > not sure why HOT pruning is the focus now because I actually think > hint bit setting is a larger source of I/O in innocent looking selects > even today. And it's a major headache, people are always being > surprised that their selects cause lots of I/O and slow down > dramatically after a big update or data load has finished. It's > characterized as "why is the database writing everything twice" (and > saying it's actually writing everything three times doesn't make > people feel better). In the new age of checksums with hint bit logging > I wonder if it's even a bigger issue. What would be the downside of only doing pruning during SELECT hint bit setting? Hinting is delayed by long-running transactions, but so is pruning. I assume you can do more pruning than setting all_visible hints because the old prunable rows are older by definition, but I am unclear how much older they are. 
FYI, while hint bit setting causes page writes, it does not cause WAL writes unless you have wal_log_hints set or page-level checksums are enabled. By doing pruning at the same time as hint bit setting, you are sharing the same page write, but are generating more WAL. Of course, if you are setting all-visible, then you are by definition waiting longer to prune than before, and this might be enough to make it a win for all use cases. You wouldn't never-prune in a read-only workload because your hint bits would eventually cause the pruning. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila wrote: > I have implemented this idea (note that I have to expose a new API > shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and > we sum_mq* to call shm_mq_detach) and apart this I have fixed other > problems reported on this thread: > > 1. Execution of initPlan by master backend and then pass the > required PARAM_EXEC parameter values to workers. > 2. Avoid consuming dsm's by freeing the parallel context after > the last tuple is fetched. > 3. Allow execution of Result node in worker backend as that can > be added as a gating filter on top of PartialSeqScan. > 4. Merged parallel heap scan descriptor patch > > To apply the patch, please follow below sequence: > > HEAD Commit-Id: 4d930eee > parallel-mode-v9.patch [1] > assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in > the patch) > parallel_seqscan_v14.patch (Attached with this mail) Thanks, this version looks like an improvement. However, I still see some problems: - I believe the separation of concerns between ExecFunnel() and ExecEndFunnel() is not quite right. If the scan is shut down before it runs to completion (e.g. because of LIMIT), then I think we'll call ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I think you probably need to create a static subroutine that is called both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each case cleaning up whatever resources remain. - InitializeParallelWorkers() still mixes together general parallel executor concerns with concerns specific to parallel sequential scan (e.g. EstimatePartialSeqScanSpace). We have to eliminate everything that assumes that what's under a funnel will be, specifically, a partial sequential scan. To make this work properly, I think we should introduce a new function that recurses over the plan tree and invokes some callback for each plan node. I think this could be modeled on this code from ExplainNode(), beginning around line 1593: /* initPlan-s */ if (planstate->initPlan) ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es); /* lefttree */ if (outerPlanState(planstate)) ExplainNode(outerPlanState(planstate), ancestors, "Outer", NULL, es); /* righttree */ if (innerPlanState(planstate)) ExplainNode(innerPlanState(planstate), ancestors, "Inner", NULL, es); /* special child plans */ switch (nodeTag(plan)) { /* a bunch of special cases */ } /* subPlan-s */ if (planstate->subPlan) ExplainSubPlans(planstate->subPlan, ancestors, "SubPlan", es); The new function would do the same sort of thing, but instead of explaining each node, it would invoke a callback for each node. Possibly explain.c could use it instead of having hard-coded logic. Possibly it should use the same sort of return-true convention as expression_tree_walker, query_tree_walker, and friends. So let's call it planstate_tree_walker. Now, instead of directly invoking logic specific to parallel sequential scan, it should call planstate_tree_walker() on its lefttree and pass a new function ExecParallelEstimate() as the callback. That function ignores any node that's not parallel aware, but when it sees a partial sequential scan (or, in the future, some a parallel bitmap scan, parallel sort, or what have you) it does the appropriate estimation work. When ExecParallelEstimate() finishes, we InitializeParallelDSM(). Then, we call planstate_tree_walker() on the lefttree again, and this time we pass another new function ExecParallelInitializeDSM(). 
Like the previous one, that ignores the callbacks from non-parallel nodes, but if it hits a parallel node, then it fills in the parallel bits (i.e. ParallelHeapScanDesc for a partial sequential scan). - shm_mq_from_handle() is probably reasonable, but can we rename it shm_mq_get_queue()? - It's hard to believe this is right: + if (parallelstmt->inst_options) + receiver = None_Receiver; Really? Flush the tuples if there are *any instrumentation options whatsoever*? At the very least, that doesn't look too future-proof, but I'm suspicious that it's outright incorrect. - I think ParallelStmt probably shouldn't be defined in parsenodes.h. That file is included in a lot of places, and adding all of those extra #includes there doesn't seem like a good idea for modularity reasons even if you don't care about partial rebuilds. Something that includes a shm_mq obviously isn't a "parse" node in any meaningful sense anyway. - I don't think you need both setup cost and startup cost. Starting up more workers isn't particularly more expensive than starting up fewer of them, because most of the overhead is in waiting for them to actually start, and if the number of workers is reasonable, then they'll all be doing that in parallel with each other. I suggest removing parallel_startup_cost and keeping para
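[A rough sketch of what the proposed planstate_tree_walker() could look like, following the ExplainNode() structure quoted in the message above and the return-true convention of expression_tree_walker(). The node-specific "special child plans" handling is left as a stub, and the exact callback convention is an assumption; this is not a committed API.]

static bool
planstate_tree_walker(PlanState *planstate,
                      bool (*walker) (PlanState *ps, void *context),
                      void *context)
{
    ListCell   *lc;

    /* initPlan-s */
    foreach(lc, planstate->initPlan)
    {
        SubPlanState *sps = (SubPlanState *) lfirst(lc);

        if (walker(sps->planstate, context))
            return true;
    }

    /* lefttree */
    if (outerPlanState(planstate) &&
        walker(outerPlanState(planstate), context))
        return true;

    /* righttree */
    if (innerPlanState(planstate) &&
        walker(innerPlanState(planstate), context))
        return true;

    /* special child plans (Append, SubqueryScan, etc.) would go here */

    /* subPlan-s */
    foreach(lc, planstate->subPlan)
    {
        SubPlanState *sps = (SubPlanState *) lfirst(lc);

        if (walker(sps->planstate, context))
            return true;
    }

    return false;
}

A callback such as ExecParallelEstimate() would then recurse by calling planstate_tree_walker() on each node it is handed, doing its estimation work only for parallel-aware nodes.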
Re: [HACKERS] Streaming replication and WAL archive interactions
On Wed, Apr 22, 2015 at 3:34 PM, Heikki Linnakangas wrote: > On 04/22/2015 10:21 PM, Robert Haas wrote: >> On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas >> wrote: >>> For example, imagine that perform point-in-time recovery to WAL position >>> 0/1237E568, on timeline 1. That falls within segment >>> 00010012. Then we end recovery, and switch to timeline 2. >>> After the switch, and some more WAL-logged actions, we'll have these >>> files >>> in pg_xlog: >>> >>> 00010011 >>> 00010012 >>> 00020012 >>> 00020013 >>> 00020014 >> >> >> Is the 00010012 file a "partial" segment of the sort >> you're proposing to no longer achive? > > If you did pure archive recovery, with no streaming replication involved, > then no. If it was created by streaming replication, and the replication had > not filled the whole segment yet, then yes, it would be a partial segment. Why the difference? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
On Tue, Apr 14, 2015 at 8:35 PM, Kyotaro HORIGUCHI wrote: > Before suppressing the symptom, I doubt the necessity and/or > validity of giving foreign tables an ability to be a parent. Is > there any reasonable usage for the ability? Gee, I don't see why that would be unreasonable or invalid -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Bruce Momjian wrote: > On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote: > > On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian wrote: > > > Well, we have to assume there are many misconfigured configurations --- > > > autovacuum isn't super-easy to configure, so we can't just blame the > > > user if this makes things worse. In fact, page pruning was designed > > > spefically for cases where autovacuum wasn't running our couldn't keep > > > up. > > > > Well autovacuum isn't currently considering HOT pruning part of its > > job at all. It's hard to call it "misconfigured" when there's > > literally *no* way to configure it "correctly". > > Good point, but doesn't vacuum remove the need for pruning as it removes > all the old rows? Sure. The point, I think, is to make autovacuum runs of some sort that don't actually vacuum but only do HOT-pruning. Maybe this is a reasonable solution to the problem that queries don't prune anymore after Simon's patch. If we made autovac HOT-prune periodically, we could have read-only queries prune only already-dirty pages. Of course, that would need further adjustments to default number of autovac workers, I/O allocation, etc. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rounding to even for numeric data type
Dean Rasheed wrote, On 2015-03-28 10:01: > On 28 March 2015 at 05:16, Andrew Gierth wrote: >>> "Tom" == Tom Lane writes: >> >> Tom> I think the concern over backwards compatibility here is probably >> Tom> overblown; but if we're sufficiently worried about it, a possible >> Tom> compromise is to invent a numeric_rounding_mode GUC, so that >> Tom> people could get back the old behavior if they really care. >> >> I only see one issue with this, but it's a nasty one: do we really want >> to make all numeric operations that might do rounding stable rather than >> immutable? >> > > Yeah, making all numeric functions non-immutable seems like a really bad idea. Would it be possible to make it an unchangeable per-cluster or per-database setting, kinda like how encoding behaves? Wouldn't that allow to keep the functions immutable? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
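[To make the immutability concern concrete, a small illustration; the table is invented for the example. The first query shows the current numeric behavior (ties round away from zero), and the index shows why a changeable rounding mode would be hazardous.]

-- Current numeric behavior: ties round away from zero.
SELECT round(0.5), round(1.5), round(2.5);   -- 1, 2, 3

-- round() being immutable is what makes this index legal; if a GUC could
-- change the rounding mode after the fact, entries built under the old
-- mode would no longer match freshly computed values.
CREATE TABLE measurements (v numeric);
CREATE INDEX measurements_round_idx ON measurements (round(v));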
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Wed, Apr 22, 2015 at 06:07:00PM -0300, Alvaro Herrera wrote: > > Good point, but doesn't vacuum remove the need for pruning as it removes > > all the old rows? > > Sure. The point, I think, is to make autovacuum runs of some sort that > don't actually vacuum but only do HOT-pruning. Maybe this is a > reasonable solution to the problem that queries don't prune anymore > after Simon's patch. If we made autovac HOT-prune periodically, we > could have read-only queries prune only already-dirty pages. Of course, > that would need further adjustments to default number of autovac > workers, I/O allocation, etc. Do we really want to make vacuum more complex for this? vacuum does have the delay settings we would need though. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
On 4/22/15 1:24 PM, Robert Haas wrote: I keep coming back to the idea of treating any page that is marked as all-visible as frozen, and deferring freezing until the page is again modified. The big downside of this is that if the page is set as all-visible and then immediately thereafter modified, it sucks to have to freeze when the XIDs in the page are still present in CLOG. But if we could determine from the LSN that the XIDs in the page are new enough to still be considered valid, then we could skip freezing in those cases and only do it when the page is "old". That way, if somebody zeroed out the LSN (why, oh why?) the worst that would happen is that we'd do some extra freezing when the page was next modified. Maybe freezing a page as part of making it not all-visible wouldn't be that horrible, even without LSN. For one, we already know that every tuple is visible, so no MVCC checks needed. That's probably a significant savings over current freezing. If we're marking a page as no longer all-visible, that means we're already dirtying it and generating WAL for it (likely including a FPI). We may be able to consolidate all of this into a new WAL record that's a lot more efficient than what we currently do for freezing. I suspect we wouldn't need to log each TID we're freezing, for starters. Even if we did though, we could at least combine all that into one WAL message that just contains an array of TIDs or LPs. I think we could actually proactively freeze tuples during vacuum too, at least if we're about to mark the page as all-visible. Though, with Robert's HEAP_XMIN_FROZEN change we could be a lot more aggressive about freezing during VACUUM, certainly for pages we're already dirtying, especially if we can keep the WAL cost of that down. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow SQL/plpgsql functions to accept record
On 4/22/15 2:12 PM, Merlin Moncure wrote: That being said, I think json types with their associated API, given that they are core types, will ultimately handle these types of problems. That way, at least, we can avoid adding syntax and functionality that will basically do the same thing. This reminds me a little bit of the json_build() vs enhanced row() syntax we discussed some time back. I didn't say so at the time, but for posterity, I think you were right...json_build() is working fine for building arbitrary record types and moving a record to json and deconstructing it should work just as well. The one part I don't care for in that is it seems rather inefficient to cast something to JSON just so we can do things we really should be able to do with a record. But perhaps it's not all that costly. As for allowing SQL and plpgsql functions to accept a record, I think our JSON functionality just provided plenty of reason we should allow accepting them, even if you can't do much with it: you *can* hand it to row_to_json(), which does allow you to do something useful with it. So it seems reasonable to me that we should at least accept it as a function argument. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
On Tue, Apr 21, 2015 at 7:57 AM, Andres Freund wrote: > I'm not 100% sure Heikki and I am on exactly the same page here :P > > I'm looking at git diff $(git merge-base upstream/master HEAD).. where > HEAD is e1a5822d164db0. Merged your stuff into my Github branch. Heikki is pushing changes there directly now. > * The logical stuff looks much saner. Cool. > * Please add tests for the logical decoding stuff. Probably both a plain > regression and and isolationtester test in > contrib/test_decoding. Including one that does spooling to disk. Working on it...hope to push that to Github soon. > * I don't like REORDER_BUFFER_CHANGE_INTERNAL_INSERT/DELETE as names. Why not > _SPECINSERT and _SPECDELETE or such? Changed that on Github. > * Iff we're going to have the XLOG_HEAP_AFFIRM record, I'd rather have > that guide the logical decoding code. Seems slightly cleaner. I thought that you didn't think that would always work out? That in some possible scenario it could break? > * Still not a fan of the name 'arbiter' for the OC indexes. What do you prefer? Seems to describe what we're talking about reasonably well to me. > * Gram.y needs a bit more discussion: > * Can anybody see a problem with changing the precedence of DISTINCT & > ON to nonassoc? Right now I don't see a problem given both are > reserved keywords already. Why did you push code that solved this in a roundabout way, but without actually reverting my original nonassoc changes (which would now not result in shift/reduce conflicts?). What should we do about that? Seems your unsure (otherwise you'd have reverted my thing). I don't like that you seem to have regressed diagnostic output [1]. Surely it's simpler to use the nonassoc approach? I think this works by giving the relevant keywords an explicit priority lower than '(', so that a rule with ON CONFLICT '(' will shift rather than reducing a conflicting rule (CONFLICT is an unreserved keyword). I wrote the code so long ago that I can't really remember why I thought it was the right thing, though. > * UpdateInsertStmt is a horrible name. OnConflictUpdateStmt maybe? Done on Github. > * '(' index_params where_clause ')' is imo rather strange. The where > clause is inside the parens? That's quite different from the > original index clause. I don't know. Maybe I was lazy about fixing shift/reduce conflicts. :-) I'll look at this some more. > * SPEC_IGNORE, /* INSERT of "ON CONFLICT IGNORE" */ looks like > a wrongly copied comment. Not sure what you mean here. Please clarify. > * The indentation in RoerderBufferCommit is clearly getting out of hand, > I've queued up a commit cleaning this up in my repo, feel free to merge. Done on Github. > * I don't think we use the term 'ordinary table' in error messages so > far. Fixed on Github. > * I still think it's unacceptable to redefine > XLOG_HEAP_LAST_MULTI_INSERT as XLOG_HEAP_SPECULATIVE_TUPLE like you > did. I'll try to find something better. I did what you suggested in your follow-up e-mail (changes are on Github [2]). > * I wonder if we now couldn't avoid changing heap_delete's API - we can > always super delete if we find a speculative insertion now. It'd be > nice not to break out of core callers if not necessary. Maybe, but if there is a useful way to break out only a small subset of heap_delete(), I'm not seeing it. Most of the callers that need a new NULL argument are heap_insert() callers, actually. There are now 3 heap_delete() callers, up from 2. 
> * breinbaas on IRC just mentioned that it'd be nice to have upsert as a > link in the insert. Given that that's the pervasive term that doesn't > seem absurd. Not sure what you mean. Where would the link appear? It is kind of hard to categorize that text so that we're strictly either talking about INSERT or UPSERT. Might be possible, though. > I think this is getting closer to a commit. Let's get this done. Great! The blockers to committing the IGNORE patch I see are: * We need to tweak some of the logical decoding stuff a bit more, I feel. Firm up on the details of whether we treat a confirmation record as a logical decoding change that is involved in the new dance during transaction reassembly. * We need to sort out those issues with the grammar, since that only really applies to the inference specification. Maybe the WHERE clause that the inference specification accepts can be broken out. No ON CONFLICT UPDATE specific issues left there, AFAICT though. That really seems totally doable in just a couple of days. The blockers for committing the ON CONFLICT UPDATE patch are larger, but doable: * We need to figure out the tuple lock strength details. I think this is doable, but it is the greatest challenge to committing ON CONFLICT UPDATE at this point. Andres feels that we should require no greater lock strength than an equivalent UPDATE. I suggest we get to this after committing the IGNORE variant. We probably need to
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 4/22/15 1:51 PM, Kevin Grittner wrote: (1) WAL log the insert. (2) Write the tuple. (3) Hint and rewrite the tuple. (4) WAL log the freeze of the tuple. (5) Rewrite the frozen tuple. (6) WAL-log the delete. (7) Rewrite the deleted tuple. (8) Prune and rewrite the page. (9) Free line pointers and rewrite the page. If I was lucky some of the writes could be combined in cache because they happened close enough together. Also, one could hope that not too much of the WAL-logging involved full page writes to the WAL -- again, keeping steps close together in time helps with that. This is why I like the idea of methods that tell us where we need to do cleanup... they provide us with a rough ability to track what tuples are in what part of their lifecycle. The VM helps with this a small amount, but really it only applies after 1 and 6; it doesn't help us with any other portions. Having a way to track recently created tuples would allow us to be much more efficient with 1-3, and with aggressive freezing, 1-5. A way to track recently deleted tuples would help with 6-7, possibly 6-9 if no indexes. If we doubled the size of the VM, that would let us track 4 states for each page: - Page has newly inserted tuples - Page has newly deleted tuples - Page is all visible - Page is frozen though as discussed elsewhere, we could probably combine all visible and frozen. The win from doing this would be easily knowing what pages need hinting (newly inserted) and pruning (newly deleted). Unfortunately we still wouldn't know whether we could do real work without visiting the page itself, but I suspect that for many workloads just having newly inserted/deleted would be a serious win. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
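[A hypothetical sketch of what a two-bits-per-page map could look like, purely to make the four-state idea above concrete; none of these names or the encoding exist in core, and the states are treated as mutually exclusive for simplicity.]

#define VM2_HAS_NEW_INSERTS   0x00    /* page likely needs hinting */
#define VM2_HAS_NEW_DELETES   0x01    /* page likely needs pruning */
#define VM2_ALL_VISIBLE       0x02
#define VM2_ALL_FROZEN        0x03

#define VM2_BITS_PER_PAGE     2
#define VM2_PAGES_PER_BYTE    4

static inline uint8
vm2_get_state(const uint8 *map, BlockNumber blkno)
{
    uint8   mapbyte = map[blkno / VM2_PAGES_PER_BYTE];
    int     shift = (blkno % VM2_PAGES_PER_BYTE) * VM2_BITS_PER_PAGE;

    return (mapbyte >> shift) & 0x03;
}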
Re: [HACKERS] Freeze avoidance of very large table.
On Tue, Apr 21, 2015 at 08:39:37AM +0200, Andres Freund wrote: > On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote: > > Didn't you think any of the TODO threads had workable solutions? And > > don't expect adding an additional file per relation will be zero cost > > --- added over the lifetime of 200M transactions, I question if this > > approach would be a win. > > Note that normally you'd not run with a 200M transaction freeze max age > on a busy server. Rather around a magnitude more. > > Think about this being used on a time partionioned table. Right now all > the partitions have to be fully rescanned on a regular basis - quite > painful. With something like this normally only the newest partitions > will have to be. My point is that for the life of 200M transactions, you would have the overhead of an additional file per table in the file system, and updates of that. I just don't know if the overhead over the long time period would be smaller than the VACUUM FREEZE. It might be fine --- I don't know. People seem to focus on the big activities, while many small activities can lead to larger slowdowns. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
On 4/22/15 6:12 PM, Bruce Momjian wrote: My point is that for the life of 200M transactions, you would have the overhead of an additional file per table in the file system, and updates of that. I just don't know if the overhead over the long time period would be smaller than the VACUUM FREEZE. It might be fine --- I don't know. People seem to focus on the big activities, while many small activities can lead to larger slowdowns. Ahh. This wouldn't be for the life of 200M transactions; it would be a permanent fork, just like the VM is. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
On Wed, Apr 22, 2015 at 3:23 PM, Peter Geoghegan wrote: > * We need to sort out those issues with the grammar, since that only > really applies to the inference specification. Maybe the WHERE clause > that the inference specification accepts can be broken out. No ON > CONFLICT UPDATE specific issues left there, AFAICT though. I pushed some code that deals with the predicate being within parentheses: https://github.com/petergeoghegan/postgres/commit/358854645279523310f998dfc9cb3fe3e165ce1e (the predicate now follows the attributes/expressions, in the style of CREATE INDEX). We still need to reconcile these changes to the grammar with your own, Andres. I'm going to wait to hear back on what you think about that. Note that this removal:
-%nonassoc DISTINCT
-%nonassoc ON
was incidental to the commit (this is the code you could have removed when you modified the grammar, adding a new production, but didn't). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
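[For readers following the grammar discussion, a rough sketch of what the inference specification looks like in use; the table, index, and column names are hypothetical, and the exact syntax was still being settled at this point:]

CREATE TABLE upsert_demo (key int, val text);
CREATE UNIQUE INDEX upsert_demo_key_idx ON upsert_demo (key) WHERE val IS NOT NULL;

-- The parenthesized column list plus optional WHERE predicate mirrors
-- CREATE INDEX; it is used to infer which unique index arbitrates conflicts.
INSERT INTO upsert_demo (key, val) VALUES (1, 'x')
ON CONFLICT (key) WHERE val IS NOT NULL
DO UPDATE SET val = EXCLUDED.val;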
Re: [HACKERS] inherit support for foreign tables
Hello, At Thu, 16 Apr 2015 14:43:33 -0700, David Fetter wrote in <20150416214333.ga...@fetter.org> > On Wed, Apr 15, 2015 at 09:35:05AM +0900, Kyotaro HORIGUCHI wrote: > > Hi, > > > > Before suppressing the symptom, I doubt the necessity and/or > > validity of giving foreign tables an ability to be a parent. Is > > there any reasonable usage for the ability? ... > I have a use case for having foreign tables as non-leaf nodes in a > partitioning hierarchy, namely geographic. Ah, I see. I understood the case of intermediate nodes. I agree that it is quite natural. > One might have a table at > HQ called foo_world, then partitions under it called foo_jp, foo_us, > etc., in one level, foo_us_ca, foo_us_pa, etc. in the next level, and > on down, each in general in a separate data center. > > Is there something essential about having non-leaf nodes as foreign > tables that's a problem here? No. I'm convinced of the necessity. Sorry for the noise. At Wed, 22 Apr 2015 17:00:10 -0400, Robert Haas wrote in > Gee, I don't see why that would be unreasonable or invalid Hmm. Yes, as mentioned above, there's no reason to refuse non-leaf foreign tables. I didn't understand the real cause of the problem and thought that not allowing foreign *root* tables seemed better than tweaking elsewhere. But that thought turned out to be totally garbage :( regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
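[To make the use case concrete, a rough sketch of the kind of geographic hierarchy David describes; every server, table, and column name here is invented for illustration, it assumes postgres_fdw is installed and that foreign tables are allowed as inheritance parents (which is exactly the point under discussion), and user mappings would still be needed before any of it could be queried:]

CREATE EXTENSION postgres_fdw;
CREATE SERVER us_hq FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'us-hq.example.com', dbname 'foo');
CREATE SERVER us_ca_dc FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'us-ca.example.com', dbname 'foo');

-- Root of the hierarchy, local at HQ:
CREATE TABLE foo_world (id int, country text, region text);

-- Non-leaf node: a foreign table that is itself the parent of further partitions.
CREATE FOREIGN TABLE foo_us (id int, country text, region text)
    INHERITS (foo_world) SERVER us_hq;

-- Leaf node in yet another data center:
CREATE FOREIGN TABLE foo_us_ca (id int, country text, region text)
    INHERITS (foo_us) SERVER us_ca_dc;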
Re: [HACKERS] TABLESAMPLE patch
On Thu, Apr 23, 2015 at 4:31 AM, Petr Jelinek wrote: > On 19/04/15 01:24, Michael Paquier wrote: >> >> On Sat, Apr 18, 2015 at 8:38 PM, Michael Paquier >> wrote: >>> >>> On Fri, Apr 17, 2015 at 10:54 PM, Petr Jelinek wrote: On 10/04/15 06:46, Michael Paquier wrote: > > 13) Some regression tests with pg_tablesample_method would be welcome. Not sure what you mean by that. >>> >>> >>> I meant a sanity check on pg_tablesample_method to be sure that >>> tsminit, tsmnextblock and tsmnexttuple are always defined as they are >>> mandatory functions. So the idea is to add a query like this and to be >>> sure that it returns no rows: >>> SELECT tsmname FROM pg_tablesample_method WHERE tsminit IS NOT NULL OR >>> tsmnextblock IS NOT NULL OR tsmnexttuple IS NOT NULL; >> >> >> Yesterday was a long day. I meant IS NULL and not IS NOT NULL, but I >> am sure you guessed it that way already.. >> > > Yes, I guessed that and it's a very reasonable request; I guess it should look > like the attached (I don't want to send a new version of everything just for > this). Thanks. That's exactly the idea. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
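[Putting the correction together, the sanity check presumably ends up looking something like this; a sketch based only on the catalog columns named upthread:]

-- Mandatory tablesample support functions must never be NULL,
-- so this query should return zero rows.
SELECT tsmname
FROM pg_tablesample_method
WHERE tsminit IS NULL
   OR tsmnextblock IS NULL
   OR tsmnexttuple IS NULL;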
Re: [HACKERS] Freeze avoidance of very large table.
On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote: > On 4/22/15 6:12 PM, Bruce Momjian wrote: > >My point is that for the life of 200M transactions, you would have the > >overhead of an additional file per table in the file system, and updates > >of that. I just don't know if the overhead over the long time period > >would be smaller than the VACUUM FREEZE. It might be fine --- I don't > >know. People seem to focus on the big activities, while many small > >activities can lead to larger slowdowns. > > Ahh. This wouldn't be for the life of 200M transactions; it would be > a permanent fork, just like the VM is. Right. My point is that either you do X 2M times to maintain that fork and the overhead of the file existence, or you do one VACUUM FREEZE. I am saying that 2M is a large number and adding all those X's might exceed the cost of a VACUUM FREEZE. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Tue, Apr 21, 2015 at 05:36:41PM -0400, Robert Haas wrote:
> On Mon, Apr 20, 2015 at 5:41 PM, Bruce Momjian wrote:
> > On Mon, Apr 20, 2015 at 05:04:14PM -0400, Robert Haas wrote:
> >> On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian wrote:
> >> > Slightly improved patch applied.
> >>
> >> Is it?
> >
> > The patch has a slightly modified 'if' statement to check a constant
> > before calling a function, and use else if:
> >
> > < +	if (!interpretOidsOption(stmt->options, true) && cxt.hasoids)
> > ---
> > > +	if (cxt.hasoids && !interpretOidsOption(stmt->options, true))
> > 47c57
> > < +	if (interpretOidsOption(stmt->options, true) && !cxt.hasoids)
> > ---
> > > +	else if (!cxt.hasoids && interpretOidsOption(stmt->options, true))
> >
> > I realize the change is subtle.
>
> What I meant was - I didn't see an attachment on that message.

I didn't attach it as people have told me they can just as easily see the patch via git, and since it was so similar, I didn't repost it. Should I have? I can easily do that.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
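[For context, the user-visible question this thread is about can be illustrated with something like the following sketch; table names are invented, and whether dst picks up OIDs from src is exactly what the patch changes:]

CREATE TABLE src (id int) WITH OIDS;
CREATE TABLE dst (LIKE src);
-- Does dst inherit relhasoids from src, or only from default_with_oids?
SELECT relname, relhasoids FROM pg_class WHERE relname IN ('src', 'dst');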
Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type
Moving thread to -hackers. On Wed, Apr 8, 2015 at 11:18 PM, Jeff Davis wrote: > That example was just for illustration. My other example didn't require > creating a table at all: > > SELECT a=b FROM (SELECT ''::text, ' ') x(a,b); > > it's fine with me if we want that to fail, but I don't think we're > failing in the right place, or with the right error message. > > I'm not clear on what rules we're applying such that the above query > should fail, but: > > SELECT ''::text=' '; > > should succeed. Unknown literals are OK, but unknown column references > are not? If that's the rule, can we catch that earlier, and throw an > error like 'column reference "b" has unknown type'? Is the behavior of unknown literals vs. unknown column references documented anywhere? I tried looking here: http://www.postgresql.org/docs/devel/static/typeconv.html, but it doesn't seem to make the distinction between how unknown literals vs. unknown column references are handled. My understanding until now has been that unknown types are a placeholder while still inferring the types. But that leaves a few questions: 1. Why do we care whether the unknown is a literal or a column reference? 2. Unknown column references seem basically useless, because we will almost always encounter the "failure to find conversion" error, so why do we allow them? 3. If unknowns are just a placeholder, why do we return them to the client? What do we expect the client to do with it? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablespaces inside $PGDATA considered harmful
On Fri, Jan 30, 2015 at 01:26:22PM -0800, Josh Berkus wrote:
> Robert, Stephen, etc.:
>
> Apparently you can create a tablespace in the tablespace directory:
>
> josh=# create tablespace tbl location '/home/josh/pg94/data/pg_tblspc/';
> CREATE TABLESPACE
> josh=# create table test_tbl ( test text ) tablespace tbl;
> CREATE TABLE
> josh=# \q
> josh@Radegast:~/pg94/data/pg_tblspc$ ls
> 17656 PG_9.4_201409291
> josh@Radegast:~/pg94/data/pg_tblspc$ ls -l
> total 4
> lrwxrwxrwx 1 josh josh 30 Jan 30 13:02 17656 -> /home/josh/pg94/data/pg_tblspc
> drwx-- 3 josh josh 4096 Jan 30 13:02 PG_9.4_201409291
> josh@Radegast:~/pg94/data/pg_tblspc$
>
> In theory if I could guess the next OID, I could cause a failure there,
> but that appears to be obscure enough to be not worth bothering about.
>
> What is a real problem is that we don't block creating tablespaces
> anywhere at all, including in obviously problematic places like the
> transaction log directory:
>
> josh=# create tablespace tbl2 location '/home/josh/pg94/data/pg_xlog/';
> CREATE TABLESPACE
>
> It really seems like we ought to block *THAT*. Of course, if we block
> tablespace creation in PGDATA generally, then that's covered.

I have developed the attached patch to warn about creating tablespaces inside the data directory. The case this doesn't catch is referencing a symbolic link that points to the same directory. We can't make it an error so people can use pg_upgrade on these setups.

This would be for 9.5 only.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
new file mode 100644
index fd22612..4ec1aff
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
*** CreateTableSpace(CreateTableSpaceStmt *s
*** 288,293
--- 288,299
  				 errmsg("tablespace location \"%s\" is too long",
  						location)));

+ 	/* Warn if the tablespace is in the data directory. */
+ 	if (path_is_prefix_of_path(DataDir, location))
+ 		ereport(WARNING,
+ 				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ 				 errmsg("tablespace location should not be inside the data directory")));
+
  	/*
  	 * Disallow creation of tablespaces named "pg_xxx"; we reserve this
  	 * namespace for system purposes.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
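[With a patch along these lines applied, the user-visible behavior would presumably look something like this; the path is illustrative only, and the location must still exist and be empty as usual:]

-- Creating a tablespace somewhere under PGDATA now draws a warning
-- (the command itself still succeeds):
CREATE TABLESPACE tbl LOCATION '/home/josh/pg94/data/my_tblspc';
-- WARNING:  tablespace location should not be inside the data directory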
Re: [HACKERS] inherit support for foreign tables
On 2015/04/23 0:34, Stephen Frost wrote: * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote: postgres=# select * from ft1 for update; ERROR: could not find junk tableoid1 column I think this is a bug. Attached is a patch fixing this issue. Pushed, thanks! Thank you. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
Hi Robert, Thanks for your comments. > A few random cosmetic problems: > > - The hunk in allpaths.c is useless. > - The first hunk in fdwapi.h contains an extra space before the > closing parenthesis. > OK, it's my oversight. > And then: > > + else if (scan->scanrelid == 0 && > +(IsA(scan, ForeignScan) || IsA(scan, CustomScan))) > + varno = INDEX_VAR; > > Suppose scan->scanrelid == 0 but the scan type is something else? Is > that legal? Is varno == 0 the correct outcome in that case? > Right now, no other scan type has the capability to return tuples with flexible types/attributes beyond its static definition. I think it is a valid restriction that only foreign/custom-scan can have scanrelid == 0. I checked overall code again. One doubtful point was ExecScanFetch(). If estate->es_epqTuple is not NULL, it tries to save a tuple from a particular scanrelid (larger than zero). IIUC, es_epqTuple is used only when a fetched tuple has been updated and visibility checks have to be applied again for the writer operation. So it should work for a CSP with an underlying actual scan node on base relations; however, I need to investigate the code for the case where an FDW/CSP replaces an entire join subtree with an alternative relation scan (like a materialized view). > > [ new patch ] > > A little more nitpicking: > > ExecInitForeignScan() and ExecInitCustomScan() could declare > currentRelation inside the if (scanrelid > 0) block instead of in the > outer scope. > OK, > I'm not too excited about the addition of GetFdwHandlerForRelation, > which is a one-line function used in one place. It seems like we > don't really need that. > OK, > On Fri, Mar 13, 2015 at 8:02 PM, Tom Lane wrote: > > Robert Haas writes: > >> On Fri, Mar 13, 2015 at 2:31 PM, Tom Lane wrote: > >>> I don't object to the concept, but I think that is a pretty bad place > >>> to put the hook call: add_paths_to_joinrel is typically called multiple > >>> (perhaps *many*) times per joinrel and thus this placement would force > >>> any user of the hook to do a lot of repetitive work. > > > >> Interesting point. I guess the question is whether some or all > >> callers are going to actually *want* a separate call for each > >> invocation of add_paths_to_joinrel(), or whether they'll be happy to > >> operate on the otherwise-complete path list. > > > > Hmm. You're right, it's certainly possible that some users would like to > > operate on each possible pair of input relations, rather than considering > > the joinrel "as a whole". Maybe we need two hooks, one like your patch > > and one like I suggested. > > Let me attempt to summarize subsequent discussion on this thread by > saying the hook location that you proposed (just before set_cheapest) > has not elicited any enthusiasm from anyone else. In a nutshell, the > problem is that a single callback for a large join problem is just > fine if there are no special joins involved, but in any other > scenario, nobody knows how to use a hook at that location for anything > useful. To push down a join to the remote server, you've got to > figure out how to emit an SQL query for it. To execute it with a > custom join strategy, you've got to know which of those joins should > have inner join semantics vs. left join semantics. A hook/callback in > make_join_rel() or in add_paths_to_joinrel() makes that relatively > straightforward. Otherwise, it's not clear what to do, short of > copy-and-pasting join_search_one_level(). If you have a suggestion, > I'd like to hear it. > Nothing I have. 
When I tried to put a hook just after set_cheapest(), the largest problem was that we cannot extract the left and right relations from a set of joined relations, like extracting apple and orange from mixed juice. > If not, I'm going to press forward with the idea of putting the > relevant logic in either add_paths_to_joinrel(), as previously > proposed, or perhaps up one level in make_one_rel(). Either way, if > you don't need to be called multiple times per joinrel, you can stash > a flag inside whatever you hang off of the joinrel's fdw_private and > return immediately on every call after the first. I think that's > cheap enough that we shouldn't get too stressed about it: for FDWs, we > only call the hook at all if everything in the joinrel uses the same > FDW, so it won't get called at all except for joinrels where it's > likely to win big; for custom joins, multiple calls are quite likely > to be useful and necessary, and if the hook burns too much CPU time > for the query performance you get out of it, that's the custom-join > provider's fault, not ours. The current patch takes this approach one > step further and attempts FDW pushdown only once per joinrel. It does > that because, while postgres_fdw DOES need the jointype and a valid > innerrel/outerrel breakdown to figure out what query to generate, it > does NOT need every possible breakdown; rather, the first one is as good as > any other. But
Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type
My apologies if much of this is already assumed knowledge by most -hackers...I'm trying to learn from observation instead of, largely, reading code in a foreign language. On Wed, Apr 22, 2015 at 6:40 PM, Jeff Davis wrote: > Moving thread to -hackers. > > On Wed, Apr 8, 2015 at 11:18 PM, Jeff Davis wrote: > > That example was just for illustration. My other example didn't require > > creating a table at all: > > > > > > > SELECT a=b FROM (SELECT ''::text, ' ') x(a,b); > > > > it's fine with me if we want that to fail, but I don't think we're > > failing in the right place, or with the right error message. > > > > I'm not clear on what rules we're applying such that the above query > > should fail, but: > > > > > > SELECT ''::text=' '; > > > should succeed. Unknown literals are OK, but unknown column references > > are not? If that's the rule, can we catch that earlier, and throw an > > error like 'column reference "b" has unknown type'? > But the fact that column "b" has the data type "unknown" is only a warning - not an error. This seems to be a case of the common problem (or, at least recently mentioned) where type conversion only deals with data and not context. http://www.postgresql.org/message-id/CADx9qBmVPQvSH3+2cH4cwwPmphW1mE18e=wumlfuc-qz-t7...@mail.gmail.com Additional hinting regarding the column containing the offending data would be welcomed by the community - but I suspect it is a non-trivial endeavor. > Is the behavior of unknown literals vs. unknown column references > documented anywhere? I tried looking here: > http://www.postgresql.org/docs/devel/static/typeconv.html, but it > doesn't seem to make the distinction between how unknown literals vs. > unknown column references are handled. > > My understanding until now has been that unknown types are a > placeholder while still inferring the types. But that leaves a few > questions: > > 1. Why do we care whether the unknown is a literal or a column reference? > Apparently the difference is in when non-implicit casts can be used for coercion - or, rather, when input functions can be used instead of casting functions. In SELECT ' '::text = 'a', the explicit cast between the implicit unknown and text is used, while going through the subquery forces the planner to locate an implicit cast between the explicit unknown and text. The following fails no matter what you try because no casts exist from unknown to integer: SELECT a::int=b FROM (SELECT '1', 1) x(a,b); but this too works - which is why the implicit cast concept above fails (I'm leaving it since the thought process may help in understanding): SELECT 1 = '1'; From which I infer that an unknown literal is allowed to be fed directly into a type's input function to facilitate a direct coercion. Writing this makes me wish for more precise terminology...is there something already established here? "untyped" versus "unknown" makes sense to me. Untyped literals only exist within the confines of a single node and can be passed through a type's input function to make them take on a type. If the untyped reference passes through the node without having been assigned an explicit type it is assigned the unknown type. > 2. Unknown column references seem basically useless, because we will > almost always encounter the "failure to find conversion" error, so why > do we allow them? > At this point...backward compatibility? 
I do get a warning in psql (9.3.6) from your original -bugs example:

create table a(u) as select '1';
WARNING: column "u" has type "unknown"
DETAIL: Proceeding with relation creation anyway.

Related question: was there ever a time when the above failed instead of just supplying a warning? My git-fu is not super strong but the above warning was last edited by Tom Lane back in 2003 (d8528630) though it was just a refactor - the warning was already present. I suppose after 12 years the "why" doesn't matter so much...

create table b(i int);
insert into b select u from a;
ERROR: failed to find conversion function from unknown to integer

Text appears to have a cast defined: SELECT u::text FROM a; > 3. If unknowns are just a placeholder, why do we return them to the > client? What do we expect the client to do with it? We do? I suspect that it is effectively treated as if it were text by client libraries. My gut reaction is that if you feel strongly enough to add some additional documentation or warnings/hints/details related to this topic they probably would get put in; but disallowing "unknown" as a first-class type is likely to fail to pass a cost-benefit evaluation. Distinguishing between "untyped" literals and "unknown type" literals seems a promising concept to aid in understanding the difference in the face of not being able (or wanting) to actually change the behavior. David J.
[HACKERS] Code paths where LWLock should be released on failure
Hi all, After looking at bug #13128, I have been looking at the code around LWLockAcquire/Release to see if there are similar issues elsewhere. Here are my findings: 1) SimpleLruReadPage() holds a control lock at entry that will be held at exit as well. However SlruReportIOError() can report an error, letting the lock hold. Shouldn't we release the control lock when a failure happens? 2) The patch attached to #13128 fixes MarkAsPreparing(), but actually twophase.c also does not release some locks in LockGXact(). 3) PreCommit_Notify@async.c should release AsyncQueueLock on failure I guess because it is called at transaction commit. At least it looks safer this way. 4) TablespaceCreateDbspace does not release TablespaceCreateLock but LWLockReleaseAll would do it when aborting its transaction, so no changes done there (?). 5) In ReplicationSlotCreate@slot.c, I would think that ReplicationSlotAllocationLock should be released when all the locks are in use. Similarly, in ReplicationSlotDropAcquired, ReplicationSlotAllocationLock should be released when !fail_softly. SaveSlotToPath could also be made safer when a file could not be created. 6) In dsm.c, dsm_create does not release DynamicSharedMemoryControlLock when Error'ing when there are too many shared memory segments. 7) In shmem.c, ShmemInitStruct does not release ShmemIndexLock on OOM. I guess that's fine in bootstrap mode, still... 8) In lock.c, LockRelease() does not release partitionLock when a shared lock cannot be found. Similar thing with FastPathGetRelationLockEntry(). 9) In predicate.c, CreatePredicateLock() forgets to release SerializablePredicateLockListLock and partitionLock in case of an OOM. There is a similar issue with ReleaseOneSerializableXact(), CheckForSerializableConflictOut() and predicatelock_twophase_recover(). 10) In relcache.c, RelCacheInitLock is not released in RelationCacheInitFilePreInvalidate() after unlink() failure. 11) In AlterSystemSetConfigFile(), AutoFileLock should be released on failure. All those things give the patch attached. Comments are welcome. Regards, -- Michael diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index b85a666..e8d6754 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -350,6 +350,7 @@ MarkAsPreparing(TransactionId xid, const char *gid, gxact = TwoPhaseState->prepXacts[i]; if (strcmp(gxact->gid, gid) == 0) { + LWLockRelease(TwoPhaseStateLock); ereport(ERROR, (errcode(ERRCODE_DUPLICATE_OBJECT), errmsg("transaction identifier \"%s\" is already in use", @@ -359,11 +360,14 @@ MarkAsPreparing(TransactionId xid, const char *gid, /* Get a free gxact from the freelist */ if (TwoPhaseState->freeGXacts == NULL) + { + LWLockRelease(TwoPhaseStateLock); ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("maximum number of prepared transactions reached"), errhint("Increase max_prepared_transactions (currently %d).", max_prepared_xacts))); + } gxact = TwoPhaseState->freeGXacts; TwoPhaseState->freeGXacts = gxact->next; @@ -497,16 +501,22 @@ LockGXact(const char *gid, Oid user) /* Found it, but has someone else got it locked? 
*/ if (gxact->locking_backend != InvalidBackendId) + { + LWLockRelease(TwoPhaseStateLock); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("prepared transaction with identifier \"%s\" is busy", gid))); + } if (user != gxact->owner && !superuser_arg(user)) + { + LWLockRelease(TwoPhaseStateLock); ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("permission denied to finish prepared transaction"), errhint("Must be superuser or the user that prepared the transaction."))); + } /* * Note: it probably would be possible to allow committing from @@ -515,10 +525,13 @@ LockGXact(const char *gid, Oid user) * someone gets motivated to make it work. */ if (MyDatabaseId != proc->databaseId) + { + LWLockRelease(TwoPhaseStateLock); ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("prepared transaction belongs to another database"), errhint("Connect to the database where the transaction was prepared to finish it."))); + } /* OK for me to lock it */ gxact->locking_backend = MyBackendId; diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c index 2826b7e..1224e4d 100644 --- a/src/backend/commands/async.c +++ b/src/backend/commands/async.c @@ -839,9 +839,12 @@ PreCommit_Notify(void) LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE); asyncQueueFillWarning(); if (asyncQueueIsFull()) + { +LWLockRelease(AsyncQueueLock); ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("too many notifications in the NOTIFY queue"))); + } nextNotify = asyncQueueAddEntries(nextNotify); LWLockRelease(AsyncQueue
Re: [HACKERS] Code paths where LWLock should be released on failure
Michael Paquier writes: > After looking at bug #13128, I have been looking at the code around > LWLockAcquire/Release to see if there are similar issues elsewhere. > Here are my findings: IIRC, we automatically release all LWLocks at the start of error recovery, so I rather doubt that any of this is necessary. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bogus WAL segments archived after promotion
On Mon, Apr 13, 2015 at 11:57 PM, Heikki Linnakangas wrote: > On 04/01/2015 07:12 PM, Bruce Momjian wrote: >> >> On Fri, Dec 19, 2014 at 10:26:34PM +0200, Heikki Linnakangas wrote: >>> >>> On 12/19/2014 02:55 PM, Heikki Linnakangas wrote: I'm thinking that we should add a step to promotion, where we scan pg_xlog for any segments higher than the timeline switch point, and remove them, or mark them with .done so that they are not archived. There might be some real WAL that was streamed from the primary, but not yet applied, but such WAL is of no interest to that server anyway, after it's been promoted. It's a bit disconcerting to zap WAL that's valid, even if it doesn't belong to the current server's timeline history, because as a general rule it's good to avoid destroying evidence that might be useful in debugging. There isn't much difference between removing them immediately and marking them as .done, though, because they will eventually be removed/recycled anyway if they're marked as .done. >>> >>> >>> This is what I came up with. This patch removes the suspect segments >>> at timeline switch. The alternative of creating .done files for them >>> would preserve more evidence for debugging, but OTOH it would also >>> be very confusing to have valid-looking WAL segments in pg_xlog, >>> with .done files, that in fact contain garbage. >>> >>> The patch is a bit longer than it otherwise would be, because I >>> moved the code to remove a single file from RemoveOldXlogFiles() to >>> a new function. I think that makes it more readable in any case, >>> simply because it was so deeply indented in RemoveOldXlogFiles. >> >> >> Where are we on this? > > > I didn't hear any better ideas, so committed this now. Finally looking at that... The commit log of b2a5545 is a bit misleading. Segment files that were recycled during archive recovery are not necessarily removed; they could be recycled as well during promotion on the new timeline, in line with what RemoveOldXlogFiles() does. Hence I think that the comment on top of RemoveNonParentXlogFiles() should be updated to reflect that, like in the patch attached. Something minor: perhaps we could refactor xlogarchive.c to have XLogArchiveCheckDone() and XLogArchiveIsBusy() use the new XLogArchiveIsReady(). Regards, -- Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..e0a8b81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3596,7 +3596,8 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 }

 /*
- * Remove WAL files that are not part of the given timeline's history.
+ * Recycle or remove WAL files that are not part of the given timeline's
+ * history.
  *
  * This is called during recovery, whenever we switch to follow a new
  * timeline, and at the end of recovery when we create a new timeline. We

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Code paths where LWLock should be released on failure
On Thu, Apr 23, 2015 at 2:46 PM, Tom Lane wrote: > Michael Paquier writes: >> After looking at bug #13128, I have been looking at the code around >> LWLockAcquire/Release to see if there are similar issues elsewhere. >> Here are my findings: > > IIRC, we automatically release all LWLocks at the start of error recovery, > so I rather doubt that any of this is necessary. Perhaps this is purely cosmetic for most of this code, but not all... if you have a look at #13128, not releasing TwoPhaseStateLock causes a deadlock on startup when max_prepared_transactions does not have enough slots. I have also been surprised by the inconsistencies regarding LWLock releases, particularly in predicate.c and other places. Sometimes they are released on failure, sometimes not. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
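[For context, the MarkAsPreparing() error path touched by the patch can be reached from SQL once the prepared-transaction slots run out; a rough illustration, assuming max_prepared_transactions = 1 and invented gid names, keeping in mind that the deadlock reported in #13128 involves the same error being raised during startup rather than in a normal backend:]

BEGIN;
SELECT txid_current();               -- force an xid assignment
PREPARE TRANSACTION 'gxact_one';     -- takes the only slot

BEGIN;
SELECT txid_current();
PREPARE TRANSACTION 'gxact_two';
-- ERROR:  maximum number of prepared transactions reached
-- HINT:  Increase max_prepared_transactions (currently 1).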
Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type
On Wed, 2015-04-22 at 20:35 -0700, David G. Johnston wrote: > But the fact that column "b" has the data type "unknown" is only a > warning - not an error. > I get an error:

postgres=# SELECT ' '::text = 'a';
 ?column?
----------
 f
(1 row)

postgres=# SELECT a=b FROM (SELECT ''::text, ' ') x(a,b);
ERROR: failed to find conversion function from unknown to text

So that means the column reference "b" is treated differently than the literal. Here I don't mean a reference to an actual column of a real table, just an identifier ("b") that parses as a columnref. Creating the table gives you a warning (not an error), but I think that was a poor example for me to choose, and not important to my point. > > This seems to be a case of the common problem (or, at least recently > mentioned) where type conversion only deals with data and not context. > > > http://www.postgresql.org/message-id/CADx9qBmVPQvSH3 > +2cH4cwwPmphW1mE18e=wumlfuc-qz-t7...@mail.gmail.com > > I think that is a different problem. That's a runtime type conversion error (execution time), and I'm talking about something happening at parse analysis time. > > but this too works - which is why the implicit cast concept above > fails (I'm leaving it since the thought process may help in > understanding): > > > SELECT 1 = '1'; > > > From which I infer that an unknown literal is allowed to be fed > directly into a type's input function to facilitate a direct coercion. Yes, I believe that's what's happening. When we use an unknown literal, it's acting more like a value constructor and will pass it to the type input function. When it's a columnref, even if unknown, it tries to cast it and fails. But that is very confusing. In the example at the top of this email, it seems like the second query should be equivalent to the first, or even that postgres should be able to rewrite the second into the first. But the second query fails where the first succeeds. > At this point...backward compatibility? Backwards compatibility of what queries? I guess the ones that return unknowns to the client or create tables with unknown columns? > create table a(u) as select '1'; > > > WARNING: column "u" has type "unknown" > DETAIL: Proceeding with relation creation anyway. > > > Related question: was there ever a time when the above failed instead > of just supplying a warning? Not that I recall. > My gut reaction is that if you feel strongly enough to add some additional > documentation or warnings/hints/details related to this topic they > probably would get put in; but disallowing "unknown" as a first-class > type is likely to fail to pass a cost-benefit evaluation. I'm not proposing that we eliminate unknown. I just think columnrefs and literals should behave consistently. If we really don't want unknown columnrefs, it seems like we could at least throw a better error. If we were starting from scratch, I'd also not return unknown to the client, but we have to worry about the backwards compatibility. > Distinguishing between "untyped" literals and "unknown type" literals > seems a promising concept to aid in understanding the difference in the > face of not being able (or wanting) to actually change the behavior. Not sure I understand that proposal, can you elaborate? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers