inserts into partitioned table may cause crash
I've run into what seems to be a bug in ExecInsert() that causes a crash when inserting multiple rows into a partitioned table that each go into different partitions with different tuple descriptors. The crash occurs if ExecInsert() returns without resetting estate->es_result_relation_info back to the root table's resultRelInfo -- for example, if a BR trigger on a partition returns NULL for a row.

Crashing example:

create table p (a int, b text) partition by list (a);
create table p12 (b text, a int);

-- p12 has different tuple descriptor than p
alter table p attach partition p12 for values in (1, 2);

create table p4 partition of p for values in (4);

create function blackhole () returns trigger as $$ begin return NULL; end; $$ language plpgsql;
create trigger blackhole before insert on p12 for each row execute procedure blackhole();

insert into p values (1, 'b'), (4, 'a');
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The crash is caused because we enter ExecFindPartition() with p12's resultRelInfo as if it corresponded to the root table. That happens because we didn't reset estate->es_result_relation_info, which had been set to p12's resultRelInfo, to point back to the original resultRelInfo (that is, p's) before returning, like below:

    slot = ExecIRInsertTriggers(estate, resultRelInfo, slot);

    if (slot == NULL)       /* "do nothing" */
        return NULL;

There are other places where we prematurely return like that.

Attached a patch to fix that, which would need to be back-patched to 10.
Thanks,
Amit

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c32928d9bd..0d6326d1de 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -281,7 +281,7 @@ ExecInsert(ModifyTableState *mtstate,
 	/*
 	 * get information on the (current) result relation
 	 */
-	resultRelInfo = estate->es_result_relation_info;
+	saved_resultRelInfo = resultRelInfo = estate->es_result_relation_info;

 	/* Determine the partition to heap_insert the tuple into */
 	if (mtstate->mt_partition_tuple_routing)
@@ -408,7 +408,10 @@ ExecInsert(ModifyTableState *mtstate,
 		slot = ExecBRInsertTriggers(estate, resultRelInfo, slot);

 		if (slot == NULL)		/* "do nothing" */
-			return NULL;
+		{
+			result = NULL;
+			goto restore_estate_result_rel;
+		}

 		/* trigger might have changed tuple */
 		tuple = ExecMaterializeSlot(slot);
@@ -421,7 +424,10 @@ ExecInsert(ModifyTableState *mtstate,
 		slot = ExecIRInsertTriggers(estate, resultRelInfo, slot);

 		if (slot == NULL)		/* "do nothing" */
-			return NULL;
+		{
+			result = NULL;
+			goto restore_estate_result_rel;
+		}

 		/* trigger might have changed tuple */
 		tuple = ExecMaterializeSlot(slot);
@@ -439,7 +445,10 @@ ExecInsert(ModifyTableState *mtstate,
 											   planSlot);

 		if (slot == NULL)		/* "do nothing" */
-			return NULL;
+		{
+			result = NULL;
+			goto restore_estate_result_rel;
+		}

 		/* FDW might have changed tuple */
 		tuple = ExecMaterializeSlot(slot);
@@ -495,7 +504,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * No need though if the tuple has been routed, and a BR trigger
 		 * doesn't exist.
 		 */
-		if (saved_resultRelInfo != NULL &&
+		if (saved_resultRelInfo != resultRelInfo &&
 			!(resultRelInfo->ri_TrigDesc &&
 			  resultRelInfo->ri_TrigDesc->trig_insert_before_row))
 			check_partition_constr = false;
@@ -544,7 +553,8 @@ ExecInsert(ModifyTableState *mtstate,
 								 estate, canSetTag, &returning))
 			{
 				InstrCountFiltered2(&mtstate->ps, 1);
-				return returning;
+				result = returning;
+				goto restore_estate_result_rel;
 			}
 			else
 				goto vlock;
@@ -559,7 +569,8 @@
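For readers skimming the thread, the invariant the patch enforces can be boiled down to a tiny, self-contained C sketch (plain C with stand-in types, not the actual executor code: the `EState` and `ResultRelInfo` structs and the trigger check here are simplified stand-ins):

```c
#include <assert.h>

/* Hypothetical stand-ins for the real EState and ResultRelInfo. */
typedef struct ResultRelInfo { int relid; } ResultRelInfo;
typedef struct EState { ResultRelInfo *es_result_relation_info; } EState;

/*
 * Sketch of ExecInsert's routing discipline: every exit path, including
 * the "do nothing" early exits, must restore es_result_relation_info to
 * the value it had on entry (the root table's resultRelInfo).
 */
static int
exec_insert_sketch(EState *estate, ResultRelInfo *partition, int trigger_says_skip)
{
	ResultRelInfo *saved = estate->es_result_relation_info;	/* remember root */
	int			result;

	estate->es_result_relation_info = partition;	/* tuple routed to partition */

	if (trigger_says_skip)
	{
		result = 0;				/* "do nothing" */
		goto restore;			/* an early return here is the bug */
	}
	result = 1;					/* ... heap_insert() etc. would go here ... */

restore:
	estate->es_result_relation_info = saved;	/* always undo the routing */
	return result;
}
```

The crash in the report corresponds to taking the `trigger_says_skip` path without the `goto restore`, so the next row's routing starts from the partition instead of the root.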
Re: Wait event names mismatch: oldserxid
On Tue, Feb 27, 2018 at 02:44:37PM -0500, Robert Haas wrote:
> On Fri, Feb 9, 2018 at 8:53 AM, Michael Paquier wrote:
>> So the docs look correct to me on this side. What I find weird is the
>> phrasing to define oldserxid. Instead of the current description:
>> "Waiting to I/O on an oldserxid buffer."
>> I would suggest "waiting *for* I/O".
>>
>> A small patch is attached.
>
> Agreed. Committed.

Thanks, Robert.
--
Michael
Re: inserts into partitioned table may cause crash
On 2018/02/28 17:36, Amit Langote wrote: > I've run into what seems to be a bug in ExecInsert() that causes a crash > when inserting multiple rows into a partitioned table that each go into > different partitions with different tuple descriptors. Crash occurs if > ExecInsert() returns without resetting estate->es_result_relation_info > back to the root table's resultRelInfo. For example, if a BR trigger on a > partition returns NULL for a row. > > Crashing example: > > create table p (a int, b text) partition by list (a); > create table p12 (b text, a int); > > -- p12 has different tuple descriptor than p > alter table p attach partition p12 for values in (1, 2); > > create table p4 partition of p for values in (4); > > create function blackhole () returns trigger as $$ begin return NULL; end; > $$ language plpgsql; > create trigger blackhole before insert on p12 for each row execute > procedure blackhole(); > > insert into p values (1, 'b'), (4, 'a'); > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > > Crash is caused because we enter into ExecFindPartition with p12's > resultRelInfo as if it correponds to the root table. That happens because > we didn't reset estate->es_result_relation_info, which had been set to > p12's resultRelInfo to point back to the original resultRelInfo (that is, > p's) before returning like below: > >slot = ExecIRInsertTriggers(estate, resultRelInfo, slot); > > if (slot == NULL) /* "do nothing" */ > return NULL; > > There are other places where we prematurely return like that. > > Attached a patch to fix that, which would need to be back-patched to 10. 
BTW, this won't crash if the tuple descriptors match, but one would get an unintuitive error like this:

create table p (a int, b text) partition by list (a);
create table p12 partition of p for values in (1, 2);
create table p4 partition of p for values in (4);

create trigger blackhole before insert on p12 for each row execute procedure blackhole();

insert into p values (1, 'a'), (4, 'a');
ERROR:  new row for relation "p12" violates partition constraint
DETAIL:  Failing row contains (4, a).

That is, after trying to insert (4, 'a') into p12 as if it were the root. Still, it's a bug all the same.

Thanks,
Amit
Re: [HACKERS] path toward faster partition pruning
On 27 February 2018 at 22:33, Amit Langote wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested to OR clauses handling (in partprune.c) that
> he posted as a separate patch on the run-time pruning thread [1].

Thanks for fixing that up and including it.

Micro review of v34:

1. Looks like you've renamed the parttypid parameter in the definition of partkey_datum_from_expr and partition_cmp_args, but not updated the declaration too.

+static bool partkey_datum_from_expr(Oid parttypid, Expr *expr, Datum *value);

+static bool
+partkey_datum_from_expr(Oid partopcintype, Expr *expr, Datum *value)

+static bool partition_cmp_args(Oid parttypid, Oid partopfamily,
+				   PartClause *pc, PartClause *leftarg, PartClause *rightarg,
+				   bool *result);

+static bool
+partition_cmp_args(Oid partopcintype, Oid partopfamily,
+				   PartClause *pc, PartClause *leftarg, PartClause *rightarg,
+				   bool *result)

2. In prune_append_rel_partitions(), it's not exactly illegal, but int i is declared twice in different scopes. Looks like there's no need for the inner one.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Function to track shmem reinit time
Attached patch introduces a new function pg_shmem_init_time(), which returns the time shared memory was last (re)initialized. It is created for use by monitoring tools to track backend crashes.

Currently, if the 'restart_after_crash' option is on, postgres will just restart, and the only way to know that it happened is to regularly parse the logfile or monitor it, catching restart messages. This approach is really inconvenient for users who have gigabytes of logs. This new function can be periodically called by a monitoring agent, and, if shmem_init_time doesn't match pg_postmaster_start_time, we know that the server crashed and restarted, and we also know the exact time when.

Also, while working on this patch, I noticed a bit of dead code and some discordant comments in postmaster.c. I see no reason to leave it as is, so there is a small remove_dead_shmem_reinit_code_v0.patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit d396ef1ccf12be031c1cf90cc5a357da4933a3b8
Author: Anastasia
Date:   Wed Feb 28 13:50:18 2018 +0300

    Remove some dead code related to shmem reinit.
    Fix outdated comments.

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f3ddf82..a397260 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -223,11 +223,10 @@ static char ExtraOptions[MAXPGPATH];
 /*
  * These globals control the behavior of the postmaster in case some
  * backend dumps core. Normally, it kills all peers of the dead backend
- * and reinitializes shared memory. By specifying -s or -n, we can have
+ * and reinitializes shared memory. By specifying -T, we can have
  * the postmaster stop (rather than kill) peers and not reinitialize
- * shared data structures. (Reinit is currently dead code, though.)
+ * shared data structures. It lets us to collect core dumps of all processes.
  */
-static bool Reinit = true;
 static int SendStop = false;

 /* still more option variables */
@@ -738,11 +737,6 @@ PostmasterMain(int argc, char *argv[])
 			SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 			break;

-		case 'n':
-			/* Don't reinit shared mem after abnormal exit */
-			Reinit = false;
-			break;
-
 		case 'O':
 			SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 			break;
@@ -3427,7 +3421,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 *
 	 * SIGQUIT is the special signal that says exit without proc_exit
 	 * and let the user know what's going on. But if SendStop is set
-	 * (-s on command line), then we send SIGSTOP instead, so that we
+	 * (-T on command line), then we send SIGSTOP instead, so that we
 	 * can get core dumps from all backends by hand.
 	 *
 	 * We could exclude dead_end children here, but at least in the

commit 13817de092ee53f2af2d73270ddc3e02556d7c0c
Author: Anastasia
Date:   Wed Feb 28 14:04:00 2018 +0300

    Add function pg_shmem_init_time() which returns the time shared memory
    was last initialized

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2f59af2..bebdddf 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15924,6 +15924,12 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
+       pg_shmem_init_time()
+       timestamp with time zone
+       shared memory initialization time
+
+
        pg_current_logfile(text)
        text
        Primary log file name, or log in the requested format,

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a397260..7cc9780 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -512,6 +512,7 @@ typedef struct
 	pid_t		PostmasterPid;
 	TimestampTz PgStartTime;
 	TimestampTz PgReloadTime;
+	TimestampTz PgShmemInitTime;
 	pg_time_t	first_syslogger_file_time;
 	bool		redirection_done;
 	bool		IsBinaryUpgrade;
@@ -2555,6 +2556,8 @@ reset_shared(int port)
 	 * objects if the postmaster crashes and is restarted.
 	 */
 	CreateSharedMemoryAndSemaphores(false, port);
+
+	PgShmemInitTime = GetCurrentTimestamp();
 }
@@ -6058,6 +6061,7 @@ save_backend_variables(BackendParameters *param, Port *port,
 	param->PostmasterPid = PostmasterPid;
 	param->PgStartTime = PgStartTime;
 	param->PgReloadTime = PgReloadTime;
+	param->PgShmemInitTime = PgShmemInitTime;
 	param->first_syslogger_file_time = first_syslogger_file_time;
 	param->redirection_done = redirection_done;
@@ -6290,6 +6294,7 @@ restore_backend_variables(BackendParameters *param, Port *port)
 	PostmasterPid = param->PostmasterPid;
 	PgStartTime = param->PgStartTime;
 	PgReloadTime = param->PgReloadTime;
+	PgShmemInitTime = param->PgShmemInitTime;
 	first_syslogger_file_time = param->first_syslogger_file_time;
 	redirection_done =
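If the patch goes in, a monitoring agent could detect a crash-restart with a query along these lines. This is only a sketch against the proposed API: pg_shmem_init_time() does not exist in stock PostgreSQL, and it assumes the patch's stated semantics that the two timestamps match on a clean start and diverge after a crash-restart.

```sql
-- Hypothetical usage of the proposed pg_shmem_init_time():
-- shared memory is reinitialized on a crash-restart, so a shmem init
-- time that differs from postmaster start time suggests a crash
-- occurred, and pg_shmem_init_time() says when.
SELECT pg_postmaster_start_time() AS started_at,
       pg_shmem_init_time()       AS shmem_init_at,
       pg_shmem_init_time() <> pg_postmaster_start_time()
                                  AS crash_restart_detected;
```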
Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
On 28 February 2018 at 05:51, Thomas Munro wrote:
> On Wed, Jan 3, 2018 at 4:31 AM, Shubham Barai wrote:
>> I have created new isolation tests. Please have a look at the
>> updated patch.
>
> Hi Shubham,
>
> Could we please have a rebased version of the gin one?

Sure. I have attached a rebased version.

Regards,
Shubham

Attachment: Predicate-Locking-in-gin-index_v5.patch (binary data)
Re: [HACKERS] path toward faster partition pruning
On Tue, Feb 27, 2018 at 8:12 PM, Amit Langote wrote:
> Ah, OK. I was missing that there is no need to have both parttypcoll and
> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
> built from a bare PartitionKey (not PartitionSchemeData), and after that
> point, parttypcoll no longer needs to be kept around.
>
> I noticed that there is a typo in the patch.
>
> +        memcpy(part_scheme->partcollation, partkey->parttypcoll,
>
> s/parttypcoll/partcollation/g

Committed your version.

> BTW, should there be a relevant test in partition_join.sql? If yes,
> attached a patch (partitionwise-join-collation-test-1.patch) to add one.

I don't feel strongly about it, but I'm not going to try to prevent you from adding one, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Changing the autovacuum launcher scheduling; oldest table first algorithm
On 02/28/2018 12:04 PM, Masahiko Sawada wrote:
> Hi,
>
> I've created the new thread for changing the AV launcher scheduling.
>
> The problem of AV launcher scheduling is described on [1], but I
> summarize it here. If there is even one database that is at risk of
> wraparound, currently the AV launcher selects the database having the
> oldest datfrozenxid until all tables in that database have been
> vacuumed. This means that tables in other databases can bloat and not
> be frozen, because those databases are not selected by the AV launcher.
> It gets worse if the database has a large table that takes a long time
> to vacuum.
>
> The attached patch changes the AV launcher scheduling algorithm, based
> on the idea proposed by Robert, so that it selects first the database
> that has the oldest table when a database is at risk of wraparound. For
> details of the algorithm please refer to [2]. In this algorithm, the
> running AV workers advertise their next datfrozenxid in new shared
> memory. That way, the AV launcher can select the database that has the
> oldest table in the database cluster.
>
> However, this algorithm doesn't support the case where the age of the
> next datfrozenxid gets greater than autovacuum_vacuum_max_age. This can
> happen if users set autovacuum_vacuum_max_age to a small value and
> vacuuming the whole database takes a long time. But since it's not a
> common case and it doesn't degrade the current behavior even if this
> happens, I think it's not a big problem.
>
> Feedback is very welcome.
>
> [1] https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1F8A4DC6%40G01JPEXMBYT05
> [2] https://www.postgresql.org/message-id/CA%2BTgmobT3m%3D%2BdU5HF3VGVqiZ2O%2Bv6P5wN1Gj%2BPrq%2Bhj7dAm9AQ%40mail.gmail.com
>
> Regards,
>
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

Hello!

I've tried to compile your patch on Fedora 24 with gcc 6.3.1 20161221:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g3 -O0 -I../../../src/include -D_GNU_SOURCE -c -o autovacuum.o autovacuum.c
In file included from ../../../src/include/postgres.h:46:0,
                 from autovacuum.c:62:
autovacuum.c: In function 'get_next_xid_bounds':
autovacuum.c:1979:9: warning: implicit declaration of function 'TransactionIdIsVaild' [-Wimplicit-function-declaration]
  Assert(TransactionIdIsVaild(frozenxid));
autovacuum.c:1980:28: error: 'minmutli' undeclared (first use in this function)
  Assert(MultiXactIdIsValid(minmutli));
autovacuum.c:1980:28: note: each undeclared identifier is reported only once for each function it appears in
make[3]: *** [autovacuum.o] Error 1

(Both identifiers look like typos in the patch: presumably TransactionIdIsValid and minmulti were intended.)

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type
OK, time to revive this old thread ...

On 09/23/2017 05:27 PM, Tom Lane wrote:
> Tomas Vondra writes:
> [ scalarineqsel may fall over when used by extension operators ]
>
>> What about using a two-pronged approach:
>>
>> 1) fall back to mid bucket in back branches (9.3 - 10)
>> 2) do something smarter (along the lines you outlined) in PG11
>
> Sure. We need to test the fallback case anyway.

Attached is a minimal fix adding a flag to convert_numeric_to_scalar, tracking when it fails because of an unsupported data type. If any of the 3 calls (value + lo/hi boundaries) sets it to 'true', we simply fall back to the default estimate (0.5) within the bucket.

>>> [ sketch of a more extensible design ]
>
>> Sounds reasonable to me, I guess - I can't really think about anything
>> simpler giving us the same flexibility.
>
> Actually, on further thought, that's still too simple. If you look
> at convert_string_to_scalar() you'll see it's examining all three
> values concurrently (the probe value, of one datatype, and two bin
> boundary values of possibly a different type). The reason is that
> it's stripping off whatever common prefix those strings have before
> trying to form a numeric equivalent. While certainly
> convert_string_to_scalar is pretty stupid in the face of non-ASCII
> sort orders, the prefix-stripping is something I really don't want
> to give up. So the design I sketched of considering each value
> totally independently isn't good enough.
>
> We could, perhaps, provide an API that lets an operator estimation
> function replace convert_to_scalar in toto, but that seems like
> you'd still end up duplicating code in many cases. Not sure about
> how to find a happy medium.

I plan to work on this improvement next, once I polish a couple of other patches for the upcoming commit fest.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323..5f6019a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -176,7 +176,7 @@ static bool estimate_multivariate_ndistinct(PlannerInfo *root,
 static bool convert_to_scalar(Datum value, Oid valuetypid, double *scaledvalue,
 			  Datum lobound, Datum hibound, Oid boundstypid,
 			  double *scaledlobound, double *scaledhibound);
-static double convert_numeric_to_scalar(Datum value, Oid typid);
+static double convert_numeric_to_scalar(Datum value, Oid typid, bool *unknown_type);
 static void convert_string_to_scalar(char *value,
 						 double *scaledvalue,
 						 char *lobound,
@@ -4071,10 +4071,15 @@ convert_to_scalar(Datum value, Oid valuetypid, double *scaledvalue,
 		case REGDICTIONARYOID:
 		case REGROLEOID:
 		case REGNAMESPACEOID:
-			*scaledvalue = convert_numeric_to_scalar(value, valuetypid);
-			*scaledlobound = convert_numeric_to_scalar(lobound, boundstypid);
-			*scaledhibound = convert_numeric_to_scalar(hibound, boundstypid);
-			return true;
+			{
+				bool	unknown_type = false;
+
+				*scaledvalue = convert_numeric_to_scalar(value, valuetypid, &unknown_type);
+				*scaledlobound = convert_numeric_to_scalar(lobound, boundstypid, &unknown_type);
+				*scaledhibound = convert_numeric_to_scalar(hibound, boundstypid, &unknown_type);
+
+				return (!unknown_type);
+			}

 		/*
 		 * Built-in string types
@@ -4147,7 +4152,7 @@ convert_to_scalar(Datum value, Oid valuetypid, double *scaledvalue,
  * Do convert_to_scalar()'s work for any numeric data type.
  */
 static double
-convert_numeric_to_scalar(Datum value, Oid typid)
+convert_numeric_to_scalar(Datum value, Oid typid, bool *unknown_type)
 {
 	switch (typid)
 	{
@@ -4185,9 +4190,11 @@ convert_numeric_to_scalar(Datum value, Oid typid)

 	/*
 	 * Can't get here unless someone tries to use scalarineqsel() on an
-	 * operator with one numeric and one non-numeric operand.
+	 * operator with one numeric and one non-numeric operand. Or more
+	 * precisely, operand that we don't know how to transform to scalar.
 	 */
-	elog(ERROR, "unsupported type: %u", typid);
+	*unknown_type = true;
+	return 0;
 }
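The control flow of the fallback can be sketched in a few lines of self-contained C (illustrative names, not the actual selfuncs.c code): the converter reports failure through an out-parameter instead of raising an error, and the caller substitutes the mid-bucket default of 0.5.

```c
#include <stdbool.h>

#define KNOWN_TYPE 23			/* stand-in for a type OID the converter understands */

/* Sketch of the patched converter: flag unknown types instead of elog(ERROR). */
static double
convert_numeric_to_scalar_sketch(double value, unsigned typid, bool *unknown_type)
{
	if (typid == KNOWN_TYPE)
		return value;
	*unknown_type = true;		/* let the caller decide what to do */
	return 0;
}

/* Sketch of the caller: fall back to the default within-bucket position. */
static double
bucket_fraction_sketch(double value, double lo, double hi, unsigned typid)
{
	bool		unknown_type = false;
	double		v = convert_numeric_to_scalar_sketch(value, typid, &unknown_type);
	double		l = convert_numeric_to_scalar_sketch(lo, typid, &unknown_type);
	double		h = convert_numeric_to_scalar_sketch(hi, typid, &unknown_type);

	if (unknown_type || h <= l)
		return 0.5;				/* mid-bucket default, as in the back-branch fix */
	return (v - l) / (h - l);
}
```

The point of the shape is that any one of the three conversions failing poisons the whole estimate, which then degrades gracefully rather than aborting the query.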
Re: [HACKERS] MERGE SQL Statement for PG11
On Tue, Feb 27, 2018 at 5:07 PM, Peter Geoghegan wrote:
> I now feel like Simon's suggestion of throwing an error in corner
> cases isn't so bad. It still seems like we could do better, but the
> more I think about it, the less that seems like a cop-out. My reasons
> are:

I still think we really ought to try not to add a new class of error.

> * We can all agree that *not* raising an error in the specific way
> Simon proposes is possible, somehow or other. We also all agree that
> avoiding the broader category of RC errors can only be taken so far
> (e.g. in any event duplicate violation errors are entirely possible,
> in RC mode, when a MERGE inserts a row). So this is a question of what
> exact middle ground to take. Neither of the two extremes (throwing an
> error on the first sign of an RC conflict, and magically preventing
> concurrency anomalies) are actually on the table.

Just because there's no certainty about which behavior is best doesn't mean that none of them are better than throwing an error.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Registering LWTRANCHE_PARALLEL_HASH_JOIN
On Tue, Feb 27, 2018 at 3:58 PM, Thomas Munro wrote:
> On Wed, Feb 28, 2018 at 8:39 AM, Robert Haas wrote:
>> On Sat, Feb 10, 2018 at 6:07 PM, Thomas Munro wrote:
>>> I forgot to register a display name for LWTRANCHE_PARALLEL_HASH_JOIN,
>>> the tranche ID used by the LWLock that Parallel Hash uses when handing
>>> out chunks of memory. Please see attached.
>>
>> I think that you need to insert some weasel words into the
>> documentation for this, because I don't think it's really accurate to
>> say that it's only used when trying to acquire a new chunk of memory.
>>
>> Or maybe I'm wrong and it's altogether accurate ... but
>> ExecParallelHashMergeCounters doesn't look like an allocation to me,
>> and ExecParallelHashTuplePrealloc doesn't really look like an
>> allocation either.
>
> Ok. How about this?
>
> I noticed that some of the descriptions don't attempt to explain what
> activity the lock protects at all, they just say "Waiting for $BLAH
> lock". I went the other way and covered the various different uses.
> There are 4 uses for the lock but only three things in my list,
> because I think "allocate" covers both ExecParallelHashTupleAlloc()
> and ExecParallelHashTuplePrealloc().

Well, the trouble with that of course is that if something changes later then we have to update the docs, whereas if we keep it vague then we avoid that. But I've committed that version as you have it, and maybe by the time it needs to be updated they'll have made you a committer and you can be the poor shmuck who has to spend time committing fixes like this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Online enabling of checksums
On Sun, Feb 25, 2018 at 9:54 AM, Magnus Hagander wrote:
> Also if that wasn't clear -- we only do the full page write if there isn't
> already a checksum on the page and that checksum is correct.

Hmm. Suppose that on the master there is a checksum on the page and that checksum is correct, but on the standby the page contents differ in some way that we don't always WAL-log, such as hint bits, and there the checksum is incorrect. Then you'll enable checksums when the standby still has some pages without valid checksums, and disaster will ensue.

I think this could be hard to prevent if checksums are turned on and off multiple times.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Parallel index creation & pg_stat_activity
Hi Peter,

On 2018-02-28 16:50:44 +0000, Phil Florent wrote:
> With an index creation (create index t1_i1 on t1(c1, c2);) I have this kind
> of output :
>
> ./t -d 20 -o "pid, backend_type, query, wait_event_type, wait_event"
>
>  busy_pc | distinct_exe | pid  |  backend_type  |               query                | wait_event_type |  wait_event
> ---------+--------------+------+----------------+------------------------------------+-----------------+--------------
>       68 | 1 / 136      | 8262 | client backend | create index t1_i1 on t1(c1, c2); | IO              | DataFileRead
>       26 | 1 / 53       | 8262 | client backend | create index t1_i1 on t1(c1, c2); |                 |
>        6 | 1 / 11       | 8262 | client backend | create index t1_i1 on t1(c1, c2); | IO              | BufFileWrite
> (3 rows)
>
> No parallel worker. At least one parallel worker was active though, I could
> see its work with a direct query on pg_stat_activity or a ps -ef :
>
> ...
> postgres  8262  8230  7 08:54 ?  00:22:46 postgres: 11/main: postgres postgres [local] CREATE INDEX
> ...
> postgres  9833  8230 23 14:17 ?  00:00:33 postgres: 11/main: parallel worker for PID 8262
> ...

Looks like we're not doing a pgstat_report_activity() in the workers? Any argument for not doing so?

Greetings,

Andres Freund
Re: [HACKERS] path toward faster partition pruning
On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote wrote:
> On 2018/02/28 1:05, Robert Haas wrote:
>> On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote wrote:
>>> You may say that partition bounds might have to be different too in this
>>> case and hence partition-wise join won't occur anyway, but I'm wondering
>>> if the mismatch of partcollation itself isn't enough to conclude that?
>>
>> Yeah, you're right. I think that this is just a bug in partition-wise
>> join, and that the partition scheme should just be using partcollation
>> instead of parttypcoll, as in the attached.
>
> Ah, OK. I was missing that there is no need to have both parttypcoll and
> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
> built from a bare PartitionKey (not PartitionSchemeData), and after that
> point, parttypcoll no longer needs to be kept around.

Yes. That's right.

> I noticed that there is a typo in the patch.
>
> +        memcpy(part_scheme->partcollation, partkey->parttypcoll,
>
> s/parttypcoll/partcollation/g
>
> BTW, should there be a relevant test in partition_join.sql? If yes,
> attached a patch (partitionwise-join-collation-test-1.patch) to add one.

A partition-wise join path will be created but discarded because of higher cost. This test won't see it in that case. So, please add some data like other tests and add a command to analyze the partitioned tables. That kind of protects from something like that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] path toward faster partition pruning
On Tue, Feb 27, 2018 at 3:03 PM, Amit Langote wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested to OR clauses handling (in partprune.c) that
> he posted as a separate patch on the run-time pruning thread [1].

Some comments on 0001.

 		partnatts != part_scheme->partnatts)
 			continue;

-		/* Match the partition key types. */
+		/*
+		 * Match the partition key types and partitioning-specific collations.
+		 */

We are comparing opfamily and opclass input type as well, but this comment doesn't explicitly mention those like it mentions collation. Instead, I think we should just say, "Match partition key type properties".

You had commented on the "advanced partition matching" code about asserting that partsupfuncs also match. Robert too has expressed a similar opinion upthread. Maybe we should at least try to assert that the function OIDs match.

-	Oid		   *parttypcoll;	/* OIDs of collations of partition keys. */
+
+	/*
+	 * We store both the collation implied by the partition key's type and the
+	 * one specified for partitioning. Values in the former are used as
+	 * varcollid in the Vars corresponding to simple column partition keys so
+	 * as to make them match corresponding Vars appearing elsewhere in the
+	 * query tree. Whereas, the latter is used when actually comparing values
+	 * against partition bounds datums, such as, when doing partition pruning.
+	 */
+	Oid		   *parttypcoll;
+	Oid		   *partcollation;

As you have already mentioned upthread, only partcollation is needed, not parttypcoll.

 	/* Cached information about partition key data types. */
 	int16	   *parttyplen;
 	bool	   *parttypbyval;
+
+	/*
+	 * Cached array of partitioning comparison functions' fmgr structs. We
+	 * don't compare these when trying to match two partition schemes.
+	 */

I think this comment should go away. The second sentence doesn't explain why, and if it did, it should do that in find_partition_scheme(), not here. partsupfunc is another property of partition keys that is cached like parttyplen and parttypbyval. Why does it deserve a separate comment when others don't?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company