inserts into partitioned table may cause crash

2018-02-28 Thread Amit Langote
I've run into what seems to be a bug in ExecInsert() that causes a crash
when inserting multiple rows into a partitioned table that each go into
different partitions with different tuple descriptors.  Crash occurs if
ExecInsert() returns without resetting estate->es_result_relation_info
back to the root table's resultRelInfo.  For example, if a BR trigger on a
partition returns NULL for a row.

Crashing example:

create table p (a int, b text) partition by list (a);
create table p12 (b text, a int);

-- p12 has different tuple descriptor than p
alter table p attach partition p12 for values in (1, 2);

create table p4 partition of p for values in (4);

create function blackhole () returns trigger as $$ begin return NULL; end;
$$ language plpgsql;
create trigger blackhole before insert on p12 for each row execute
procedure blackhole();

insert into p values (1, 'b'), (4, 'a');
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Crash is caused because we enter into ExecFindPartition with p12's
resultRelInfo as if it corresponds to the root table.  That happens
because, before returning as shown below, we didn't reset
estate->es_result_relation_info (which had been set to p12's
resultRelInfo) back to the original resultRelInfo (that is, p's):

    slot = ExecIRInsertTriggers(estate, resultRelInfo, slot);

    if (slot == NULL)   /* "do nothing" */
        return NULL;

There are other places where we prematurely return like that.

Attached a patch to fix that, which would need to be back-patched to 10.

Thanks,
Amit
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c32928d9bd..0d6326d1de 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -281,7 +281,7 @@ ExecInsert(ModifyTableState *mtstate,
/*
 * get information on the (current) result relation
 */
-   resultRelInfo = estate->es_result_relation_info;
+   saved_resultRelInfo = resultRelInfo = estate->es_result_relation_info;
 
/* Determine the partition to heap_insert the tuple into */
if (mtstate->mt_partition_tuple_routing)
@@ -408,7 +408,10 @@ ExecInsert(ModifyTableState *mtstate,
slot = ExecBRInsertTriggers(estate, resultRelInfo, slot);
 
if (slot == NULL)   /* "do nothing" */
-   return NULL;
+   {
+   result = NULL;
+   goto restore_estate_result_rel;
+   }
 
/* trigger might have changed tuple */
tuple = ExecMaterializeSlot(slot);
@@ -421,7 +424,10 @@ ExecInsert(ModifyTableState *mtstate,
slot = ExecIRInsertTriggers(estate, resultRelInfo, slot);
 
if (slot == NULL)   /* "do nothing" */
-   return NULL;
+   {
+   result = NULL;
+   goto restore_estate_result_rel;
+   }
 
/* trigger might have changed tuple */
tuple = ExecMaterializeSlot(slot);
@@ -439,7 +445,10 @@ ExecInsert(ModifyTableState *mtstate,

   planSlot);
 
if (slot == NULL)   /* "do nothing" */
-   return NULL;
+   {
+   result = NULL;
+   goto restore_estate_result_rel;
+   }
 
/* FDW might have changed tuple */
tuple = ExecMaterializeSlot(slot);
@@ -495,7 +504,7 @@ ExecInsert(ModifyTableState *mtstate,
 * No need though if the tuple has been routed, and a BR trigger
 * doesn't exist.
 */
-   if (saved_resultRelInfo != NULL &&
+   if (saved_resultRelInfo != resultRelInfo &&
!(resultRelInfo->ri_TrigDesc &&
  resultRelInfo->ri_TrigDesc->trig_insert_before_row))
check_partition_constr = false;
@@ -544,7 +553,8 @@ ExecInsert(ModifyTableState *mtstate,

 							 estate, canSetTag, &returning))
{

 			InstrCountFiltered2(&mtstate->ps, 1);
-   return returning;
+   result = returning;
+   goto restore_estate_result_rel;
}
else
goto vlock;
@@ -559,7 +569,8 @@ 

Re: Wait event names mismatch: oldserxid

2018-02-28 Thread Michael Paquier
On Tue, Feb 27, 2018 at 02:44:37PM -0500, Robert Haas wrote:
> On Fri, Feb 9, 2018 at 8:53 AM, Michael Paquier  wrote:
>> So the docs look correct to me on this side.  What I find weird is the
>> phrasing used to define oldserxid.  The current description reads:
>> Waiting to I/O on an oldserxid buffer.
>> I would suggest "waiting *for* I/O" instead.
>>
>> A small patch is attached.
> 
> Agreed.  Committed.

Thanks, Robert.
--
Michael


signature.asc
Description: PGP signature


Re: inserts into partitioned table may cause crash

2018-02-28 Thread Amit Langote
On 2018/02/28 17:36, Amit Langote wrote:
> I've run into what seems to be a bug in ExecInsert() that causes a crash
> when inserting multiple rows into a partitioned table that each go into
> different partitions with different tuple descriptors.  Crash occurs if
> ExecInsert() returns without resetting estate->es_result_relation_info
> back to the root table's resultRelInfo.  For example, if a BR trigger on a
> partition returns NULL for a row.
> 
> Crashing example:
> 
> create table p (a int, b text) partition by list (a);
> create table p12 (b text, a int);
> 
> -- p12 has different tuple descriptor than p
> alter table p attach partition p12 for values in (1, 2);
> 
> create table p4 partition of p for values in (4);
> 
> create function blackhole () returns trigger as $$ begin return NULL; end;
> $$ language plpgsql;
> create trigger blackhole before insert on p12 for each row execute
> procedure blackhole();
> 
> insert into p values (1, 'b'), (4, 'a');
> server closed the connection unexpectedly
>   This probably means the server terminated abnormally
>   before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> 
> Crash is caused because we enter into ExecFindPartition with p12's
> resultRelInfo as if it corresponds to the root table.  That happens
> because, before returning as shown below, we didn't reset
> estate->es_result_relation_info (which had been set to p12's
> resultRelInfo) back to the original resultRelInfo (that is, p's):
> 
>     slot = ExecIRInsertTriggers(estate, resultRelInfo, slot);
> 
>     if (slot == NULL)   /* "do nothing" */
>         return NULL;
> 
> There are other places where we prematurely return like that.
> 
> Attached a patch to fix that, which would need to be back-patched to 10.

BTW, this won't crash if the tuple descriptors match, but one would get an
unintuitive error like this:

create table p (a int, b text) partition by list (a);
create table p12 partition of p for values in (1, 2);
create table p4 partition of p for values in (4);
create trigger blackhole before insert on p12 for each row execute
procedure blackhole();

insert into p values (1, 'a'), (4, 'a');
ERROR:  new row for relation "p12" violates partition constraint
DETAIL:  Failing row contains (4, a).

That is, after trying to insert (4, 'a') into p12 as if it were the root.

Still, it's a bug all the same.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-02-28 Thread David Rowley
On 27 February 2018 at 22:33, Amit Langote
 wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested to OR clauses handling (in partprune.c) that
> he posted as a separate patch on the run-time pruning thread [1].

Thanks for fixing that up and including it.

Micro review of v34:

1. Looks like you've renamed the parttypid parameter in the definitions
of partkey_datum_from_expr and partition_cmp_args, but not updated the
declarations to match.

+static bool partkey_datum_from_expr(Oid parttypid, Expr *expr, Datum *value);

+static bool
+partkey_datum_from_expr(Oid partopcintype, Expr *expr, Datum *value)

+static bool partition_cmp_args(Oid parttypid, Oid partopfamily,
+PartClause *pc, PartClause *leftarg, PartClause *rightarg,
+bool *result);

+static bool
+partition_cmp_args(Oid partopcintype, Oid partopfamily,
+PartClause *pc, PartClause *leftarg, PartClause *rightarg,
+bool *result)

2. In prune_append_rel_partitions(), it's not exactly illegal, but int
i is declared twice in different scopes. Looks like there's no need
for the inner one.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Function to track shmem reinit time

2018-02-28 Thread Anastasia Lubennikova

Attached patch introduces a new function pg_shmem_init_time(),
which returns the time shared memory was last (re)initialized.
It is created for use by monitoring tools to track backend crashes.

Currently, if the 'restart_after_crash' option is on, postgres will just
restart, and the only way to know that it happened is to regularly parse
or monitor the logfile, catching restart messages.  This approach is
really inconvenient for users who have gigabytes of logs.

This new function can be periodically called by a monitoring agent, and,
if shmem_init_time doesn't match pg_postmaster_start_time, we know that
the server crashed and restarted, and we also know the exact time when.
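As a sketch, a monitoring agent could run a check like this (hypothetical usage, assuming the function as proposed in the patch):

```sql
-- Hypothetical check using the proposed function: true means the
-- postmaster reinitialized shared memory after a backend crash.
SELECT pg_shmem_init_time() <> pg_postmaster_start_time() AS crash_restarted,
       pg_shmem_init_time() AS last_reinit_time;
```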

Also, while working on this patch, I noticed a bit of dead code and some
discordant comments in postmaster.c.  I see no reason to leave them as is,
so there is also a small remove_dead_shmem_reinit_code_v0.patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit d396ef1ccf12be031c1cf90cc5a357da4933a3b8
Author: Anastasia 
Date:   Wed Feb 28 13:50:18 2018 +0300

Remove some dead code related to shmem reinit. Fix outdated comments.

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f3ddf82..a397260 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -223,11 +223,10 @@ static char ExtraOptions[MAXPGPATH];
 /*
  * These globals control the behavior of the postmaster in case some
  * backend dumps core.  Normally, it kills all peers of the dead backend
- * and reinitializes shared memory.  By specifying -s or -n, we can have
+ * and reinitializes shared memory.  By specifying -T, we can have
  * the postmaster stop (rather than kill) peers and not reinitialize
- * shared data structures.  (Reinit is currently dead code, though.)
+ * shared data structures.  This lets us collect core dumps from all processes.
  */
-static bool Reinit = true;
 static int	SendStop = false;
 
 /* still more option variables */
@@ -738,11 +737,6 @@ PostmasterMain(int argc, char *argv[])
 SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 break;
 
-			case 'n':
-/* Don't reinit shared mem after abnormal exit */
-Reinit = false;
-break;
-
 			case 'O':
 SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
 break;
@@ -3427,7 +3421,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 			 *
 			 * SIGQUIT is the special signal that says exit without proc_exit
 			 * and let the user know what's going on. But if SendStop is set
-			 * (-s on command line), then we send SIGSTOP instead, so that we
+			 * (-T on command line), then we send SIGSTOP instead, so that we
 			 * can get core dumps from all backends by hand.
 			 *
 			 * We could exclude dead_end children here, but at least in the
commit 13817de092ee53f2af2d73270ddc3e02556d7c0c
Author: Anastasia 
Date:   Wed Feb 28 14:04:00 2018 +0300

Add function pg_shmem_init_time() which returns the time shared memory was last initialized

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2f59af2..bebdddf 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15924,6 +15924,12 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
   
 
   
+   pg_shmem_init_time()
+   timestamp with time zone
+   shared memory initialization time
+  
+
+  
pg_current_logfile(text)
text
Primary log file name, or log in the requested format,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a397260..7cc9780 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -512,6 +512,7 @@ typedef struct
 	pid_t		PostmasterPid;
 	TimestampTz PgStartTime;
 	TimestampTz PgReloadTime;
+	TimestampTz PgShmemInitTime;
 	pg_time_t	first_syslogger_file_time;
 	bool		redirection_done;
 	bool		IsBinaryUpgrade;
@@ -2555,6 +2556,8 @@ reset_shared(int port)
 	 * objects if the postmaster crashes and is restarted.
 	 */
 	CreateSharedMemoryAndSemaphores(false, port);
+
+	PgShmemInitTime = GetCurrentTimestamp();
 }
 
 
@@ -6058,6 +6061,7 @@ save_backend_variables(BackendParameters *param, Port *port,
 	param->PostmasterPid = PostmasterPid;
 	param->PgStartTime = PgStartTime;
 	param->PgReloadTime = PgReloadTime;
+	param->PgShmemInitTime = PgShmemInitTime;
 	param->first_syslogger_file_time = first_syslogger_file_time;
 
 	param->redirection_done = redirection_done;
@@ -6290,6 +6294,7 @@ restore_backend_variables(BackendParameters *param, Port *port)
 	PostmasterPid = param->PostmasterPid;
 	PgStartTime = param->PgStartTime;
 	PgReloadTime = param->PgReloadTime;
+	PgShmemInitTime = param->PgShmemInitTime;
 	first_syslogger_file_time = param->first_syslogger_file_time;
 
 	redirection_done = 

Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2018-02-28 Thread Shubham Barai
On 28 February 2018 at 05:51, Thomas Munro 
wrote:

> On Wed, Jan 3, 2018 at 4:31 AM, Shubham Barai 
> wrote:
> > I have created new isolation tests. Please have a look at
> > updated patch.
>
> Hi Shubham,
>
> Could we please have a rebased version of the gin one?
>

Sure. I have attached a rebased version.

Regards,
Shubham


Predicate-Locking-in-gin-index_v5.patch
Description: Binary data


Re: [HACKERS] path toward faster partition pruning

2018-02-28 Thread Robert Haas
On Tue, Feb 27, 2018 at 8:12 PM, Amit Langote
 wrote:
> Ah, OK. I was missing that there is no need to have both parttypcoll and
> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
> built from a bare PartitionKey (not PartitionSchemeData), and after that
> point, parttypcoll no longer needs to be kept around.
>
> I noticed that there is a typo in the patch.
>
> +   memcpy(part_scheme->partcollation, partkey->parttypcoll,
>
> s/parttypcoll/partcollation/g

Committed your version.

> BTW, should there be a relevant test in partition_join.sql?  If yes,
> attached a patch (partitionwise-join-collation-test-1.patch) to add one.

I don't feel strongly about it, but I'm not going to try to prevent
you from adding one, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changing the autovacuum launcher scheduling; oldest table first algorithm

2018-02-28 Thread Grigory Smolkin



On 02/28/2018 12:04 PM, Masahiko Sawada wrote:

Hi,

I've created a new thread for changing the AV launcher scheduling.
The problem with AV launcher scheduling is described in [1], but I
summarize it here.

If there is even one database that is at risk of wraparound, currently
the AV launcher selects the database having the oldest datfrozenxid until
all tables in that database have been vacuumed. This means that tables
in other databases can bloat and go unfrozen, because those databases
are never selected by the AV launcher. It makes things worse if the
database has a large table that takes a long time to vacuum.

Attached patch changes the AV launcher scheduling algorithm based on
the proposed idea by Robert so that it selects a database first that
has the oldest table when the database is at risk of wraparound. For
detail of the algorithm please refer to [2].

In this algorithm, the running AV workers advertise the next
datfrozenxid in shared memory (in a new structure we will add).  That
way, the AV launcher can select the database that has the oldest table
in the database cluster.  However, this algorithm doesn't support the
case where the age of the next datfrozenxid gets greater than
autovacuum_freeze_max_age.  This case can happen if users set
autovacuum_freeze_max_age to a small value and vacuuming the whole
database takes a long time.  But since it's not a common case, and it
doesn't behave worse than the current behavior even if it happens, I
think it's not a big problem.

Feedback is very welcome.

[1] 
https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1F8A4DC6%40G01JPEXMBYT05
[2] 
https://www.postgresql.org/message-id/CA%2BTgmobT3m%3D%2BdU5HF3VGVqiZ2O%2Bv6P5wN1Gj%2BPrq%2Bhj7dAm9AQ%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Hello!
I've tried to compile your patch on Fedora 24 with gcc 6.3.1 20161221:

gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -g -g3 -O0 -I../../../src/include  
-D_GNU_SOURCE   -c -o autovacuum.o autovacuum.c

In file included from ../../../src/include/postgres.h:46:0,
 from autovacuum.c:62:
autovacuum.c: In function ‘get_next_xid_bounds’:
autovacuum.c:1979:9: warning: implicit declaration of function 
‘TransactionIdIsVaild’ [-Wimplicit-function-declaration]

  Assert(TransactionIdIsVaild(frozenxid));
 ^
../../../src/include/c.h:713:7: note: in definition of macro ‘Trap’
   if (condition) \
   ^
autovacuum.c:1979:2: note: in expansion of macro ‘Assert’
  Assert(TransactionIdIsVaild(frozenxid));
  ^~
autovacuum.c:1980:28: error: ‘minmutli’ undeclared (first use in this 
function)

  Assert(MultiXactIdIsValid(minmutli));
^
../../../src/include/c.h:713:7: note: in definition of macro ‘Trap’
   if (condition) \
   ^
autovacuum.c:1980:2: note: in expansion of macro ‘Assert’
  Assert(MultiXactIdIsValid(minmutli));
  ^~
autovacuum.c:1980:9: note: in expansion of macro ‘MultiXactIdIsValid’
  Assert(MultiXactIdIsValid(minmutli));
 ^~
autovacuum.c:1980:28: note: each undeclared identifier is reported only 
once for each function it appears in

  Assert(MultiXactIdIsValid(minmutli));
^
../../../src/include/c.h:713:7: note: in definition of macro ‘Trap’
   if (condition) \
   ^
autovacuum.c:1980:2: note: in expansion of macro ‘Assert’
  Assert(MultiXactIdIsValid(minmutli));
  ^~
autovacuum.c:1980:9: note: in expansion of macro ‘MultiXactIdIsValid’
  Assert(MultiXactIdIsValid(minmutli));
 ^~
: recipe for target 'autovacuum.o' failed
make[3]: *** [autovacuum.o] Error 1

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type

2018-02-28 Thread Tomas Vondra
OK, time to revive this old thread ...

On 09/23/2017 05:27 PM, Tom Lane wrote:
> Tomas Vondra  writes:
 [ scalarineqsel may fall over when used by extension operators ]
> 
>> What about using two-pronged approach:
> 
>> 1) fall back to mid bucket in back branches (9.3 - 10)
>> 2) do something smarter (along the lines you outlined) in PG11
> 
> Sure.  We need to test the fallback case anyway.
> 

Attached is a minimal fix adding a flag to convert_numeric_to_scalar,
tracking when it fails because of an unsupported data type.  If any of
the three calls (value + lo/hi boundaries) sets it to 'true', we simply
fall back to the default estimate (0.5) within the bucket.

>>> [ sketch of a more extensible design ]
> 
>> Sounds reasonable to me, I guess - I can't really think about anything
>> simpler giving us the same flexibility.
> 
> Actually, on further thought, that's still too simple.  If you look
> at convert_string_to_scalar() you'll see it's examining all three
> values concurrently (the probe value, of one datatype, and two bin
> boundary values of possibly a different type).  The reason is that
> it's stripping off whatever common prefix those strings have before
> trying to form a numeric equivalent.  While certainly
> convert_string_to_scalar is pretty stupid in the face of non-ASCII
> sort orders, the prefix-stripping is something I really don't want
> to give up.  So the design I sketched of considering each value
> totally independently isn't good enough.
> 
> We could, perhaps, provide an API that lets an operator estimation
> function replace convert_to_scalar in toto, but that seems like
> you'd still end up duplicating code in many cases.  Not sure about
> how to find a happy medium.
> 

I plan to work on this improvement next, once I polish a couple of other
patches for the upcoming commit fest.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323..5f6019a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -176,7 +176,7 @@ static bool estimate_multivariate_ndistinct(PlannerInfo *root,
 static bool convert_to_scalar(Datum value, Oid valuetypid, double *scaledvalue,
   Datum lobound, Datum hibound, Oid boundstypid,
   double *scaledlobound, double *scaledhibound);
-static double convert_numeric_to_scalar(Datum value, Oid typid);
+static double convert_numeric_to_scalar(Datum value, Oid typid, bool *unknown_type);
 static void convert_string_to_scalar(char *value,
 		 double *scaledvalue,
 		 char *lobound,
@@ -4071,10 +4071,15 @@ convert_to_scalar(Datum value, Oid valuetypid, double *scaledvalue,
 		case REGDICTIONARYOID:
 		case REGROLEOID:
 		case REGNAMESPACEOID:
-			*scaledvalue = convert_numeric_to_scalar(value, valuetypid);
-			*scaledlobound = convert_numeric_to_scalar(lobound, boundstypid);
-			*scaledhibound = convert_numeric_to_scalar(hibound, boundstypid);
-			return true;
+		{
+			bool	unknown_type = false;
+
+			*scaledvalue = convert_numeric_to_scalar(value, valuetypid, &unknown_type);
+			*scaledlobound = convert_numeric_to_scalar(lobound, boundstypid, &unknown_type);
+			*scaledhibound = convert_numeric_to_scalar(hibound, boundstypid, &unknown_type);
+
+			return (!unknown_type);
+		}
 
 			/*
 			 * Built-in string types
@@ -4147,7 +4152,7 @@ convert_to_scalar(Datum value, Oid valuetypid, double *scaledvalue,
  * Do convert_to_scalar()'s work for any numeric data type.
  */
 static double
-convert_numeric_to_scalar(Datum value, Oid typid)
+convert_numeric_to_scalar(Datum value, Oid typid, bool *unknown_type)
 {
 	switch (typid)
 	{
@@ -4185,9 +4190,11 @@ convert_numeric_to_scalar(Datum value, Oid typid)
 
 	/*
 	 * Can't get here unless someone tries to use scalarineqsel() on an
-	 * operator with one numeric and one non-numeric operand.
+	 * operator with one numeric and one non-numeric operand.  Or, more
+	 * precisely, an operand that we don't know how to transform to a scalar.
 	 */
-	elog(ERROR, "unsupported type: %u", typid);
+	*unknown_type = true;
+
 	return 0;
 }
 


Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-28 Thread Robert Haas
On Tue, Feb 27, 2018 at 5:07 PM, Peter Geoghegan  wrote:
> I now feel like Simon's suggestion of throwing an error in corner
> cases isn't so bad. It still seems like we could do better, but the
> more I think about it, the less that seems like a cop-out. My reasons
> are:

I still think we really ought to try not to add a new class of error.

> * We can all agree that *not* raising an error in the specific way
> Simon proposes is possible, somehow or other. We also all agree that
> avoiding the broader category of RC errors can only be taken so far
> (e.g. in any event duplicate violations errors are entirely possible,
> in RC mode, when a MERGE inserts a row). So this is a question of what
> exact middle ground to take. Neither of the two extremes (throwing an
> error on the first sign of a RC conflict, and magically preventing
> concurrency anomalies) are actually on the table.

Just because there's no certainty about which behavior is best doesn't
mean that none of them are better than throwing an error.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Registering LWTRANCHE_PARALLEL_HASH_JOIN

2018-02-28 Thread Robert Haas
On Tue, Feb 27, 2018 at 3:58 PM, Thomas Munro
 wrote:
> On Wed, Feb 28, 2018 at 8:39 AM, Robert Haas  wrote:
>> On Sat, Feb 10, 2018 at 6:07 PM, Thomas Munro
>>  wrote:
>>> I forgot to register a display name for LWTRANCHE_PARALLEL_HASH_JOIN,
>>> the tranche ID used by the LWLock that Parallel Hash uses when handing
>>> out chunks of memory.  Please see attached.
>>
>> I think that you need to insert some weasel words into the
>> documentation for this, because I don't think it's really accurate to
>> say that it's only used when trying to acquire a new chunk of memory.
>>
>> Or maybe I'm wrong and it's altogether accurate ... but
>> ExecParallelHashMergeCounters doesn't look like an allocation to me,
>> and ExecParallelHashTuplePrealloc doesn't really look like an
>> allocation either.
>
> Ok.  How about this?
>
> I noticed that some of the descriptions don't attempt to explain what
> activity the lock protects at all, they just say "Waiting for $BLAH
> lock".  I went the other way and covered the various different uses.
> There are 4 uses for the lock but only three things in my list,
> because I think "allocate" covers both ExecParallelHashTupleAlloc()
> and ExecParallelHashTuplePrealloc().

Well, the trouble with that of course is that if something changes
later then we have to update the docs, whereas if we keep it vague
then we avoid that.  But I've committed that version as you have it
and maybe by the time it needs to be updated they'll have made you a
committer and you can be the poor shmuck who has to spend time
committing fixes like this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online enabling of checksums

2018-02-28 Thread Robert Haas
On Sun, Feb 25, 2018 at 9:54 AM, Magnus Hagander  wrote:
> Also if that wasn't clear -- we only do the full page write if the page
> doesn't already carry a correct checksum.

Hmm.

Suppose that on the master there is a checksum on the page and that
checksum is correct, but on the standby the page contents differ in
some way that we don't always WAL-log, like as to hint bits, and there
the checksum is incorrect.  Then you'll enable checksums when the
standby still has some pages without valid checksums, and disaster
will ensue.

I think this could be hard to prevent if checksums are turned on and
off multiple times.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel index creation & pg_stat_activity

2018-02-28 Thread Andres Freund
Hi Peter,

On 2018-02-28 16:50:44 +, Phil Florent wrote:
> With an index creation (create index t1_i1 on t1(c1, c2);) I have this kind 
> of output :
> 
> ./t -d 20 -o "pid, backend_type, query, wait_event_type, wait_event"
>  busy_pc | distinct_exe | pid  |  backend_type  |               query                | wait_event_type |  wait_event
> ---------+--------------+------+----------------+------------------------------------+-----------------+--------------
>       68 | 1 / 136      | 8262 | client backend | create index t1_i1 on t1(c1, c2); | IO              | DataFileRead
>       26 | 1 / 53       | 8262 | client backend | create index t1_i1 on t1(c1, c2); |                 |
>        6 | 1 / 11       | 8262 | client backend | create index t1_i1 on t1(c1, c2); | IO              | BufFileWrite
> (3 rows)

> No parallel worker appears.  At least one parallel worker was active,
> though; I could see its work with a direct query on pg_stat_activity or
> with ps -ef:
> 
> ...
> postgres  8262  8230  7 08:54 ?00:22:46 postgres: 11/main: postgres 
> postgres [local] CREATE INDEX
> ...
> postgres  9833  8230 23 14:17 ?00:00:33 postgres: 11/main: parallel 
> worker for PID 8262
> ...

Looks like we're not doing a pgstat_report_activity() in the workers?
Any argument for not doing so?

Greetings,

Andres Freund



Re: [HACKERS] path toward faster partition pruning

2018-02-28 Thread Ashutosh Bapat
On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote
 wrote:
> On 2018/02/28 1:05, Robert Haas wrote:
>> On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote
>>  wrote:
>>> You may say that partition bounds might have to be different too in this
>>> case and hence partition-wise join won't occur anyway, but I'm wondering
>>> if the mismatch of partcollation itself isn't enough to conclude that?
>>
>> Yeah, you're right.  I think that this is just a bug in partition-wise
>> join, and that the partition scheme should just be using partcollation
>> instead of parttypcoll, as in the attached.
>
> Ah, OK. I was missing that there is no need to have both parttypcoll and
> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
> built from a bare PartitionKey (not PartitionSchemeData), and after that
> point, parttypcoll no longer needs to be kept around.

Yes. That's right.

>
> I noticed that there is a typo in the patch.
>
> +   memcpy(part_scheme->partcollation, partkey->parttypcoll,
>
> s/parttypcoll/partcollation/g
>
> BTW, should there be a relevant test in partition_join.sql?  If yes,
> attached a patch (partitionwise-join-collation-test-1.patch) to add one.

A partition-wise join path will be created but discarded because of its
higher cost, in which case this test won't see it. So, please add some
data like the other tests do, and add a command to analyze the
partitioned tables; that protects against the path being thrown away on
cost grounds.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] path toward faster partition pruning

2018-02-28 Thread Ashutosh Bapat
On Tue, Feb 27, 2018 at 3:03 PM, Amit Langote
 wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested to OR clauses handling (in partprune.c) that
> he posted as a separate patch on the run-time pruning thread [1].

Some comments on 0001.

 partnatts != part_scheme->partnatts)
 continue;

-/* Match the partition key types. */
+/*
+ * Match the partition key types and partitioning-specific collations.
+ */

We are comparing opfamily and opclass input type as well, but this comment
doesn't explicitly mention those like it mentions collation. Instead, I think
we should just say, "Match partition key type properties"

You had commented on the "advanced partition matching" code about asserting
that partsupfuncs also match.  Robert too has expressed a similar opinion
upthread.  Maybe we should at least try to assert that the function OIDs match.

-Oid   *parttypcoll;/* OIDs of collations of partition keys. */
+
+/*
+ * We store both the collation implied by the partition key's type and the
+ * one specified for partitioning.  Values in the former are used as
+ * varcollid in the Vars corresponding to simple column partition keys so
+ * as to make them match corresponding Vars appearing elsewhere in the
+ * query tree.  Whereas, the latter is used when actually comparing values
+ * against partition bounds datums, such as, when doing partition pruning.
+ */
+Oid   *parttypcoll;
+Oid   *partcollation;

As you have already mentioned upthread only partcollation is needed, not
parttypcoll.

 /* Cached information about partition key data types. */
 int16   *parttyplen;
 bool   *parttypbyval;
+
+/*
+ * Cached array of partitioning comparison functions' fmgr structs.  We
+ * don't compare these when trying to match two partition schemes.
+ */

I think this comment should go away.  The second sentence doesn't explain
why, and if it did, it should do so in find_partition_scheme(), not here.
partsupfunc is another property of the partition keys that is cached like
parttyplen and parttypbyval; why does it deserve a separate comment when
the others don't?



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


