Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Masahiko Sawada
On Wed, Nov 22, 2017 at 2:56 PM, Craig Ringer  wrote:
> On 22 November 2017 at 12:15, Kyotaro HORIGUCHI
>  wrote:
>>
>> At Wed, 22 Nov 2017 12:57:34 +0900, Michael Paquier
>>  wrote in
>> 

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-11-21 Thread Fabien COELHO


Note that the patch may interact with other patches which add functions to
pgbench, so it might need a rebase depending on the order in which the patches
are applied.


Attached a minor rebase after 16827d4424.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index f6e93c3..f119ae6 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1093,6 +1093,14 @@ pgbench  options  d
an integer between 1 and 10
   
   
+   random_zipfian(lb, ub, parameter)
+   integer
+   Zipfian-distributed random integer in [lb, ub],
+  see below
+   random_zipfian(1, 10, 1.5)
+   an integer between 1 and 10
+  
+  
sqrt(x)
double
square root
@@ -1173,6 +1181,27 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
   of the Box-Muller transform.
  
 
+
+ 
+  random_zipfian generates an approximated bounded zipfian
+  distribution. For parameter in (0, 1), an
+  approximated algorithm is taken from
+  "Quickly Generating Billion-Record Synthetic Databases",
+  Jim Gray et al, SIGMOD 1994. For parameter
+  in (1, 1000), a rejection method is used, based on
+  "Non-Uniform Random Variate Generation", Luc Devroye, p. 550-551,
+  Springer 1986. The distribution is not defined when the parameter's
+  value is 1.0. The drawing performance is poor for parameter values
+  close and above 1.0 and on a small range.
+ 
+ 
+  parameter
+  defines how skewed the distribution is. The larger the parameter, the more
+  frequently values to the beginning of the interval are drawn.
+  The closer to 0 parameter is,
+  the flatter (more uniform) the access distribution.
+ 
+

 
   
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index b3a2d9b..25d5ad4 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -191,6 +191,9 @@ static const struct
 	{
 		"random_exponential", 3, PGBENCH_RANDOM_EXPONENTIAL
 	},
+	{
+		"random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN
+	},
 	/* keep as last array element */
 	{
 		NULL, 0, 0
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index bd96eae..7ce6f60 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -95,7 +95,10 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define ZIPF_CACHE_SIZE	15		/* cache cells number */
+
 #define MIN_GAUSSIAN_PARAM		2.0 /* minimum parameter for gauss */
+#define MAX_ZIPFIAN_PARAM		1000	/* maximum parameter for zipfian */
 
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
@@ -331,6 +334,35 @@ typedef struct
 } CState;
 
 /*
+ * Cache cell for zipfian_random call
+ */
+typedef struct
+{
+	/* cell keys */
+	double		s;/* s - parameter of zipfan_random function */
+	int64		n;/* number of elements in range (max - min + 1) */
+
+	double		harmonicn;		/* generalizedHarmonicNumber(n, s) */
+	double		alpha;
+	double		beta;
+	double		eta;
+
+	uint64		last_used;		/* last used logical time */
+} ZipfCell;
+
+/*
+ * Zipf cache for zeta values
+ */
+typedef struct
+{
+	uint64		current;		/* counter for LRU cache replacement algorithm */
+
+	int			nb_cells;		/* number of filled cells */
+	int			overflowCount;	/* number of cache overflows */
+	ZipfCell	cells[ZIPF_CACHE_SIZE];
+} ZipfCache;
+
+/*
  * Thread state
  */
 typedef struct
@@ -342,6 +374,8 @@ typedef struct
 	unsigned short random_state[3]; /* separate randomness for each thread */
 	int64		throttle_trigger;	/* previous/next throttling (us) */
 	FILE	   *logfile;		/* where to log, or NULL */
+	ZipfCache	zipf_cache;		/* for thread-safe  zipfian random number
+ * generation */
 
 	/* per thread collected stats */
 	instr_time	start_time;		/* thread start time */
@@ -746,6 +780,137 @@ getPoissonRand(TState *thread, int64 center)
 	return (int64) (-log(uniform) * ((double) center) + 0.5);
 }
 
+/* helper function for getZipfianRand */
+static double
+generalizedHarmonicNumber(int64 n, double s)
+{
+	int			i;
+	double		ans = 0.0;
+
+	for (i = n; i > 1; i--)
+		ans += pow(i, -s);
+	return ans + 1.0;
+}
+
+/* set harmonicn and other parameters to cache cell */
+static void
+zipfSetCacheCell(ZipfCell * cell, int64 n, double s)
+{
+	double		harmonic2;
+
+	cell->n = n;
+	cell->s = s;
+
+	harmonic2 = generalizedHarmonicNumber(2, s);
+	cell->harmonicn = generalizedHarmonicNumber(n, s);
+
+	cell->alpha = 1.0 / (1.0 - s);
+	cell->beta = pow(0.5, s);
+	cell->eta = (1.0 - pow(2.0 / n, 1.0 - s)) / (1.0 - harmonic2 / cell->harmonicn);
+}
+
+/*
+ * search for cache cell with keys (n, s)
+ * and create new cell if it does not exist
+ */
+static ZipfCell *
+zipfFindOrCreateCacheCell(ZipfCache * cache, int64 n, double s)
+{
+	int			i,
+	
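
To make the algorithm described in the patch's documentation concrete, here is
a small standalone sketch of the bounded Zipfian draw for a parameter s in
(0, 1), following the Gray et al. approximation whose alpha/beta/eta terms
zipfSetCacheCell() precomputes above. The uniform generator and the fixed
[1, n] range are simplifications for illustration, not pgbench's actual
pg_erand48-based code, and the rejection method used for s > 1 is not shown.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* sum of i^(-s) for i = 1..n, as in the patch's generalizedHarmonicNumber() */
static double
harmonic(int64_t n, double s)
{
    double      ans = 0.0;

    for (int64_t i = n; i > 1; i--)
        ans += pow(i, -s);
    return ans + 1.0;
}

/*
 * Draw a Zipfian-distributed value in [1, n] for 0 < s < 1, using the
 * approximation from "Quickly Generating Billion-Record Synthetic Databases"
 * (Gray et al., SIGMOD 1994).  alpha, beta and eta correspond to the fields
 * that zipfSetCacheCell() caches per (n, s) pair.
 */
static int64_t
zipf_draw(int64_t n, double s)
{
    double      harmonicn = harmonic(n, s);
    double      alpha = 1.0 / (1.0 - s);
    double      beta = pow(0.5, s);
    double      eta = (1.0 - pow(2.0 / n, 1.0 - s)) /
                      (1.0 - harmonic(2, s) / harmonicn);
    double      u = rand() / ((double) RAND_MAX + 1.0);  /* uniform in [0, 1) */
    double      uz = u * harmonicn;

    if (uz < 1.0)
        return 1;
    if (uz < 1.0 + beta)
        return 2;
    return 1 + (int64_t) (n * pow(eta * u - eta + 1.0, alpha));
}

int
main(void)
{
    /* a few draws over [1, 10] with s = 0.5, just to show the skew */
    for (int i = 0; i < 10; i++)
        printf("%lld\n", (long long) zipf_draw(10, 0.5));
    return 0;
}

Recomputing the harmonic number on every draw is O(n); the ZipfCache/ZipfCell
machinery in the patch exists precisely to cache these per-(n, s) values, with
LRU replacement over ZIPF_CACHE_SIZE cells.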

Re: With commit 4e5fe9ad19, range partition missing handling for the NULL partition key

2017-11-21 Thread Amit Langote
Hi Rushabh,

On 2017/11/22 13:45, Rushabh Lathia wrote:
> Consider the below test:
> 
> CREATE TABLE range_tab(a int, b int) PARTITION BY RANGE(a);
> CREATE TABLE range_tab_p1 PARTITION OF range_tab FOR VALUES FROM (minvalue)
> TO (10);
> CREATE TABLE range_tab_p2 PARTITION OF range_tab FOR VALUES FROM (10) TO
> (20);
> CREATE TABLE range_tab_p3 PARTITION OF range_tab FOR VALUES FROM (20) TO
> (maxvalue);
> 
> INSERT INTO range_tab VALUES(NULL, 10);
> 
> Above insert should fail with an error "no partition of relation found for
> row".
> 
> Looking further I found that, this behaviour is changed after below commit:
> 
> commit 4e5fe9ad19e14af360de7970caa8b150436c9dec
> Author: Robert Haas 
> Date:   Wed Nov 15 10:23:28 2017 -0500
> 
> Centralize executor-related partitioning code.
> 
> Some code is moved from partition.c, which has grown very quickly
> lately;
> splitting the executor parts out might help to keep it from getting
> totally out of control.  Other code is moved from execMain.c.  All is
> moved to a new file execPartition.c.  get_partition_for_tuple now has
> a new interface that more clearly separates executor concerns from
> generic concerns.
> 
> Amit Langote.  A slight comment tweak by me.
> 
> Before above commit insert with NULL partition key in the range partition
> was throwing a proper error.

Oops, good catch.

> Attaching patch to fix as well as regression test.

Thanks for the patch.  As for the code, how about doing it like the attached
instead?

Thanks,
Amit
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 9a44cceb22..a87dacc057 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2536,11 +2536,10 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 */
for (i = 0; i < key->partnatts; i++)
{
-   if (isnull[i] &&
-   partition_bound_has_default(partdesc->boundinfo))
+   if (isnull[i])
{
range_partkey_has_null = true;
-   part_index = partdesc->boundinfo->default_index;
+   break;
}
}
 
@@ -2560,6 +2559,8 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 */
part_index = partdesc->boundinfo->indexes[bound_offset + 1];
}
+   else
+   part_index = partdesc->boundinfo->default_index;
}
break;
 
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index 9d84ba4658..a0e3746e70 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -642,6 +642,10 @@ create table mcrparted2 partition of mcrparted for values from (10, 6, minvalue)
 create table mcrparted3 partition of mcrparted for values from (11, 1, 1) to (20, 10, 10);
 create table mcrparted4 partition of mcrparted for values from (21, minvalue, minvalue) to (30, 20, maxvalue);
 create table mcrparted5 partition of mcrparted for values from (30, 21, 20) to (maxvalue, maxvalue, maxvalue);
+-- null not allowed in range partition
+insert into mcrparted values (null, null, null);
+ERROR:  no partition of relation "mcrparted" found for row
+DETAIL:  Partition key of the failing row contains (a, abs(b), c) = (null, null, null).
 -- routed to mcrparted0
 insert into mcrparted values (0, 1, 1);
 insert into mcrparted0 values (0, 1, 1);
diff --git a/src/test/regress/sql/insert.sql b/src/test/regress/sql/insert.sql
index 791817ba50..1c4491a6a2 100644
--- a/src/test/regress/sql/insert.sql
+++ b/src/test/regress/sql/insert.sql
@@ -417,6 +417,9 @@ create table mcrparted3 partition of mcrparted for values from (11, 1, 1) to (20
 create table mcrparted4 partition of mcrparted for values from (21, minvalue, minvalue) to (30, 20, maxvalue);
 create table mcrparted5 partition of mcrparted for values from (30, 21, 20) to (maxvalue, maxvalue, maxvalue);
 
+-- null not allowed in range partition
+insert into mcrparted values (null, null, null);
+
 -- routed to mcrparted0
 insert into mcrparted values (0, 1, 1);
 insert into mcrparted0 values (0, 1, 1);


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Craig Ringer
On 22 November 2017 at 12:15, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> At Wed, 22 Nov 2017 12:57:34 +0900, Michael Paquier <
> michael.paqu...@gmail.com> wrote in 

Re: [HACKERS] Issues with logical replication

2017-11-21 Thread Simon Riggs
On 4 October 2017 at 10:35, Petr Jelinek  wrote:
> On 02/10/17 18:59, Petr Jelinek wrote:
>>>
>>> Now fix the trigger function:
>>> CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER AS $$
>>> BEGIN
>>>   RETURN NEW;
>>> END $$ LANGUAGE plpgsql;
>>>
>>> And manually perform at master two updates inside one transaction:
>>>
>>> postgres=# begin;
>>> BEGIN
>>> postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
>>> UPDATE 1
>>> postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
>>> UPDATE 1
>>> postgres=# commit;
>>> 
>>>
>>> and in replica log we see:
>>> 2017-10-02 18:40:26.094 MSK [2954] LOG:  logical replication apply
>>> worker for subscription "sub" has started
>>> 2017-10-02 18:40:26.101 MSK [2954] ERROR:  attempted to lock invisible
>>> tuple
>>> 2017-10-02 18:40:26.102 MSK [2882] LOG:  worker process: logical
>>> replication worker for subscription 16403 (PID 2954) exited with exit
>>> code 1
>>>
>>> Error happens in trigger.c:
>>>
>>> #3  0x0069bddb in GetTupleForTrigger (estate=0x2e36b50,
>>> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,
>>> lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at
>>> trigger.c:3103
>>> #4  0x0069b259 in ExecBRUpdateTriggers (estate=0x2e36b50,
>>> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,
>>> fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
>>> #5  0x006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50,
>>> epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)
>>> at execReplication.c:461
>>> #6  0x00820894 in apply_handle_update (s=0x7ffc442163b0) at
>>> worker.c:736
>>
>> We have locked the same tuple in RelationFindReplTupleByIndex() just
>> before this gets called and didn't get the same error. I guess we do
>> something wrong with snapshots. Will need to investigate more.
>>
>
> Okay, so it's not snapshot. It's the fact that we don't set the
> es_output_cid in replication worker which GetTupleForTrigger is using
> when locking the tuple. Attached one-liner fixes it.

Applied, thanks

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



With commit 4e5fe9ad19, range partition missing handling for the NULL partition key

2017-11-21 Thread Rushabh Lathia
Consider the below test:

CREATE TABLE range_tab(a int, b int) PARTITION BY RANGE(a);
CREATE TABLE range_tab_p1 PARTITION OF range_tab FOR VALUES FROM (minvalue)
TO (10);
CREATE TABLE range_tab_p2 PARTITION OF range_tab FOR VALUES FROM (10) TO
(20);
CREATE TABLE range_tab_p3 PARTITION OF range_tab FOR VALUES FROM (20) TO
(maxvalue);

INSERT INTO range_tab VALUES(NULL, 10);

The above insert should fail with the error "no partition of relation found for
row".

Looking further, I found that this behaviour changed after the below commit:

commit 4e5fe9ad19e14af360de7970caa8b150436c9dec
Author: Robert Haas 
Date:   Wed Nov 15 10:23:28 2017 -0500

Centralize executor-related partitioning code.

Some code is moved from partition.c, which has grown very quickly
lately;
splitting the executor parts out might help to keep it from getting
totally out of control.  Other code is moved from execMain.c.  All is
moved to a new file execPartition.c.  get_partition_for_tuple now has
a new interface that more clearly separates executor concerns from
generic concerns.

Amit Langote.  A slight comment tweak by me.

Before the above commit, an insert with a NULL partition key into a range
partition threw a proper error.

postgres@112171=#INSERT INTO range_tab VALUES(NULL, 10);
ERROR:  no partition of relation "range_tab" found for row
DETAIL:  Partition key of the failing row contains (a) = (null).

Looking at the code around partition_bound_cmp(), before commit 4e5fe9ad19
there was a condition for the null values:

 /*
  * No range includes NULL, so this will be accepted by the
  * default partition if there is one, and otherwise
  * rejected.
  */
 for (i = 0; i < key->partnatts; i++)
 {
     if (isnull[i] &&
         partition_bound_has_default(partdesc->boundinfo))
     {
         range_partkey_has_null = true;
         break;
     }
     else if (isnull[i])
     {
         *failed_at = parent;
         *failed_slot = slot;
         result = -1;
         goto error_exit;
     }
 }

But after the commit, the condition for isnull is missing.  That doesn't look
intentional, does it?

Attaching patch to fix as well as regression test.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 67d4c2a..b62e8f5 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2541,6 +2541,11 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 		range_partkey_has_null = true;
 		part_index = partdesc->boundinfo->default_index;
 	}
+	else if(isnull[i])
+	{
+		range_partkey_has_null = true;
+		part_index = -1;
+	}
 }
 
 if (!range_partkey_has_null)
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index 9d84ba4..a0e3746 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -642,6 +642,10 @@ create table mcrparted2 partition of mcrparted for values from (10, 6, minvalue)
 create table mcrparted3 partition of mcrparted for values from (11, 1, 1) to (20, 10, 10);
 create table mcrparted4 partition of mcrparted for values from (21, minvalue, minvalue) to (30, 20, maxvalue);
 create table mcrparted5 partition of mcrparted for values from (30, 21, 20) to (maxvalue, maxvalue, maxvalue);
+-- null not allowed in range partition
+insert into mcrparted values (null, null, null);
+ERROR:  no partition of relation "mcrparted" found for row
+DETAIL:  Partition key of the failing row contains (a, abs(b), c) = (null, null, null).
 -- routed to mcrparted0
 insert into mcrparted values (0, 1, 1);
 insert into mcrparted0 values (0, 1, 1);
diff --git a/src/test/regress/sql/insert.sql b/src/test/regress/sql/insert.sql
index 791817b..1c4491a 100644
--- a/src/test/regress/sql/insert.sql
+++ b/src/test/regress/sql/insert.sql
@@ -417,6 +417,9 @@ create table mcrparted3 partition of mcrparted for values from (11, 1, 1) to (20
 create table mcrparted4 partition of mcrparted for values from (21, minvalue, minvalue) to (30, 20, maxvalue);
 create table mcrparted5 partition of mcrparted for values from (30, 21, 20) to (maxvalue, maxvalue, maxvalue);
 
+-- null not allowed in range partition
+insert into mcrparted values (null, null, null);
+
 -- routed to mcrparted0
 insert into mcrparted values (0, 1, 1);
 insert into mcrparted0 values (0, 1, 1);


Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE

2017-11-21 Thread Jing Wang
Hi Nathan,

Thanks for review comments.

Enclosed please find the patch which has been updated according to your
suggestion.

CURRENT_DATABASE can be used in the following SQL statements, and the
corresponding information can be found in the sgml files:
1. COMMENT ON DATABASE CURRENT_DATABASE is ...
2. ALTER DATABASE CURRENT_DATABASE OWNER to ...
3. ALTER DATABASE CURRENT_DATABASE SET parameter ...
4. ALTER DATABASE CURRENT_DATABASE RESET parameter ...
5. ALTER DATABASE CURRENT_DATABASE RESET ALL
6. SELECT CURRENT_DATABASE
7. SECURITY LABEL ON DATABASE CURRENT_DATABASE

As you mentioned, database_name is also present in GRANT/REVOKE/ALTER ROLE,
so a patch supporting CURRENT_DATABASE in those SQL statements will be
submitted later.

Regards,
Jing Wang
Fujitsu Australia


comment_on_current_database_no_pgdump_v4.4.patch
Description: Binary data


comment_on_current_database_for_pgdump_v4.4.patch
Description: Binary data


Re: [HACKERS] Issues with logical replication

2017-11-21 Thread Craig Ringer
On 4 October 2017 at 07:35, Petr Jelinek 
wrote:

> On 02/10/17 18:59, Petr Jelinek wrote:
> >>
> >> Now fix the trigger function:
> >> CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER
> AS $$
> >> BEGIN
> >>   RETURN NEW;
> >> END $$ LANGUAGE plpgsql;
> >>
> >> And manually perform at master two updates inside one transaction:
> >>
> >> postgres=# begin;
> >> BEGIN
> >> postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
> >> UPDATE 1
> >> postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
> >> UPDATE 1
> >> postgres=# commit;
> >> 
> >>
> >> and in replica log we see:
> >> 2017-10-02 18:40:26.094 MSK [2954] LOG:  logical replication apply
> >> worker for subscription "sub" has started
> >> 2017-10-02 18:40:26.101 MSK [2954] ERROR:  attempted to lock invisible
> >> tuple
> >> 2017-10-02 18:40:26.102 MSK [2882] LOG:  worker process: logical
> >> replication worker for subscription 16403 (PID 2954) exited with exit
> >> code 1
> >>
> >> Error happens in trigger.c:
> >>
> >> #3  0x0069bddb in GetTupleForTrigger (estate=0x2e36b50,
> >> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,
> >> lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at
> >> trigger.c:3103
> >> #4  0x0069b259 in ExecBRUpdateTriggers (estate=0x2e36b50,
> >> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,
> >> fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
> >> #5  0x006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50,
> >> epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)
> >> at execReplication.c:461
> >> #6  0x00820894 in apply_handle_update (s=0x7ffc442163b0) at
> >> worker.c:736
> >
> > We have locked the same tuple in RelationFindReplTupleByIndex() just
> > before this gets called and didn't get the same error. I guess we do
> > something wrong with snapshots. Will need to investigate more.
> >
>
> Okay, so it's not snapshot. It's the fact that we don't set the
> es_output_cid in replication worker which GetTupleForTrigger is using
> when locking the tuple. Attached one-liner fixes it.
>

This seems like a clear-cut bug with a simple fix.

Let's get this committed, so we don't lose it. The rest of the thread is
going off into the weeds a bit, onto issues unrelated to the original problem.
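
For context, the change being discussed is tiny: GetTupleForTrigger() locks the
tuple using the command id in estate->es_output_cid, which the apply worker
never initializes. A minimal sketch of that kind of one-liner follows; the
exact placement in worker.c is an assumption here, not a quote from the
attached patch.

    /*
     * Give the apply worker's executor state a valid current command id,
     * so that tuple locking in GetTupleForTrigger() sees a usable cid.
     * (Hypothetical placement: wherever the worker builds its EState.)
     */
    estate->es_output_cid = GetCurrentCommandId(true);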

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Kyotaro HORIGUCHI
At Wed, 22 Nov 2017 12:57:34 +0900, Michael Paquier  
wrote in 

Re: [HACKERS] More stats about skipped vacuums

2017-11-21 Thread Michael Paquier
On Wed, Nov 22, 2017 at 1:08 PM, Kyotaro HORIGUCHI
 wrote:
> At Wed, 22 Nov 2017 08:20:22 +0900, Michael Paquier 
>  wrote in 
> 
>> On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
>>  wrote:
>> > By the way I'm uneasy that the 'last_vacuum_index_scans' (and
>> > vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
>> > both VACUUM command and autovacuum, while last_vacuum and
>> > vacuum_count is mentioning only the command. Splitting it into
>> > vacuum/autovaccum seems nonsense but the name is confusing. Do
>> > you have any idea?
>>
>> Hm. I think that you should actually have two fields, one for manual
>> vacuum and one for autovacuum, because each is tied to respectively
>> maintenance_work_mem and autovacuum_work_mem. This way admins are able
>
> It's very convincing for me. Thanks for the suggestion.
>
>> to tune each one of those parameters depending on a look at
>> pg_stat_all_tables. So those should be named perhaps
>> last_vacuum_index_scans and last_autovacuum_index_scans?
>
> Agreed. I'll do so in the next version.

Thanks for considering the suggestion.

> # I forgot to add the version to the patch files...

Don't worry about that. That's not a problem for me; I'll just keep
track of the last entry. With the room I have, I'll keep focused on
0001, by the way. Others are of course welcome to look at 0002 and
onwards.
-- 
Michael



Re: [HACKERS] More stats about skipped vacuums

2017-11-21 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 22 Nov 2017 08:20:22 +0900, Michael Paquier  
wrote in 
> On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
>  wrote:
> > By the way I'm uneasy that the 'last_vacuum_index_scans' (and
> > vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
> > both VACUUM command and autovacuum, while last_vacuum and
> > vacuum_count is mentioning only the command. Splitting it into
> > vacuum/autovaccum seems nonsense but the name is confusing. Do
> > you have any idea?
> 
> Hm. I think that you should actually have two fields, one for manual
> vacuum and one for autovacuum, because each is tied to respectively
> maintenance_work_mem and autovacuum_work_mem. This way admins are able

It's very convincing for me. Thanks for the suggestion.

> to tune each one of those parameters depending on a look at
> pg_stat_all_tables. So those should be named perhaps
> last_vacuum_index_scans and last_autovacuum_index_scans?

Agreed. I'll do so in the next version.

# I forgot to add the version to the patch files...

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Logical Replication and triggers

2017-11-21 Thread Craig Ringer
On 22 November 2017 at 05:35, Robert Haas  wrote:

> On Tue, Nov 21, 2017 at 4:28 PM, Simon Riggs 
> wrote:
> > I would have acted on this myself a few days back if I thought the
> > patch was correct, but I see multiple command counter increments
> > there, so suspect an alternate fix is correct.
>
> Oh, well, I'm glad you said something.  I was actually thinking about
> committing the patch Peter posted in
> http://postgr.es/m/619c557d-93e6-1833-1692-b010b176f...@2ndquadrant.com
> because it looks correct to me, but I'm certainly not an expert on
> that code so I'll wait for your analysis.
>

I'll read over the patch and respond on that thread.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Logical Replication and triggers

2017-11-21 Thread Craig Ringer
On 22 November 2017 at 02:27, Robert Haas  wrote:

> On Sat, Nov 18, 2017 at 7:30 PM, Craig Ringer 
> wrote:
> > On 15 November 2017 at 21:12, Thomas Rosenstein
> >  wrote:
> >> I would like somebody to consider Petr Jelineks patch for worker.c from
> >> here
> >> (https://www.postgresql.org/message-id/619c557d-93e6-1833-
> 1692-b010b176ff77%402ndquadrant.com)
> >>
> >> I'm was facing the same issue with 10.1 and BEFORE INSERT OR UPDATE
> >> triggers.
> >
> > Please:
> >
> > - Apply it to current HEAD
> > - Test its functionality
> > - Report back on the patch thread
> > - Update the commitfest app with your results and sign on as a reviewer
> > - If you're able, read over the patch and make any comments you can
> >
> > "Somebody" needs to be you, if you want this functionality.
>
> You realize we're talking about a bug fix, right?  And for a feature
> that was developed and committed by your colleagues?
>

I did not realise it was a bug fix, and agree that changes things.

There was discussion around the same time about people wanting extra trigger
features, and I incorrectly assumed this post was about that.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Kyotaro HORIGUCHI
At Tue, 21 Nov 2017 20:53:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20171121.205304.90315453.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 21 Nov 2017 20:27:25 +0900, atorikoshi 
>  wrote in 
> 
> > Thanks for reviewing!
> > 
> > 
> > On 2017/11/21 18:12, Masahiko Sawada wrote:
> > > On Tue, Nov 21, 2017 at 3:48 PM, Masahiko Sawada
> > >  wrote:
> > >> On Mon, Nov 20, 2017 at 7:35 PM, atorikoshi
> > >>  wrote:
> > >>> Hi,
> > >>>
> > >>> I put many queries into one transaction and made ReorderBuffer spill
> > >>> data to disk, and sent SIGKILL to postgres before the end of the
> > >>> transaction.
> > >>>
> > >>> After starting up postgres again, I observed the files spilled to
> > >>> data wasn't deleted.
> > >>>
> > >>> I think these files should be deleted because its transaction was no
> > >>> more valid, so no one can use these files.
> > >>>
> > >>>
> > >>> Below is a reproduction instructions.
> > >>>
> > >>> 
> > >>> 1. Create table and publication at the publisher.
> > >>>
> > >>>   @pub =# CREATE TABLE t1(
> > >>>   id INT PRIMARY KEY,
> > >>>   name TEXT);
> > >>>
> > >>>   @pub =# CREATE PUBLICATION pub FOR TABLE t1;
> > >>>
> > >>> 2. Create table and subscription at subscriber.
> > >>>
> > >>>   @sub =# CREATE TABLE t1(
> > >>>   id INT PRIMARY KEY,
> > >>>   name TEXT
> > >>>   );
> > >>>
> > >>>   @sub =# CREATE SUBSCRIPTION sub
> > >>>   CONNECTION 'host=[hostname] port=[port] dbname=[dbname]'
> > >>>   PUBLICATION pub;
> > >>>
> > >>> 3. Put many queries into one transaction.
> > >>>
> > >>>   @pub =# BEGIN;
> > >>> INSERT INTO t1
> > >>> SELECT
> > >>>   i,
> > >>>   'aa'
> > >>> FROM
> > >>> generate_series(1, 100) as i;
> > >>>
> > >>> 4. Then we can see spilled files.
> > >>>
> > >>>   @pub $ ls -1 ${PGDATA}/pg_replslot/sub/
> > >>> state
> > >>> xid-561-lsn-0-100.snap
> > >>> xid-561-lsn-0-200.snap
> > >>> xid-561-lsn-0-300.snap
> > >>> xid-561-lsn-0-400.snap
> > >>> xid-561-lsn-0-500.snap
> > >>> xid-561-lsn-0-600.snap
> > >>> xid-561-lsn-0-700.snap
> > >>> xid-561-lsn-0-800.snap
> > >>> xid-561-lsn-0-900.snap
> > >>>
> > >>> 5. Kill publisher's postgres process before COMMIT.
> > >>>
> > >>>   @pub $ kill -s SIGKILL [pid of postgres]
> > >>>
> > >>> 6. Start publisher's postgres process.
> > >>>
> > >>>   @pub $ pg_ctl start -D ${PGDATA}
> > >>>
> > >>> 7. After a while, we can see the files remaining.
> > >>>   (Immediately after starting the publisher, we cannot see these files.)
> > >>>
> > >>>   @pub $ pg_ctl start -D ${PGDATA}
> > >>>
> > >>>   When I configured with '--enable-cassert', below assertion error
> > >>>   was appeared.
> > >>>
> > >>> TRAP: FailedAssertion("!(txn->final_lsn != 0)", File:
> > >>> "reorderbuffer.c",
> > >>> Line: 2576)
> > >>> 
> > >>>
> > >>> Attached patch sets final_lsn to the last ReorderBufferChange if
> > >>> final_lsn == 0.
> > >>
> > >> Thank you for the report. I could reproduce this issue with the above
> > >> step. My analysis is, the cause of that a serialized reorder buffer
> > >> isn't cleaned up is that the aborted transaction without an abort WAL
> > >> record has no chance to set ReorderBufferTXN->final_lsn. So if there
> > >> is such serialized transactions ReorderBufferRestoreCleanup cleanups
> > >> no files, which is cause of the assertion failure (or a file being
> > >> orphaned). What do you think?
> > >>
> > >> On detail of your patch, I'm not sure it's safe if we set the lsn of
> > >> other than commit record or abort record to final_lsn. The comment in
> > >> reorderbuffer.h says,
> > >>
> > >> typedef struct ReorderBufferTXN
> > >> {
> > >> (snip)
> > >>
> > >> /* 
> > >>  * LSN of the record that lead to this xact to be committed or
> > >>  * aborted. This can be a
> > >>  * * plain commit record
> > >>  * * plain commit record, of a parent transaction
> > >>  * * prepared transaction commit
> > >>  * * plain abort record
> > >>  * * prepared transaction abort
> > >>  * * error during decoding
> > >>  * 
> > >>  */
> > >> XLogRecPtr  final_lsn;
> > >>
> > >> But with your patch, we could set a lsn of a record that is other than
> > >> what listed above to final_lsn. One way I came up with is to make
> > 
> > I added some comments on final_lsn.
> > 
> > >> ReorderBufferRestoreCleanup accept an invalid value of final_lsn and
> > >> regards it as a aborted transaction that doesn't has a abort WAL
> > >> record. So we can cleanup all serialized files if final_lsn of a
> > >> transaction is invalid.
> > >
> > > After more 

Re: Logical Replication and triggers

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 4:28 PM, Simon Riggs  wrote:
> I would have acted on this myself a few days back if I thought the
> patch was correct, but I see multiple command counter increments
> there, so suspect an alternate fix is correct.

Oh, well, I'm glad you said something.  I was actually thinking about
committing the patch Peter posted in
http://postgr.es/m/619c557d-93e6-1833-1692-b010b176f...@2ndquadrant.com
because it looks correct to me, but I'm certainly not an expert on
that code so I'll wait for your analysis.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: default range partition and constraint exclusion

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 4:36 AM, Amit Langote
 wrote:
>>> The attached will make the constraint to look like:
>>
>> Uh, if the constraint exclusion logic we're using is drawing false
>> conclusions, we need to fix it so it doesn't, not change the
>> constraint so that the wrong logic gives the right answer.
>
> I did actually consider it, but ended up concluding that constraint
> exclusion is doing all it can using the provided list partition constraint
> clauses.
>
> If all predicate_refuted_by() receives is the expression tree (AND/OR)
> with individual nodes being strict clauses involving partition keys (and
> nothing about the nullness of the keys), the downstream code is just
> playing by the rules as explained in the header comment of
> predicate_refuted_by_recurse() in concluding that query's restriction
> clause a = 2 refutes it.

Oh, wait a minute.  Actually, I think predicate_refuted_by() is doing
the right thing here.  Isn't the problem that mc2p2 shouldn't be
accepting a (2, null) tuple at all?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical Replication and triggers

2017-11-21 Thread Simon Riggs
On 21 November 2017 at 16:13, Thomas Rosenstein
 wrote:

> To weigh in here, I actually find it's a big hurdle
>
> I'm a postgres user and not a postgres dev, so I definitly have the feeling
> I'm not qualified to answer if this really does what it's intended todo.
> Further more not beeing in the processes it will take me probably 2 - 3
> hours (if I have time) to figure out everything I should do and how I should
> do it,
> somebody doing this regularly might take 5 minutes.
>
> Yes it fixes my replication issues, yes it seems to work on the first look,
> but what does it really do - no idea!

Fair enough, but that's not what Craig asked, is it?

Seems like he gave a helpful, long explanation of how you can help
expedite the process of fixing the bug, to which he received no reply.

We're trying to ensure that bugs are fixed, but also take care not to
introduce further bugs or false fixes that don't move us forwards.

I would have acted on this myself a few days back if I thought the
patch was correct, but I see multiple command counter increments
there, so suspect an alternate fix is correct.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Peter Geoghegan
On Tue, Nov 21, 2017 at 12:55 PM, Robert Haas  wrote:
> I agree.
>
> I have been of the opinion all along that progress monitoring needs to
> report facts, not theories.  The number of tuples read thus far is a
> fact, and is fine to report for whatever value it may have to someone.

That makes a lot of sense to me. I sometimes think that we're too
hesitant to expose internal information due to concerns about it being
hard to interpret. I see wait events as bucking this trend, which I
welcome. We see similar trends in the Linux kernel, with tools like
perf and BCC/eBPF now being regularly used to debug production issues.

> The number of tuples that will be read in the future is a theory, and
> as you say, progress monitoring is most likely to be used in cases
> where theory and practice ended up being very different.

You hit the nail on the head here.

It's not that these things are not difficult to interpret - the
concern itself is justified. It just needs to be weighed against the
benefit of having some instrumentation to start with. People are much
more likely to complain about obscure debug information, which makes
them feel dumb, than they are to complain about the absence of any
instrumentation, but I still think that the latter is the bigger
problem.

Besides, you don't necessarily have to understand something to act on
it. The internals of Oracle are trade secrets, but they were the first
to have wait events, I think. At least having something that you can
Google can make all the difference.

-- 
Peter Geoghegan



Re: Logical Replication and triggers

2017-11-21 Thread Thomas Rosenstein
> On Tue, Nov 21, 2017 at 3:29 PM, Simon Riggs  wrote:
>>> You realize we're talking about a bug fix, right?  And for a feature
>>> that was developed and committed by your colleagues?
>>
>> Craig is asking Thomas to confirm the proposed bug fix works. How is
>> this not normal?
>
> That's not exactly how I read Craig's email, but it's of course
> difficult to know what tone somebody intended from an email.  Suffice
> it to say that I think "please pay attention to this proposed bug-fix
> patch" is a pretty legitimate request.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


To weigh in here: I actually find it's a big hurdle.

I'm a postgres user and not a postgres dev, so I definitely have the
feeling I'm not qualified to answer whether this really does what it's
intended to do.
Furthermore, not being in the process, it will take me probably 2-3
hours (if I have time) to figure out everything I should do and how I
should do it;

somebody doing this regularly might take 5 minutes.

Yes, it fixes my replication issues, and yes, it seems to work at first
look, but what does it really do - no idea!





Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Peter Geoghegan
On Mon, Oct 2, 2017 at 6:04 AM, Robert Haas  wrote:
> Progress reporting on sorts seems like a tricky problem to me, as I
> said before.  In most cases, a sort is going to involve an initial
> stage where it reads all the input tuples and writes out quicksorted
> runs, and then a merge phase where it merges all the output tapes into
> a sorted result.  There are some complexities; for example, if the
> number of tapes is really large, then we might need multiple merge
> phases, only the last of which will produce tuples.

This would ordinarily be the point at which I'd say "but you're very
unlikely to require multiple passes for an external sort these days".
But I won't say that on this thread, because CLUSTER generally has
unusually wide tuples, and so is much more likely to be I/O bound, to
require multiple passes, etc. (I bet the v10 enhancements
disproportionately improved CLUSTER performance.)

-- 
Peter Geoghegan



Re: [HACKERS] Replication status in logical replication

2017-11-21 Thread Masahiko Sawada
On Tue, Nov 14, 2017 at 6:46 AM, Thomas Munro
 wrote:
> On Tue, Sep 26, 2017 at 3:45 PM, Masahiko Sawada  
> wrote:
>> On Tue, Sep 26, 2017 at 10:36 AM, Vaishnavi Prabakaran
>>  wrote:
>>> On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson  wrote:
 I’m not entirely sure why this was flagged as "Waiting for Author” by the
 automatic run, the patch applies for me and builds so resetting back to
 “Needs
 review”.

>>>
>>> This patch applies and build cleanly and I did a testing with one publisher
>>> and one subscriber, and confirm that the replication state after restarting
>>> the server now is "streaming" and not "Catchup".
>>>
>>> And, I don't find any issues with code and patch to me is ready for
>>> committer, marked the same in cf entry.
>
> Hi Sawada-san,
>
> My patch-testing robot doesn't like this patch[1].  I just tried it on
> my laptop to double-check and get some more details, and saw the same
> failures:
>
> (1) "make check" under src/test/recovery fails like this:
>
> t/006_logical_decoding.pl  2/16 # Looks like your test
> exited with 29 just after 4.
> t/006_logical_decoding.pl  Dubious, test returned 29
> (wstat 7424, 0x1d00)
> Failed 12/16 subtests
>
> regress_log_006_logical_decoding says:
>
> ok 4 - got same expected output from pg_recvlogical decoding session
> pg_recvlogical timed out at
> /opt/local/lib/perl5/vendor_perl/5.24/IPC/Run.pm line 2918.
>  waiting for endpos 0/1609B60 with stdout '', stderr '' at
> /Users/munro/projects/postgres/src/test/recovery/../../../src/test/perl/PostgresNode.pm
> line 1700.
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D
> /Users/munro/projects/postgres/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
> -m immediate stop
> waiting for server to shut down done
> server stopped
> # No postmaster PID for node "master"
> # Looks like your test exited with 29 just after 4.
>
> (2) "make check" under src/test/subscription says:
>
> t/001_rep_changes.pl .. ok
> t/002_types.pl  #
> # Looks like your test exited with 60 before it could output anything.
> t/002_types.pl  Dubious, test returned 60 (wstat 15360, 0x3c00)
> Failed 3/3 subtests
> t/003_constraints.pl ..
>
> Each of those took several minutes, and I stopped it there.  It may
> be going to say some more things but is taking a very long time
> (presumably timing out, but the 001 took ages and then succeeded...
> hmm).  In fact I had to run this on my laptop to see that because on
> Travis CI the whole test job just gets killed after 10 minutes of
> non-output and the above output was never logged because of the way
> concurrent test jobs' output is buffered.
>
> I didn't try to figure out what is going wrong.
>

Thank you for the notification!

After investigation, I found out that my previous patch was going in the
wrong direction. I should have changed XLogSendLogical() so that we can
check the read LSN and set WalSndCaughtUp = true even after reading a
record without waiting. The attached updated patch passes 'make check-world'.
Please review it.
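
For readers following along, a sketch of the kind of check described above,
assuming it sits in XLogSendLogical() after a record has been read; the exact
condition and variables used by the attached patch may differ.

    /*
     * If decoding has reached the current flush position, report the
     * walsender as caught up even though no wait was needed to read
     * this record.
     */
    if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
        WalSndCaughtUp = true;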

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


logical_repl_caught_up_v2.patch
Description: Binary data


Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Merlin Moncure
On Tue, Nov 21, 2017 at 2:16 PM, Merlin Moncure  wrote:
> On Tue, Nov 21, 2017 at 12:20 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> On Tue, Nov 21, 2017 at 11:32 AM, Merlin Moncure  wrote:
 I am very much looking at the new stored procedure functionality and
 imaging a loop like this:

 LOOP
   FOR r IN SELECT * FROM pg_get_notifications(30)
   LOOP
 PERFORM do_stuff(r);
   END LOOP;
   COMMIT;  -- advance xmin etc
 END LOOP;
>>
>>> Yeah, if you keep the timeout fairly short, it would probably work OK
>>> (with Peter's stuff).
>>
>> Traditionally, NOTIFY messages are delivered to the client only between
>> transactions, so that there is no question about whether the
>> message-delivery should roll back if the surrounding transaction aborts.
>> It's not very clear to me what the behavior of pg_get_notifications()
>> inside a transaction ought to be.  Is it OK if it's a volatile function
>> and the messages are just gone once the function has returned them,
>> even if you fail to do anything about them because your transaction
>> fails later?
>
> I think destroying upon consumption is OK.  There are a lot of
> mitigation strategies to deal with that issue and NOTIFY is for
> signalling, not queuing.
>
>> (I'd be against having a function that returns more than one at a time,
>> in any case, as that just complicates matters even more.)

Hm, a less controversial approach might be to only allow consumption
of notifications that were delivered at the start of the transaction.
Procedures could then issue an intervening 'COMMIT' statement to pick
up new notifications.  There'd be no reason for a timeout argument in
that case obviously, so the end user would have to poll in order to
pick up the notification, which I don't like.   This would be an
alternative approach to the way I do it today, which is to poll for a
set table flag in a non-serializable transaction, maybe with enough
differentiation in use to merit introduction.

merlin



Re: [HACKERS] [PATCH] Incremental sort

2017-11-21 Thread Thomas Munro
On Tue, Nov 21, 2017 at 1:00 PM, Alexander Korotkov
 wrote:
> On Mon, Nov 20, 2017 at 12:24 AM, Thomas Munro
>  wrote:
>> Is there some reason not to use ApplySortComparator for this?  I think
>> you're missing out on lower-overhead comparators, and in any case it's
>> probably good code reuse, no?
>
> However, for incremental sort case we don't need to know here whether A > B
> or B > A.  It's enough for us to know if A = B or A != B.  In some cases
> it's way cheaper.  For instance, for texts equality check is basically
> memcmp while comparison may use collation.

Ah, right, of course.

>> I gather that you have
>> determined empirically that it's better to be able to sort groups of
>> at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
>> leading attributes, but why is that the case?
>
> Right.  The issue that not only case of one tuple per group could cause
> overhead, but few tuples (like 2 or 3) is also case of overhead.  Also,
> overhead is related not only to sorting.  While investigate of regression
> case provided by Heikki [1], I've seen extra time spent mostly in extra
> copying of sample tuple and comparison with that.  In order to cope this
> overhead I've introduced MIN_GROUP_SIZE which allows to skip copying sample
> tuples too frequently.

I see.  I wonder if there could ever be a function like
ExecMoveTuple(dst, src).  Given the polymorphism involved it'd be
slightly complicated and you'd probably have a general case that just
copies the tuple to dst and clears src, but there might be a bunch of
cases where you can do something more efficient like moving a pointer
and pin ownership.  I haven't really thought that through and
there may be fundamental problems with it...

If you're going to push the tuples into the sorter every time, then I
guess there are some special cases that could allow future
optimisations: (1) if you noticed that every prefix was different, you
can skip the sort operation (that is, you can use the sorter as a dumb
tuplestore and just get the tuples out in the same order you put them
in; not sure if Tuplesort supports that but it presumably could), (2)
if you noticed that every prefix was the same (that is, you have only
one prefix/group in the sorter) then you could sort only on the suffix
(that is, you could somehow tell Tuplesort to ignore the leading
columns), (3) as a more complicated optimisation for intermediate
group sizes 1 < n < MIN_GROUP_SIZE, you could somehow number the
groups with an integer that increments whenever you see the prefix
change, and somehow tell tuplesort.c to use that instead of the
leading columns.  Ok, that last one is probably hard but the first two
might be easier...
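
As a toy illustration of the presorted-prefix point above (this is not the
patch's code): the rows below arrive already ordered by a text prefix, group
boundaries are detected with an equality-only comparison, and only each
group's second column is sorted. Tuplesort, MIN_GROUP_SIZE batching and
collation handling are deliberately left out of the sketch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Row
{
    const char *prefix;         /* input is already sorted on this column */
    int         suffix;         /* still needs sorting within each group */
} Row;

/* qsort comparator for the suffix column only */
static int
cmp_suffix(const void *a, const void *b)
{
    return ((const Row *) a)->suffix - ((const Row *) b)->suffix;
}

int
main(void)
{
    Row         rows[] = {
        {"apple", 3}, {"apple", 1}, {"apple", 2},
        {"pear", 9}, {"pear", 4},
        {"plum", 7},
    };
    size_t      nrows = sizeof(rows) / sizeof(rows[0]);
    size_t      group_start = 0;

    for (size_t i = 1; i <= nrows; i++)
    {
        /*
         * Equality-only check on the leading column: it can stop at the
         * first differing byte and needs no collation, and it is enough
         * to find group boundaries because the input already guarantees
         * the prefix ordering.
         */
        if (i == nrows || strcmp(rows[i].prefix, rows[group_start].prefix) != 0)
        {
            qsort(rows + group_start, i - group_start, sizeof(Row), cmp_suffix);
            group_start = i;
        }
    }

    for (size_t i = 0; i < nrows; i++)
        printf("%s %d\n", rows[i].prefix, rows[i].suffix);
    return 0;
}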

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 12:05 PM, Antonin Houska  wrote:
> Robert Haas  wrote:
>> On Wed, Aug 30, 2017 at 10:12 PM, Tatsuro Yamada
>>  wrote:
>> >   1. scanning heap
>> >   2. sort tuples
>>
>> These two phases overlap, though. I believe progress reporting for
>> sorts is really hard.  In the simple case where the data fits in
>> work_mem, none of the work of the sort gets done until all the data is
>> read.  Once you switch to an external sort, you're writing batch
>> files, so a lot of the work is now being done during data loading.
>> But as the number of batch files grows, the final merge at the end
>> becomes an increasingly noticeable part of the cost, and eventually
>> you end up needing multiple merge passes.  I think we need some smart
>> way to report on sorts so that we can tell how much of the work has
>> really been done, but I don't know how to do it.
>
> Whatever complexity is hidden in the sort, cost_sort() should have taken it
> into consideration when called via plan_cluster_use_sort(). Thus I think that
> once we have both startup and total cost, the current progress of the sort
> stage can be estimated from the current number of input and output
> rows. Please remind me if my proposal appears to be too simplistic.

I think it is far too simplistic.  If the sort is being fed by a
sequential scan, reporting the number of blocks scanned so far as
compared to the total number that will be scanned would be a fine way
of reporting on the progress of the sequential scan -- and it's better
to use blocks, which we know for sure about, than rows, at which we
can only guess.  But that's the *scan* progress, not the *sort*
progress.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 12:25 PM, Tom Lane  wrote:
> Antonin Houska  writes:
>> Robert Haas  wrote:
>>> These two phases overlap, though. I believe progress reporting for
>>> sorts is really hard.
>
>> Whatever complexity is hidden in the sort, cost_sort() should have taken it
>> into consideration when called via plan_cluster_use_sort(). Thus I think that
>> once we have both startup and total cost, the current progress of the sort
>> stage can be estimated from the current number of input and output
>> rows. Please remind me if my proposal appears to be too simplistic.
>
> Well, even if you assume that the planner's cost model omits nothing
> (which I wouldn't bet on), its result is only going to be as good as the
> planner's estimate of the number of rows to be sorted.  And, in cases
> where people actually care about progress monitoring, it's likely that
> the planner got that wrong, maybe horribly so.  I think it's a bad idea
> for progress monitoring to depend on the planner's estimates in any way
> whatsoever.

I agree.

I have been of the opinion all along that progress monitoring needs to
report facts, not theories.  The number of tuples read thus far is a
fact, and is fine to report for whatever value it may have to someone.
The number of tuples that will be read in the future is a theory, and
as you say, progress monitoring is most likely to be used in cases
where theory and practice ended up being very different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Combine function returning NULL unhandled?

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 10:36 PM, Andres Freund  wrote:
> The plain transition case contains:
> if (pergroupstate->transValueIsNull)
> {
> /*
>  * Don't call a strict function with NULL inputs.  
> Note it is
>  * possible to get here despite the above tests, if 
> the transfn is
>  * strict *and* returned a NULL on a prior cycle. If 
> that happens
>  * we will propagate the NULL all the way to the end.
>  */
> return;
> }
>
> how come similar logic is not present for combine functions? I don't see
> any checks preventing a combinefunc from returning NULL, nor do I see
> https://www.postgresql.org/docs/devel/static/sql-createaggregate.html
> spell out a requirement that that not be the case.

I don't know of a reason why that logic shouldn't be present for the
combine-function case as well.  It seems like it should be pretty
straightforward to write a test that hits that case and watch it blow
up ... assuming it does, then I guess we should back-patch the
addition of that logic.
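
If it does blow up, the guard would presumably mirror the transition-function
code quoted above; a minimal sketch, whose placement in the combine-function
path and surrounding variable names are assumptions rather than the committed
fix:

    if (pergroupstate->transValueIsNull)
    {
        /*
         * Don't call a strict combine function with NULL inputs.  As in
         * the plain transition case, a strict combinefn that returned
         * NULL on a prior cycle should leave the transition value NULL
         * for the rest of the aggregation.
         */
        return;
    }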

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-11-21 Thread Tomas Vondra

On 11/21/2017 09:28 PM, Ildus K wrote:
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE:  drop cascades to 2 other objects
>> DETAIL:  drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
> 
> I will check that. Is your extension published somewhere?
> 

No, it was just an experiment, so I've only attached it to the initial
review. Attached is an updated version, with a fix or two.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pg_lz4.tgz
Description: application/compressed-tar


Re: [HACKERS] pgbench regression test failure

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 3:28 PM, Tom Lane  wrote:
> Seems like a good idea, but the way you've written it is inconsistent
> with the "n/m" notation used just above.  I'd suggest
>
> ... latency limit: 33 (33/33, 100.000 %)
>
> or just
>
> ... latency limit: 33/33 (100.000 %)

Oh, yeah.  That last one sounds good; no reason to print the same
value more than once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical Replication and triggers

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 3:29 PM, Simon Riggs  wrote:
>> You realize we're talking about a bug fix, right?  And for a feature
>> that was developed and committed by your colleagues?
>
> Craig is asking Thomas to confirm the proposed bug fix works. How is
> this not normal?

That's not exactly how I read Craig's email, but it's of course
difficult to know what tone somebody intended from an email.  Suffice
it to say that I think "please pay attention to this proposed bug-fix
patch" is a pretty legitimate request.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical Replication and triggers

2017-11-21 Thread Simon Riggs
On 21 November 2017 at 13:27, Robert Haas  wrote:
> On Sat, Nov 18, 2017 at 7:30 PM, Craig Ringer  wrote:
>> On 15 November 2017 at 21:12, Thomas Rosenstein
>>  wrote:
>>> I would like somebody to consider Petr Jelineks patch for worker.c from
>>> here
>>> (https://www.postgresql.org/message-id/619c557d-93e6-1833-1692-b010b176ff77%402ndquadrant.com)
>>>
>>> I'm was facing the same issue with 10.1 and BEFORE INSERT OR UPDATE
>>> triggers.
>>
>> Please:
>>
>> - Apply it to current HEAD
>> - Test its functionality
>> - Report back on the patch thread
>> - Update the commitfest app with your results and sign on as a reviewer
>> - If you're able, read over the patch and make any comments you can
>>
>> "Somebody" needs to be you, if you want this functionality.
>
> You realize we're talking about a bug fix, right?  And for a feature
> that was developed and committed by your colleagues?

Craig is asking Thomas to confirm the proposed bug fix works. How is
this not normal?

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-21 Thread Ildus K
On Tue, 21 Nov 2017 18:47:49 +0100
Tomas Vondra  wrote:

 
> 
> I propose to use either
> 
>CompressionMethodOptions (and CompressionMethodRoutine)
> 
> or
> 
>CompressionOptions (and CompressionRoutine)

Sounds good, thanks.

> 
> OK. But then I don't understand why tsvector.c does things like
> 
> VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
> VARRAWSIZE_4B_C(data) - arrsize
> 
> instead of
> 
> VARSIZE_ANY_EXHDR(data) - arrsize
> VARSIZE_ANY(data) - arrsize
> 
> Seems somewhat confusing.
> 

VARRAWSIZE_4B_C returns the original size of the data, before compression (from
va_rawsize in current postgres, and from va_info in my patch), not the size
of the already compressed data, so you can't use VARSIZE_ANY here.

VARSIZE_ANY_EXHDR in current postgres returns VARSIZE - VARHDRSZ regardless of
whether the varlena is compressed or not, so I just kept this behavior for
custom compressed varlenas too. If you look into tuptoaster.c you will
also see lines like 'VARSIZE(attr) - TOAST_COMPRESS_HDRSZ'. So I think that
if VARSIZE_ANY_EXHDR is to subtract different header sizes, then it
should subtract them for the usual compressed varlenas too.

> >   
> 
> Hmmm, it still doesn't work for me. See this:
> 
> test=# create extension pg_lz4 ;
> CREATE EXTENSION
> test=# create table t_lz4 (v text compressed lz4);
> CREATE TABLE
> test=# create table t_pglz (v text);
> CREATE TABLE
> test=# insert into t_lz4 select repeat(md5(1::text),300);
> INSERT 0 1
> test=# insert into t_pglz select * from t_lz4;
> INSERT 0 1
> test=# drop extension pg_lz4 cascade;
> NOTICE:  drop cascades to 2 other objects
> DETAIL:  drop cascades to compression options for lz4
> drop cascades to table t_lz4 column v
> DROP EXTENSION
> test=# \c test
> You are now connected to database "test" as user "user".
> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
> test=# select * from t_pglz ;
> ERROR:  cache lookup failed for compression options 16419
> 
> That suggests no recompression happened.

I will check that. Is your extension published somewhere?




Re: [HACKERS] pgbench regression test failure

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 20, 2017 at 1:40 PM, Tom Lane  wrote:
>> I dunno, it just looks odd to me that when we've set up a test case in
>> which every one of the transactions is guaranteed to exceed the latency
>> limit, that it doesn't say that they all did.  I don't particularly buy
>> your assumption that the percentages should sum.  Anybody else have an
>> opinion there?

> I agree with you, but I don't think either approach is free from
> possible confusion.  I think it would help to show the numerator and
> the denominator explicitly, e.g.:

> number of clients: 1
> number of threads: 1
> number of transactions per client: 100
> number of transactions actually processed: 33/100
> number of transactions skipped: 67 (67.000 %)
> number of transactions above the 1.0 ms latency limit: 33 (33 of 33, 100.000 %)

> (My proposed change is in the last line.)

Seems like a good idea, but the way you've written it is inconsistent
with the "n/m" notation used just above.  I'd suggest

... latency limit: 33 (33/33, 100.000 %)

or just

... latency limit: 33/33 (100.000 %)

regards, tom lane



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada  wrote:
> Attached updated version patch. I've moved only relation extension
> locks out of heavy-weight lock as per discussion so far.
>
> I've done a write-heavy benchmark on my laptop; loading 24kB data to
> one table using COPY by 1 client, for 10 seconds. The through-put of
> patched is 10% better than current HEAD. The result of 5 times is the
> following.
>
> - PATCHED -
> tps = 178.791515 (excluding connections establishing)
> tps = 176.522693 (excluding connections establishing)
> tps = 168.705442 (excluding connections establishing)
> tps = 158.158009 (excluding connections establishing)
> tps = 161.145709 (excluding connections establishing)
>
> - HEAD -
> tps = 147.079803 (excluding connections establishing)
> tps = 149.079540 (excluding connections establishing)
> tps = 149.082275 (excluding connections establishing)
> tps = 148.255376 (excluding connections establishing)
> tps = 145.542552 (excluding connections establishing)
>
> Also I've done a micro-benchmark; calling LockRelationForExtension and
> UnlockRelationForExtension tightly in order to measure the number of
> lock/unlock cycles per second. The result is,
> PATCHED = 3.95892e+06 (cycles/sec)
> HEAD = 1.15284e+06 (cycles/sec)
> The patched version is about 3 times faster than current HEAD.
>
> Attached updated patch and the function I used for micro-benchmark.
> Please review it.

That's a nice speed-up.

How about a preliminary patch that asserts that we never take another
heavyweight lock while holding a relation extension lock?
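
As a standalone illustration of that idea (hypothetical names, not taken from
any patch): track a flag while the extension lock is held, and assert that it
is clear whenever a heavyweight lock is requested.

#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch only, not PostgreSQL code. */
static bool extension_lock_held = false;

static void
lock_relation_for_extension(void)
{
    /* ... acquire the relation extension lock ... */
    extension_lock_held = true;
}

static void
unlock_relation_for_extension(void)
{
    /* ... release the relation extension lock ... */
    extension_lock_held = false;
}

static void
lock_acquire_heavyweight(void)
{
    /* the proposed preliminary check */
    assert(!extension_lock_held);
    /* ... acquire a heavyweight lock ... */
}

int
main(void)
{
    lock_relation_for_extension();
    unlock_relation_for_extension();
    lock_acquire_heavyweight();    /* fine: extension lock already released */
    return 0;
}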

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Merlin Moncure
On Tue, Nov 21, 2017 at 12:20 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Nov 21, 2017 at 11:32 AM, Merlin Moncure  wrote:
>>> I am very much looking at the new stored procedure functionality and
>>> imaging a loop like this:
>>>
>>> LOOP
>>>   FOR r IN SELECT * FROM pg_get_notifications(30)
>>>   LOOP
>>> PERFORM do_stuff(r);
>>>   END LOOP;
>>>   COMMIT;  -- advance xmin etc
>>> END LOOP;
>
>> Yeah, if you keep the timeout fairly short, it would probably work OK
>> (with Peter's stuff).
>
> Traditionally, NOTIFY messages are delivered to the client only between
> transactions, so that there is no question about whether the
> message-delivery should roll back if the surrounding transaction aborts.
> It's not very clear to me what the behavior of pg_get_notifications()
> inside a transaction ought to be.  Is it OK if it's a volatile function
> and the messages are just gone once the function has returned them,
> even if you fail to do anything about them because your transaction
> fails later?

I think destroying upon consumption is OK.  There are a lot of
mitigation strategies to deal with that issue and NOTIFY is for
signalling, not queuing.

> (I'd be against having a function that returns more than one at a time,
> in any case, as that just complicates matters even more.)

ok.

merlin



Re: [HACKERS] Parallel Append implementation

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 6:57 AM, amul sul  wrote:
> The following change to the v19 patch fixes it for me:
>
> --- a/src/backend/executor/nodeAppend.c
> +++ b/src/backend/executor/nodeAppend.c
> @@ -489,11 +489,9 @@ choose_next_subplan_for_worker(AppendState *node)
> }
>
> /* Pick the plan we found, and advance pa_next_plan one more time. */
> -   node->as_whichplan = pstate->pa_next_plan;
> +   node->as_whichplan = pstate->pa_next_plan++;
> if (pstate->pa_next_plan == node->as_nplans)
> pstate->pa_next_plan = append->first_partial_plan;
> -   else
> -   pstate->pa_next_plan++;
>
> /* If non-partial, immediately mark as finished. */
> if (node->as_whichplan < append->first_partial_plan)
>
> The attached patch makes the same changes to Amit's ParallelAppend_v19_rebased.patch.

Yes, that looks like a correct fix.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: protocol version negotiation (Re: Libpq PGRES_COPY_BOTH - version compatibility)

2017-11-21 Thread Robert Haas
On Sun, Nov 19, 2017 at 7:49 PM, Badrul Chowdhury  wrote:
>>> I spent a little more time looking at this patch today.  I think that the
>>> patch should actually send NegotiateProtocolVersion when *either* the
>>> requested version differs from the latest one we support *or* an
>>> unsupported protocol option is present.  Otherwise, you only find out about
>>> unsupported protocol options if you also request a newer minor version,
>>> which isn't good, because it makes it hard to add new protocol options
>>> *without* bumping the protocol version.
>
> It makes sense from a maintainability point of view.
>
>>> Here's an updated version with that change and a proposed commit message.
>
> I have tested the new patch and it works great. The comments look good as 
> well.

Committed and back-patched to all supported branches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] SQL procedures

2017-11-21 Thread Andrew Dunstan


On 11/20/2017 04:25 PM, I wrote:
> I've been through this fairly closely, and I think it's pretty much
> committable. The only big item that stands out for me is the issue of
> OUT parameters.
>
> While returning multiple result sets will be a useful feature, it's also
> very common in my experience for stored procedures to have scalar out
> params as well. I'm not sure how we should go about providing for it,
> but I think we need to be sure we're not closing any doors.
>
> Here, for example, is how the MySQL stored procedure feature works with
> JDBC:
> 
> I think it will be OK if we use cursors to return multiple result sets,
> along the lines of Peter's next patch, but we shouldn't regard that as
> the end of the story. Even if we don't provide for it in this go round
> we should aim at eventually providing for stored procedure OUT params.
>
>
>



Of course it's true that we could replace a scalar OUT parameter with a
one-row resultset if we have support for returning multiple resultsets from
SPs. But it's different from the common use pattern and a darn sight more
cumbersome to use.

cheers

andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Does XMLSERIALIZE output xmlattributes in a stable order?

2017-11-21 Thread Chapman Flack
On 11/21/2017 11:49 AM, Tom Lane wrote:

> AFAICS, XMLSERIALIZE in our current implementation boils down to
> being a binary-compatible coercion from XML (which is stored as
> a string) to text.  So the interesting question here is where are
> you getting the XML values from?  The stability of the results is
> going to be whatever the next level down does.

Well, constructed using xmlelement and xmlattributes in a big query.
The structure of the query does not change from one run to the next.

So as long as the internal XML form is essentially already serialized,
I guess it comes down to what xmlattributes(...) inside xmlelement
produces. If that is stable, say in the order of the attribute
arguments, then that probably fits the bill.
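
A quick, self-contained check could be something like this (hypothetical
example; the expected output assumes attributes come out in argument order,
which is exactly the open question):

SELECT xmlserialize(content
         xmlelement(name e,
                    xmlattributes('1' AS foo, '2' AS bar, '3' AS baz))
         AS text);
-- if the order is stable in argument order, this should always yield
-- <e foo="1" bar="2" baz="3"/>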

I don't see that clearly addressed in the docs for xmlattributes
either. If something is added to the docs, it's probably worth
mentioning at XMLSERIALIZE anyway, keeping the fact that the internal
form is already serialized as more of an implementation detail.

-Chap


select
xmlserialize(document xmlroot(
  xmlelement(name top,
xmlattributes(
  foo,
  bar,
  baz
),
xmlelement(name things,
  (
  select
xmlagg(
  xmlelement(name thing,
xmlattributes(
  name,
  importance,
  awesomeness,
  fluidity
),
case when comment is not null then
  xmlelement(name comment, comment)
end,
(
select
  xmlagg(
xmlelement(name property,
  xmlattributes(setting)
)
  )
from
  unnest(properties) as q(setting)
),
(
select
  xmlagg(
xmlelement(name referenced,
  xmlattributes(
linksrc as source
  )
)
  )
from
  (
  select distinct
...



Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 21, 2017 at 11:32 AM, Merlin Moncure  wrote:
>> I am very much looking at the new stored procedure functionality and
>> imaging a loop like this:
>> 
>> LOOP
>>   FOR r IN SELECT * FROM pg_get_notifications(30)
>>   LOOP
>> PERFORM do_stuff(r);
>>   END LOOP;
>>   COMMIT;  -- advance xmin etc
>> END LOOP;

> Yeah, if you keep the timeout fairly short, it would probably work OK
> (with Peter's stuff).

Traditionally, NOTIFY messages are delivered to the client only between
transactions, so that there is no question about whether the
message-delivery should roll back if the surrounding transaction aborts.
It's not very clear to me what the behavior of pg_get_notifications()
inside a transaction ought to be.  Is it OK if it's a volatile function
and the messages are just gone once the function has returned them,
even if you fail to do anything about them because your transaction
fails later?

(I'd be against having a function that returns more than one at a time,
in any case, as that just complicates matters even more.)

regards, tom lane



Re: View with duplicate GROUP BY entries

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 21, 2017 at 12:05 PM, Tom Lane  wrote:
>> Yeah, we probably ought to make more of an effort to regenerate the
>> original query wording.  I do not think that forcing positional notation
>> is a suitable answer in this case, because it would result in converting
>> SQL-standard queries to nonstandard ones.

> Who cares?  The other end is presumptively PostgreSQL, because this
> is postgres_fdw.

No, you missed the context.  Yes, the original problem is in postgres_fdw,
and there indeed it seems fine to emit GROUP BY 1,2.  What Ashutosh is
pointing out is that ruleutils.c can emit a representation of a view
that fails to preserve its original semantics, thus causing dump/reload
problems that have nothing at all to do with FDWs.  And what I'm pointing
out is that we don't like pg_dump to emit nonstandard representations
of objects that were created with perfectly standard-compliant queries;
therefore emitting GROUP BY 1,2 isn't good if the query wasn't spelled
like that to begin with.

regards, tom lane



Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 11:32 AM, Merlin Moncure  wrote:
>> I think that wouldn't work very well, because I think we must have a
>> snapshot open in order to run pg_get_notifications(), and that means
>> we're holding back the system-wide xmin.
>
> I am very much looking at the new stored procedure functionality and
> imaging a loop like this:
>
> LOOP
>   FOR r IN SELECT * FROM pg_get_notifications(30)
>
>   LOOP
> PERFORM do_stuff(r);
>   END LOOP;
>
>   COMMIT;  -- advance xmin etc
> END LOOP;
>
> ...I'm obviously speculatively thinking ahead to Peter's stored
> procedure work seeing the light of day (which, based on the utility vs
> the simplicity of the patch and how it works in testing I'm very
> optimistic about).  The above would provide real time response to
> certain actions I do now with polling, typically in bash.  Without
> stored procedures, I agree that this would be a foot gun.

Yeah, if you keep the timeout fairly short, it would probably work OK
(with Peter's stuff).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: View with duplicate GROUP BY entries

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 12:05 PM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
> >> While reviewing a patch for a similar problem in postgres_fdw [1], I
> >> noticed that we don't use positional notation while creating the view.
> >> This might introduce anomalies when GROUP BY entries are
>> non-immutable.
>
> Yeah, we probably ought to make more of an effort to regenerate the
> original query wording.  I do not think that forcing positional notation
> is a suitable answer in this case, because it would result in converting
> SQL-standard queries to nonstandard ones.

Who cares?  The other end is presumptively PostgreSQL, because this
is postgres_fdw.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: to_typemod(type_name) information function

2017-11-21 Thread Stephen Frost
Greetings, Sophie!

* Sophie Herold (sophi...@hemio.de) wrote:
> I did not find any advice on how to choose a new OID for pg_proc.

(Haven't looked at the patch itself yet really, but wanted to answer
this.)

The main thing is to not duplicate the OID, which you can avoid by
calling 'unused_oids' in src/include/catalog.  That will then return a
list of OIDs that haven't been used yet.  Generally speaking, for a case
where you only need one OID, grabbing one from any of the blocks listed
is fine, though it doesn't hurt to check and see what the nearby used
OIDs were for and if there might be some reason to keep a particular OID
free for future use (just for grouping convenience with other related
things).

Generally though, it's not something you have to worry about too much,
just try to avoid duplicating them.  Even then, if you do, most likely
the committer who picks the patch up will realize it and adjust
accordingly.

Thanks!

Stephen




Re: View with duplicate GROUP BY entries

2017-11-21 Thread Tom Lane
Ashutosh Bapat  writes:
> While reviewing a patch for a similar problem in postgres_fdw [1], I
> noticed that we don't use positional notation while creating the view.
> This might introduce anomalies when GROUP BY entries are
> non-immutable.

Yeah, we probably ought to make more of an effort to regenerate the
original query wording.  I do not think that forcing positional notation
is a suitable answer in this case, because it would result in converting
SQL-standard queries to nonstandard ones.  We might have to extend the
parsetree representation so that we can tell whether positional notation
was used to begin with.

regards, tom lane



Re: Does XMLSERIALIZE output xmlattributes in a stable order?

2017-11-21 Thread Tom Lane
Chapman Flack  writes:
> Suppose I have a query that generates some XML content, and I want
> to do this on a periodic schedule and check the resulting XML into
> a version control system.
> ...
> But then, there are the attributes of elements. Order of attributes
> is not significant in XML, and is not required (by the "XML Infoset"
> standard) to be preserved. Nevertheless, it would be a useful
> property (for a purpose like I've described) if XMLSERIALIZE were
> known to at least produce the attributes in some consistent order
> across evaluations of the same query.

> Is that true of the implementation in PostgreSQL? I might find out
> with a quick test, but it seemed worth explicitly asking.

AFAICS, XMLSERIALIZE in our current implementation boils down to
being a binary-compatible coercion from XML (which is stored as
a string) to text.  So the interesting question here is where are
you getting the XML values from?  The stability of the results is
going to be whatever the next level down does.

regards, tom lane



Does XMLSERIALIZE output xmlattributes in a stable order?

2017-11-21 Thread Chapman Flack
Suppose I have a query that generates some XML content, and I want
to do this on a periodic schedule and check the resulting XML into
a version control system.

To avoid spurious diffs, I know I can control the order of child
elements generated by xmlagg by slipping an ORDER BY into the
aggregate expression.
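
For example (a minimal illustration, not from my actual query):

SELECT xmlagg(xmlelement(name item, xmlattributes(t.id AS id))
              ORDER BY t.id)
FROM (VALUES (2), (1), (3)) AS t(id);
-- always emits the <item> children in id order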

But then, there are the attributes of elements. Order of attributes
is not significant in XML, and is not required (by the "XML Infoset"
standard) to be preserved. Nevertheless, it would be a useful
property (for a purpose like I've described) if XMLSERIALIZE were
known to at least produce the attributes in some consistent order
across evaluations of the same query.

Is that true of the implementation in PostgreSQL? I might find out
with a quick test, but it seemed worth explicitly asking.

This is subtle enough that, if it's true, it is probably worth
mentioning in the docs. (If it isn't true, it might even be worth
making it true, then mentioning it in the docs.) While [XML Infoset]
does say that an element's attributes are an "unordered set",
[XQuery and XPath Data Model (XDM)] says "the order of Attribute Nodes
is stable but implementation dependent", and it's the latter document
that's referenced by [XSLT and XQuery Serialization], which is the
standard upon which XMLSERIALIZE is defined in [SQL/XML]. (Phew!)

-Chap



Re: Inlining functions with "expensive" parameters

2017-11-21 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> I don't quite follow the need for this.  I mean, if we just stick a
>> Param reference in there and create a corresponding InitPlan, the
>> Param will be evaluated on demand, right?  Is the point of the new
>> node to make sure that the Param gets re-evaluated when needed?

> Yeah.  By default, an initplan is only going to be evaluated once per
> query, not once per row.  It's possible that you could set up Param
> dependencies to force the correct semantics, but that would be more
> complicated, and probably less performant, than introducing a new
> expression node type.  Also, I'm dubious that it could be made to work
> at all for standalone expression evaluation; Param-driven re-evaluation
> is only considered when running a plan tree.

BTW, thinking about that a bit more, standalone expressions (eg CHECK
constraints) are possibly going to need some work anyway to make this go
through.  I do not think there's any mechanism in place at the moment
to set up a ParamListInfo array for such an expression.  While it's
conceivable that the expression eval machinery could be set up to not
need one, I'm not real sure, if we try to make these things be
PARAM_EXEC Params.

This suggests that maybe the Params used for LambdaExprs should be a
distinct kind of Param, not the same PARAM_EXEC type that's used for
Params supplied externally to eval of a particular expression.  That
would reflect the fact that expression eval handles them completely
internally.

I'm also wondering about folding CaseTestExpr and CoerceToDomainValue
into the same mechanism.  It's not very hard to see those cases as
being the same as a function-based lambda.

regards, tom lane



Re: Inlining functions with "expensive" parameters

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Thu, Nov 16, 2017 at 2:51 PM, Andres Freund  wrote:
>> Right, but it doesn't sound that hard to introduce. Basically there'd
>> need to be a WithParamValue node,  that first evaluates parameters and
>> then executes the child expression.

> I don't quite follow the need for this.  I mean, if we just stick a
> Param reference in there and create a corresponding InitPlan, the
> Param will be evaluated on demand, right?  Is the point of the new
> node to make sure that the Param gets re-evaluated when needed?

Yeah.  By default, an initplan is only going to be evaluated once per
query, not once per row.  It's possible that you could set up Param
dependencies to force the correct semantics, but that would be more
complicated, and probably less performant, than introducing a new
expression node type.  Also, I'm dubious that it could be made to work
at all for standalone expression evaluation; Param-driven re-evaluation
is only considered when running a plan tree.

regards, tom lane



Re: IndexTupleDSize macro seems redundant

2017-11-21 Thread Amit Kapila
On Mon, Nov 20, 2017 at 9:01 PM, Ildar Musin  wrote:
> Hi all,
>
> While I was looking through the indexes code I got confused by a couple of
> macros - IndexTupleSize() and IndexTupleDSize() - which seem to do the same
> thing, the only difference being that the first one takes a pointer as an
> argument while the second one takes a struct. And in most cases
> IndexTupleDSize() is used with a dereferenced index tuple, where
> IndexTupleSize() would suit perfectly. Is there a particular reason to have
> them both? I've made a patch that removes the IndexTupleDSize macro. All the
> tests have passed.
>

+1.  I was also once confused by these macros.  I think this is a
good cleanup.  On a quick look, I don't see any problem with your
changes.
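
For reference, the two macros in question are defined in
src/include/access/itup.h roughly as follows (quoting from memory, so the
exact formatting may differ):

#define IndexTupleSize(itup) \
    ((Size) (((IndexTuple) (itup))->t_info & INDEX_SIZE_MASK))
#define IndexTupleDSize(itup) \
    ((Size) ((itup).t_info & INDEX_SIZE_MASK))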

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-21 Thread Robert Haas
On Fri, Nov 17, 2017 at 7:24 AM, Ashutosh Bapat
 wrote:
> + * this value should be multiplied with cpu_tuple_cost wherever applicable.
> + */
> +#define DEFAULT_APPEND_COST_FACTOR 0.5
>
> I am wondering whether we should just define
> #define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
> and use this macro everywhere. What other use would DEFAULT_APPEND_COST_FACTOR
> have than being multiplied with cpu_tuple_cost?

-1.  If you wrap it in a macro like that, future readers of the code
will have to go look up what the macro does.  If you just multiply by
DEFAULT_APPEND_COST_FACTOR it will be clear that's what being used is
a multiple of cpu_tuple_cost.

> I am not sure whether we should be discussing why this technique performs
> better or when it performs better. We don't have similar discussion for
> partition-wise join. That paragraph just describes the technique and maybe we
> want to do the same here.

+1.

> + * might be optimal because of presence of suitable paths with pathkeys or
> + * because the hash tables for most of the partitions fit into the memory.
> + * However, if partition keys are not part of the group by clauses, then we
> + * still able to compute the partial aggregation for each partition and then
> + * finalize them over append. This can either win or lose. It may win if the
> + * PartialAggregate stage greatly reduces the number of groups and lose if we
> + * have lots of small groups.
>
> I have not seen the prologue of a function implementing a query optimization
> technique explain why that technique improves performance. So I am not sure
> whether the comment should include this explanation. One of the reasons being
> that the reasons why a technique works might change over time
> with the introduction of other techniques, thus obsoleting the comment.  But
> may be it's good to have it here.

+1 for keeping it.

> +extra.inputRows = 0;/* Not needed at child paths creation */
>
> Why? Comment should be on its own line.

Comments on same line are fine if they are short enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[PATCH] using arc4random for strong randomness matters.

2017-11-21 Thread David CARLIER
Hello,

This is my first small personal contribution.

Motivation :
- Uses a fail-safe, file-descriptor-free solution on *BSD and Darwin systems
(a minimal usage sketch follows below).
- Avoids FreeBSD for the moment, as it still uses RC4 (seemingly updated to
ChaCha20 for FreeBSD 12.0, and possibly backported later on).
- Implementations based on ChaCha* are known to be fast enough for the purpose.
- make check passes.
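
For illustration, a minimal standalone use of the API in question looks like
this (not taken from the patch):

#include <stdio.h>
#include <stdlib.h>    /* arc4random_buf() on *BSD and Darwin */

int
main(void)
{
    unsigned char buf[16];

    /* never fails and needs no file descriptor */
    arc4random_buf(buf, sizeof(buf));

    for (size_t i = 0; i < sizeof(buf); i++)
        printf("%02x", buf[i]);
    printf("\n");
    return 0;
}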

Hope it is good.

Thanks in advance.


0001-Using-arc4random-API-for-strong-randomness-if-availa.patch
Description: Binary data


Re: [HACKERS] Parallel Append implementation

2017-11-21 Thread amul sul
On Tue, Nov 21, 2017 at 2:22 PM, Amit Khandekar  wrote:
> On 21 November 2017 at 12:44, Rafia Sabih  
> wrote:
>> On Mon, Nov 13, 2017 at 12:54 PM, Amit Khandekar  
>> wrote:
>>> Thanks a lot Robert for the patch. I will have a look. Quickly tried
>>> to test some aggregate queries with a partitioned pgbench_accounts
>>> table, and it is crashing. Will get back with the fix, and any other
>>> review comments.
>>>
>>> Thanks
>>> -Amit Khandekar
>>
>> I was trying to get the performance of this patch at commit id -
>> 11e264517dff7a911d9e6494de86049cab42cde3 and TPC-H scale factor 20
>> with the following parameter settings,
>> work_mem = 1 GB
>> shared_buffers = 10GB
>> effective_cache_size = 10GB
>> max_parallel_workers_per_gather = 4
>> enable_partitionwise_join = on
>>
>> and the details of the partitioning scheme is as follows,
>> tables partitioned = lineitem on l_orderkey and orders on o_orderkey
>> number of partitions in each table = 10
>>
>> As per the explain outputs PA was used in following queries- 1, 3, 4,
>> 5, 6, 7, 8, 10, 12, 14, 15, 18, and 21.
>> Unfortunately, at the time of executing any of these query, it is
>> crashing with the following information in  core dump of each of the
>> workers,
>>
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x10600984 in pg_atomic_read_u32_impl (ptr=0x3ec29294)
>> at ../../../../src/include/port/atomics/generic.h:48
>> 48 return ptr->value;
>>
>> In case this a different issue as you pointed upthread, you may want
>> to have a look at this as well.
>> Please let me know if you need any more information in this regard.
>
> Right, for me the crash had occurred with a similar stack, although
> the real crash happened in one of the workers. Attached is the script
> file
> pgbench_partitioned.sql to create a schema with which I had reproduced
> the crash.
>
> The query that crashed :
> select sum(aid), avg(aid) from pgbench_accounts;
>
> Set max_parallel_workers_per_gather to at least 5.
>
> Also attached is v19 patch rebased.
>

I've spent a little time debugging this crash. The crash happens in ExecAppend()
because the subnode in the node->appendplans array is referenced using an
incorrect (out-of-bound) array index in the following code:

/*
 * figure out which subplan we are currently processing
 */
subnode = node->appendplans[node->as_whichplan];

This incorrect value gets assigned to node->as_whichplan in
choose_next_subplan_for_worker().

The following change to the v19 patch fixes it for me:

--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -489,11 +489,9 @@ choose_next_subplan_for_worker(AppendState *node)
}

/* Pick the plan we found, and advance pa_next_plan one more time. */
-   node->as_whichplan = pstate->pa_next_plan;
+   node->as_whichplan = pstate->pa_next_plan++;
if (pstate->pa_next_plan == node->as_nplans)
pstate->pa_next_plan = append->first_partial_plan;
-   else
-   pstate->pa_next_plan++;

/* If non-partial, immediately mark as finished. */
if (node->as_whichplan < append->first_partial_plan)

The attached patch makes the same changes to Amit's ParallelAppend_v19_rebased.patch.

Regards,
Amul


fix_crash.patch
Description: Binary data


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 21 Nov 2017 20:27:25 +0900, atorikoshi 
 wrote in 

> Thanks for reviewing!
> 
> 
> On 2017/11/21 18:12, Masahiko Sawada wrote:
> > On Tue, Nov 21, 2017 at 3:48 PM, Masahiko Sawada
> >  wrote:
> >> On Mon, Nov 20, 2017 at 7:35 PM, atorikoshi
> >>  wrote:
> >>> Hi,
> >>>
> >>> I put many queries into one transaction and made ReorderBuffer spill
> >>> data to disk, and sent SIGKILL to postgres before the end of the
> >>> transaction.
> >>>
> >>> After starting up postgres again, I observed that the files spilled to
> >>> disk weren't deleted.
> >>>
> >>> I think these files should be deleted because their transaction is no
> >>> longer valid, so no one can use these files.
> >>>
> >>>
> >>> Below is a reproduction instructions.
> >>>
> >>> 
> >>> 1. Create table and publication at publisher.
> >>>
> >>>   @pub =# CREATE TABLE t1(
> >>>   id INT PRIMARY KEY,
> >>>   name TEXT);
> >>>
> >>>   @pub =# CREATE PUBLICATION pub FOR TABLE t1;
> >>>
> >>> 2. Create table and subscription at subscriber.
> >>>
> >>>   @sub =# CREATE TABLE t1(
> >>>   id INT PRIMARY KEY,
> >>>   name TEXT
> >>>   );
> >>>
> >>>   @sub =# CREATE SUBSCRIPTION sub
> >>>   CONNECTION 'host=[hostname] port=[port] dbname=[dbname]'
> >>>   PUBLICATION pub;
> >>>
> >>> 3. Put many queries into one transaction.
> >>>
> >>>   @pub =# BEGIN;
> >>> INSERT INTO t1
> >>> SELECT
> >>>   i,
> >>>   'aa'
> >>> FROM
> >>> generate_series(1, 100) as i;
> >>>
> >>> 4. Then we can see spilled files.
> >>>
> >>>   @pub $ ls -1 ${PGDATA}/pg_replslot/sub/
> >>> state
> >>> xid-561-lsn-0-100.snap
> >>> xid-561-lsn-0-200.snap
> >>> xid-561-lsn-0-300.snap
> >>> xid-561-lsn-0-400.snap
> >>> xid-561-lsn-0-500.snap
> >>> xid-561-lsn-0-600.snap
> >>> xid-561-lsn-0-700.snap
> >>> xid-561-lsn-0-800.snap
> >>> xid-561-lsn-0-900.snap
> >>>
> >>> 5. Kill publisher's postgres process before COMMIT.
> >>>
> >>>   @pub $ kill -s SIGKILL [pid of postgres]
> >>>
> >>> 6. Start publisher's postgres process.
> >>>
> >>>   @pub $ pg_ctl start -D ${PGDATA}
> >>>
> >>> 7. After a while, we can see the files remaining.
> >>>   (Immediately after starting the publisher, we cannot see these files.)
> >>>
> >>>   @pub $ pg_ctl start -D ${PGDATA}
> >>>
> >>>   When I configured with '--enable-cassert', below assertion error
> >>>   was appeared.
> >>>
> >>> TRAP: FailedAssertion("!(txn->final_lsn != 0)", File:
> >>> "reorderbuffer.c",
> >>> Line: 2576)
> >>> 
> >>>
> >>> Attached patch sets final_lsn to the last ReorderBufferChange if
> >>> final_lsn == 0.
> >>
> >> Thank you for the report. I could reproduce this issue with the above
> >> steps. My analysis is that the cause of a serialized reorder buffer
> >> not being cleaned up is that an aborted transaction without an abort WAL
> >> record has no chance to set ReorderBufferTXN->final_lsn. So if there
> >> are such serialized transactions, ReorderBufferRestoreCleanup cleans up
> >> no files, which is the cause of the assertion failure (or a file being
> >> orphaned). What do you think?
> >>
> >> On the details of your patch, I'm not sure it's safe if we set final_lsn
> >> to the lsn of a record other than a commit or abort record. The comment
> >> in reorderbuffer.h says,
> >>
> >> typedef struct ReorderBufferTXN
> >> {
> >> (snip)
> >>
> >> /* 
> >>  * LSN of the record that lead to this xact to be committed or
> >>  * aborted. This can be a
> >>  * * plain commit record
> >>  * * plain commit record, of a parent transaction
> >>  * * prepared transaction commit
> >>  * * plain abort record
> >>  * * prepared transaction abort
> >>  * * error during decoding
> >>  * 
> >>  */
> >> XLogRecPtr  final_lsn;
> >>
> >> But with your patch, we could set final_lsn to the lsn of a record other
> >> than what is listed above. One way I came up with is to make
> 
> I added some comments on final_lsn.
> 
> >> ReorderBufferRestoreCleanup accept an invalid value of final_lsn and
> >> regard it as an aborted transaction that doesn't have an abort WAL
> >> record. So we can clean up all serialized files if the final_lsn of a
> >> transaction is invalid.
> >
> > After more thought, since there is some code cosmetically setting
> > final_lsn when the fate of the transaction is determined, possibly we
> > should not accept an invalid value of final_lsn even in that case.
> >

It doesn't seem to be just cosmetic. The first/last_lsn are used to
determine the files to be deleted. On the other hand, the TXN
cannot have a last_lsn since it hasn't seen an abort/commit record.
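
For illustration, the kind of change being discussed here, deriving final_lsn
from the transaction's last change when no commit/abort record was seen, might
look roughly like this (a hypothetical fragment, not the posted patch):

    /* hypothetical sketch, not the actual patch */
    if (txn->final_lsn == InvalidXLogRecPtr && !dlist_is_empty(&txn->changes))
    {
        ReorderBufferChange *last_change =
            dlist_container(ReorderBufferChange, node,
                            dlist_tail_node(&txn->changes));

        txn->final_lsn = last_change->lsn;
    }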

> My new patch keeps setting final_lsn, but changed its 

Re: [HACKERS] path toward faster partition pruning

2017-11-21 Thread Rajkumar Raghuwanshi
On Fri, Nov 17, 2017 at 3:31 PM, Amit Langote  wrote:

>
> Support for hash partitioning and tests for the same.  Also, since
> update/delete on partitioned tables still depend on constraint exclusion
> for pruning, fix things such that get_relation_constraints includes
> partition constraints in its result only for non-select queries (for
> selects we have the new pruning code).  Other bug fixes.
>
> Hi Amit,

I have applied the attached patch set on commit
11e264517dff7a911d9e6494de86049cab42cde3 and tried to test hash
partitioning. I got a server crash with the test below.

CREATE TABLE hp_tbl (a int, b int, c int) PARTITION BY HASH (a);
CREATE TABLE hp_tbl_p1 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4,
remainder 0) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p1_p1 PARTITION OF hp_tbl_p1 FOR VALUES WITH (modulus
4, remainder 0) PARTITION BY HASH (c);
CREATE TABLE hp_tbl_p1_p1_p1 PARTITION OF hp_tbl_p1_p1 FOR VALUES WITH
(modulus 4, remainder 0);
CREATE TABLE hp_tbl_p2 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4,
remainder 1) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p2_p1 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus
4, remainder 1);
CREATE TABLE hp_tbl_p2_p2 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus
4, remainder 2);
insert into  hp_tbl select i,i,i from generate_series(0,10)i where i not
in(2,4,6,7,10);

explain select * from hp_tbl where a = 2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [HACKERS] path toward faster partition pruning

2017-11-21 Thread Kyotaro HORIGUCHI
Thank you, and sorry for the confusing comments.

At Mon, 13 Nov 2017 18:46:28 +0900, Amit Langote 
 wrote in 
<8460a6c3-68c5-b78a-7d18-d253180f2...@lab.ntt.co.jp>
> Horiguchi-san,
> 
> Thanks for taking a look.  Replying to all your emails here.

> > In 0003, 
> > 
> > +match_clauses_to_partkey(RelOptInfo *rel,
> > ...
> > +  if (rinfo->pseudoconstant &&
> > +(IsA(clause, Const) &&
> > + ((((Const *) clause)->constisnull) ||
> > +  !DatumGetBool(((Const *) clause)->constvalue))))
> > +  {
> > +*constfalse = true;
> > +continue;
> > 
> > If we call this function in both conjunction and disjunction
> > context (the latter is only in recursive case). constfalse ==
> > true means no need of any clauses for the former case.
> > 
> > Since (I think) just a list of RestrictInfo is expected to be
> > treated as a conjunction (it's quite doubious, though..),
> 
> I think it makes sense to consider a list of RestrictInfo's, such as
> baserestrictinfo, that is passed as input to match_clauses_to_partkey(),
> to be mutually conjunctive for our purpose here.

You're right and I know it. I'm ok to leave it since I recalled
that clause_selectivity always has similar code.

> On 2017/11/10 14:44, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:38:11 +0900, Kyotaro HORIGUCHI wrote:
> >
> > This is working fine. Sorry for the bogus comment.
> 
> I'd almost started looking around if something might be wrong after all. :)

Very sorry for the wrong comment:(

> On 2017/11/10 16:07, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:44:55 +0900, Kyotaro HORIGUCHI wrote:
> >
> >>> Those two conditions are not orthogonal. Maybe something like
> >>> the following seems more understandable.
> >>>
>  if (!constfalse)
>  {
>    /* No constraints on the keys, so, return *all* partitions. */
>    if (nkeys == 0)
>  return bms_add_range(result, 0, partdesc->nparts - 1);
> 
>    result = get_partitions_for_keys(relation, );
>  }
> >
> > So, the condition (!constfalse && nkeys == 0) cannot return
> > there. I'm badly confused by the variable name.
> 
> Do you mean by 'constfalse'?

Perhaps. The name "constfalse" suggests (to me) that the
clauses evaluate to false constantly. But actually it just means
that the not-in-the-return clauses result in false. Anyway I'll
take a look at v12 and comment at that time.

> > I couldn't find another reasonable structure using the current
> > classify_p_b_keys(), but could you add a comment like the
> > following as an example?
> 
> OK, will add comments explaining what's going on.
> 
> 
> Will post the updated patches after also taking care of David's and Amul's
> review comments upthread.
> 
> Thanks,
> Amit

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: default range partition and constraint exclusion

2017-11-21 Thread Amit Langote
On 2017/11/18 8:28, Robert Haas wrote:
> On Fri, Nov 17, 2017 at 12:57 AM, Amit Langote
>  wrote:
>> While working on the patch for partition pruning for declarative
>> partitioned tables, I noticed that default range partition will fail to be
>> included in a plan in certain cases due to pruning by constraint exclusion.
>> you'll notice that it doesn't explicitly say that the default partition
>> allows rows where a is null or b is null or both are null.  Given that,
>> constraint exclusion will end up concluding that the default partition's
>> constraint is refuted by a = 2.
>>
>> The attached will make the constraint to look like:
> 
> Uh, if the constraint exclusion logic we're using is drawing false
> conclusions, we need to fix it so it doesn't, not change the
> constraint so that the wrong logic gives the right answer.

I did actually consider it, but ended up concluding that constraint
exclusion is doing all it can using the provided list of partition constraint
clauses.

If all predicate_refuted_by() receives is the expression tree (AND/OR)
with individual nodes being strict clauses involving partition keys (and
nothing about the nullness of the keys), the downstream code is just
playing by the rules as explained in the header comment of
predicate_refuted_by_recurse() in concluding that query's restriction
clause a = 2 refutes it.

You may notice in the example I gave that all rows with non-null partition
keys have a non-default range partition to choose from.  So, the only
rows that exist in the default partition are those that contain null in
either or both partition keys.

The constraint that's currently generated for the default partition only
describes rows that contain non-null partition keys.  The default
partition in the example contains no such rows, so constraint exclusion is
right in concluding that it is excluded by a clause like a = 2.

By the way, I made a mistake when listing the constraint that the proposed
patch will produce; it's actually:

NOT (
  a IS NOT NULL
   AND
  b IS NOT NULL
   AND
  (
((a < 1) OR ((a = 1) AND (b < 1)))
 OR
((a > 1) OR ((a = 1) AND (b >= 1)))
  )
)

Am I missing something?

Thanks,
Amit




Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-11-21 Thread Peter Moser
2017-11-14 18:42 GMT+01:00 Tom Lane :
> You might consider putting the rewriting into, um, the rewriter.
> It could be a separate pass after view expansion, if direct integration
> with the existing behavior seems unduly spaghetti-ish.  Or do it in
> an early phase of planning as he suggested.  There's not really that
> much difference between the rewriter and the planner for this purpose.
> Although one way to draw the distinction is that the output of the
> rewriter is (currently) still fully expressible as plain SQL, whereas
> once the planner goes into action the intermediate states of the tree
> might not really be SQL anymore (eg, it might contain join types that
> don't correspond to any SQL syntax).  So depending on what your rewrite
> emits, there would be a weak preference for calling it part of the
> rewriter or planner respectively.

2017-11-16 16:42 GMT+01:00 Robert Haas :
> Another thing to think about is that even though the CURRENT
> implementation just rewrites the relevant constructs as SQL, in the
> future somebody might want to do something else.  I feel like it's not
> hard to imagine a purpose-build ALIGN or NORMALIZE join type being a
> lot faster than the version that's just done by rewriting the SQL.
> That would be more work, potentially, but it would be nice if the
> initial implementation leant itself to be extended that way in the
> future, which an all-rewriter implementation would not.  On the other
> hand, maybe an early-in-the-optimizer implementation wouldn't either,
> and maybe it's not worth worrying about it anyway.  But it would be
> cool if this worked out in a way that meant it could be further
> improved without having to change it completely.

Hi hackers,
we would like to rethink our approach...

For simplicity I'll drop ALIGN for the moment and focus solely on NORMALIZE:

SELECT * FROM (R NORMALIZE S ON R.x = S.y WITH (R.time, S.time)) c;

Our normalization executor node needs the following input (for now
expressed in plain SQL):

SELECT R.*, p1
FROM (SELECT *, row_id() OVER () rn FROM R) R
 LEFT OUTER JOIN (
SELECT y, LOWER(time) p1 FROM S
UNION
SELECT y, UPPER(time) p1 FROM S
 ) S
 ON R.x = S.y AND p1 <@ R.time
ORDER BY rn, p1;

In other words:
1) The left subquery adds a unique ID to each tuple (i.e., rn).
2) The right subquery creates two results for each input tuple: one for
   the upper and one for the lower bound of each input tuple's valid time
   column. The boundaries get put into a single (scalar) column, namely p1.
3) We join both subqueries if the normalization predicates hold (R.x = S.y)
   and p1 is inside the time of the current outer tuple.
4) Finally, we sort the result by the unique ID (rn) and p1, and return all
   columns of the outer relation, plus rn and p1.

Our first attempt to understand the new approach would be as follows: The
left base rel of the inner left-outer-join can be expressed as a WindowAgg
node. However, the right query of the join is much more difficult to build
(maybe through hash aggregates). Both queries could be put together with a
MergeJoin for instance. However, if we create the plan tree by hand and
choose algorithms for it manually, how is it possible to have it optimized
later? Or, if that is not possible, how do we choose the best algorithms
for it?

Best regards,
Anton, Johann, Michael, Peter