Re: [PATCH] Erase the distinctClause if the result is unique by definition

2020-02-07 Thread Andy Fan
Hi Ashutosh:
   Thanks for your time.

On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat 
wrote:

> Hi Andy,
> What might help is to add more description to your email message like
> giving examples to explain your idea.
>
> Anyway, I looked at the testcases you added for examples.
> +create table select_distinct_a(a int, b char(20),  c char(20) not null,
>  d int, e int, primary key(a, b));
> +set enable_mergejoin to off;
> +set enable_hashjoin to off;
> +-- no node for distinct.
> +explain (costs off) select distinct * from select_distinct_a;
> +           QUERY PLAN
> +--------------------------------
> + Seq Scan on select_distinct_a
> +(1 row)
>
> From this example, it seems that the distinct operation can be dropped
> because (a, b) is a primary key. Is my understanding correct?
>

Yes, you are correct.   Actually I added them to the commit message,
but it's true that I should have copied them into this email body as
well, so I'm copying them now.

[PATCH] Erase the distinctClause if the result is unique by
 definition

For a single relation, we can tell the result is unique if any one of
the following is true:
1. The pk is in the target list.
2. The uk is in the target list and its columns are not null.
3. The columns in the group-by clause are also in the target list.

For a relation join, we can tell it by:
if every relation in the jointree yields a unique result set, then
the final result is unique as well, regardless of the join method.
For semi/anti joins, we ignore the right table.
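
To make the single-relation rules above a bit more concrete, here is a small
sketch against the test table quoted earlier (whether the Unique node is
actually elided in each case is of course up to the patch):

```
-- rule 1: the primary key (a, b) appears in the target list
explain (costs off) select distinct a, b, d from select_distinct_a;

-- rule 3: the group-by columns (d, e) all appear in the target list
explain (costs off) select distinct d, e from select_distinct_a group by d, e;

-- (rule 2 would need a unique key over not-null columns, which this
--  table doesn't have)
```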

> I like the idea since it eliminates one expensive operation.
>
> However the patch as presented has some problems
> 1. What happens if the primary key constraint or NOT NULL constraint gets
> dropped between a prepare and execute? The plan will no more be valid and
> thus execution may produce non-distinct results.
>

Will this still be an issue if the user doesn't use a "read uncommitted"
isolation level?  I suppose it should be ok in that case.  But even so,
I should add an isolation level check for this.  I just added that to the
patch to continue the discussion of this issue.


> PostgreSQL has similar concept of allowing non-grouping expression as part
> of targetlist when those expressions can be proved to be functionally
> dependent on the GROUP BY clause. See check_functional_grouping() and its
> caller. I think, DISTINCT elimination should work on similar lines.
>
> 2. For the same reason described in check_functional_grouping(), using
> unique indexes for eliminating DISTINCT should be discouraged.
>

I checked the comments of check_functional_grouping;  the reason given is

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique constraints
 * if all their columns are known not null, but there's a problem: we need
 * to be able to represent the not-null-ness as part of the constraints added
 * to *constraintDeps.  FIXME whenever not-null constraints get represented
 * in pg_constraint.

Actually I am doubtful about that reason, since we can still get the
not-null information from relation->rd_attr->attrs[n].attnotnull,
which is just what this patch does.
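
(For reference, the existing behaviour that check_functional_grouping()
enables looks roughly like this; names are just for illustration:)

```
create table func_dep_demo (a int primary key, b text);
-- b may stay ungrouped because it is functionally dependent on the
-- grouped primary key column a; check_functional_grouping() proves this
select a, b from func_dep_demo group by a;
```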

> 3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
> as well
>

This is a good point.   The rules may be somewhat different for joins,  so I
prefer to focus on the current case for now.


> 4. The patch works only at the query level, but that functionality can be
> expanded generally to other places which add Unique/HashAggregate/Group
> nodes if the underlying relation can be proved to produce distinct rows.
> But that's probably more work since we will have to label paths with unique
> keys similar to pathkeys.
>

Do you mean adding some information into PlannerInfo,  and when we create
a node for Unique/HashAggregate/Group,  we can just create a dummy node?


> 5. Have you tested this OUTER joins, which can render inner side nullable?
>

Yes, that part was missing from the test cases.  I just added them.

On Thu, Feb 6, 2020 at 11:31 AM Andy Fan  wrote:
>
>> update the patch with considering the semi/anti join.
>>
>> Can anyone help to review this patch?
>>
>> Thanks
>>
>>
>> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan 
>> wrote:
>>
>>> Hi:
>>>
>>> I wrote a patch to erase the distinctClause if the result is unique by
>>> definition.  I found this case because a user switched some code from
>>> Oracle to PG and found the performance was bad due to this,  so I am
>>> adapting PG for it as well.
>>>
>>> This patch isn't needed for well-written SQL,  but such a drawback
>>> in a query may not be very obvious,  and since the cost of checking is
>>> pretty low,  I think it would be ok to add.
>>>
>>> Please see the patch for details.
>>>
>>> Thank you.
>>>
>>
>
> --
> --
> Best Wishes,
> Ashutosh Bapat
>


0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch
Description: Binary data


Re: Internal key management system

2020-02-07 Thread Masahiko Sawada
On Sat, 8 Feb 2020 at 03:24, Andres Freund  wrote:
>
> Hi,
>
> On 2020-02-07 20:44:31 +0900, Masahiko Sawada wrote:
> > Yeah I'm not going to use pgcrypto for transparent data encryption.
> > The KMS patch includes the new basic infrastructure for cryptographic
> > functions (mainly AES-CBC). I'm thinking we can expand that
> > infrastructure so that we can also use it for TDE purpose by
> > supporting new cryptographic functions such as AES-CTR. Anyway, I
> > agree to not have it depend on pgcrypto.
>
> I thought for a minute, before checking the patch, that you were saying
> above that the KMS patch includes its *own* implementation of
> cryptographic functions.  I think it's pretty crucial that it continues
> not to do that...

I meant that we're going to use OpenSSL for AES encryption and
decryption independent of pgcrypto's openssl code, as the first step.
That is, KMS is available only when configured --with-openssl. And
hopefully we eventually merge this openssl code and have pgcrypto use
it, as we did when we introduced SCRAM.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




POC: rational number type (fractions)

2020-02-07 Thread Joe Nelson
Hi hackers, attached is a proof of concept patch adding a new base type
called "rational" to represent fractions. It includes arithmetic,
simplification, conversion to/from float, finding intermediates with a
Stern-Brocot tree, custom aggregates, and btree/hash indices.

The primary motivation was as a column type to support user-defined
ordering of rows (with the ability to dynamically rearrange rows). The
postgres wiki has a page [0] about this using pairs of integers to
represent fractions, but it's not particularly elegant.
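
As a rough sketch of that use case (names here are illustrative;
rational_intermediate is the Stern-Brocot intermediate finder as in the
pg_rational extension, see the regression test for the actual interface):

```
-- a column to order by; btree support comes with the patch
create table todo (task text, pos rational);
create index on todo (pos);

insert into todo values ('write docs', '1/1'), ('add tests', '2/1');

-- squeeze a new row between the two existing ones without renumbering
-- anything, by picking a fraction strictly between their positions
insert into todo values ('fix bug', rational_intermediate('1/1', '2/1'));

select task from todo order by pos;
```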

I wrote about options for implementing user-defined order in an article
[1] and ended up creating a postgres extension, pg_rational [2], to
provide the new type. People have been using the extension, but told me
they wished they could use it on hosted platforms like Amazon RDS which
have a limited set of whitelisted extensions. Thus I'm submitting this
patch to discuss getting the feature in core postgres.

For usage, see the included regression test. To see how it works for the
user-defined order use case see my article. I haven't included docs in
the patch since the interface may change with community feedback.

0: https://wiki.postgresql.org/wiki/User-specified_ordering_with_fractions
1: https://begriffs.com/posts/2018-03-20-user-defined-order.html
2: https://github.com/begriffs/pg_rational

-- 
Joe Nelson  https://begriffs.com
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..10bdcff5dc 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -80,6 +80,7 @@ OBJS = \
 	rangetypes_selfuncs.o \
 	rangetypes_spgist.o \
 	rangetypes_typanalyze.o \
+	rational.o \
 	regexp.o \
 	regproc.o \
 	ri_triggers.o \
diff --git a/src/backend/utils/adt/rational.c b/src/backend/utils/adt/rational.c
new file mode 100644
index 00..ab91198da3
--- /dev/null
+++ b/src/backend/utils/adt/rational.c
@@ -0,0 +1,637 @@
+/*-
+ *
+ * rational.c
+ *	  Rational number data type for the Postgres database system
+ *
+ * Copyright (c) 1998-2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/rational.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "access/hash.h"
+#include "common/int.h"			/* portable overflow detection */
+#include "libpq/pqformat.h"		/* send/recv functions */
+#include 
+#include 
+
+PG_MODULE_MAGIC;
+
+typedef struct
+{
+	int32		numer;
+	int32		denom;
+}			Rational;
+
+static int32 gcd(int32, int32);
+static bool simplify(Rational *);
+static int32 cmp(Rational *, Rational *);
+static void neg(Rational *);
+static Rational * create(long long, long long);
+static Rational * add(Rational *, Rational *);
+static Rational * mul(Rational *, Rational *);
+static void mediant(Rational *, Rational *, Rational *);
+
+/*
+ * IO **
+ */
+
+PG_FUNCTION_INFO_V1(rational_in);
+PG_FUNCTION_INFO_V1(rational_in_float);
+PG_FUNCTION_INFO_V1(rational_out);
+PG_FUNCTION_INFO_V1(rational_out_float);
+PG_FUNCTION_INFO_V1(rational_recv);
+PG_FUNCTION_INFO_V1(rational_create);
+PG_FUNCTION_INFO_V1(rational_in_integer);
+PG_FUNCTION_INFO_V1(rational_send);
+
+Datum
+rational_in(PG_FUNCTION_ARGS)
+{
+	char	   *s = PG_GETARG_CSTRING(0),
+			   *after;
+	long long	n,
+d;
+
+	if (!isdigit(*s) && *s != '-')
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("Missing or invalid numerator")));
+
+	n = strtoll(s, &after, 10);
+
+	if (*after == '\0')
+	{
+		/* if just a number and no slash, interpret as an int */
+		d = 1;
+	}
+	else
+	{
+		/* otherwise look for denominator */
+		if (*after != '/')
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+	 errmsg("Expecting '/' after number but found '%c'", *after)));
+		if (*(++after) == '\0')
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+	 errmsg("Expecting value after '/' but got '\\0'")));
+
+		d = strtoll(after, &after, 10);
+		if (*after != '\0')
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+	 errmsg("Expecting '\\0' but found '%c'", *after)));
+	}
+	PG_RETURN_POINTER(create(n, d));
+}
+
+/*
+  This function taken from John Kennedy's paper, "Algorithm To Convert a
+  Decimal to a Fraction." Translated from Pascal.
+*/
+Datum
+rational_in_float(PG_FUNCTION_ARGS)
+{
+	float8		target = PG_GETARG_FLOAT8(0),
+z,
+fnumer,
+fdenom,
+error;
+	int32		prev_denom,
+sign;
+	Rational   *result = palloc(sizeof(Rational));
+
+	if (target == (int32) target)
+	{
+		result->numer = (int32) target;
+		result->denom = 1;
+		PG_RETURN_POINTER(result);
+	}
+
+	sign = target < 0.0 ? -1 : 1;
+	target = fabs(target);
+
+	if (!(target <= INT32_MAX))
+	{			/* also excludes NaN's */
+		ereport(ERROR,
+(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+		

Re: Does recovery write to backup_label ?

2020-02-07 Thread Fujii Masao
On Sat, Feb 8, 2020 at 5:05 AM Chapman Flack  wrote:
>
> On 2/7/20 2:55 PM, Andres Freund wrote:
>
> >> If the file needs to have 0600 permissions, should there be
> >> a note in the nonexclusive-mode backup docs to say so?
> >
> > I'm not convinced that that's useful. The default is that everything
> > needs to be writable by postgres. The exceptions should be noted if
> > anything, not the default.

+1

> Could this arguably be a special case, as most things in the datadir
> are put there by postgres, but the backup_label is now to be put there
> (and not even 'there' there, but added as a final step only to a
> 'backup-copy-of-there' there) by the poor schmuck who reads the
> non-exclusive backup docs as saying "retrieve this content from
> pg_stop_backup() and preserve in a file named backup_label" and can't
> think of any obvious reason to put write permission on a file
> that preserves immutable history in a backup?

I have no strong objection to adding more notes about the permissions
of the files that users put in the data directory. If we really
do that, it'd be better to cover not only backup_label
but also other files like tablespace_map, recovery.signal,
the promotion trigger file, etc.

Regards,

-- 
Fujii Masao




Postgres 32 bits client compilation fail. Win32 bits client is supported?

2020-02-07 Thread Ranier Vilela
Hi,
I am migrating my applications that use the postgres client from msvc 2010
(32 bits) to msvc 2019 (32 bits).
Compilation using msvc 2019 (64 bits) works very well.
But the build using msvc 2019 (32 bits) is not working.
For 32 bits, the Platform variable is set to x86, resulting in the first error:

"Project" C: \ dll \ postgres \ pgsql.sln "on node 1 (default targets).
C: \ dll \ postgres \ pgsql.sln.metaproj: error MSB4126: The specified
solution configuration "Release | x86" is invalid. Plea
if specify a valid solution configuration using the Configuration and
Platform properties (e.g. MSBuild.exe Solution.sl
n / p: Configuration = Debug / p: Platform = "Any CPU") or leave those
properties blank to use the default solution configurati
on. [C: \ dll \ postgres \ pgsql.sln]
Done Building Project "C: \ dll \ postgres \ pgsql.sln" (default targets) -
FAILED. "

This is because the DeterminePlatform sub of the Solution.pm
program uses the following expression:
my $output = `cl /? 2>&1`;
The result with msvc 2019 (32 bits) is:
"Microsoft (R) C/C++ Optimizing Compiler Version 19.24.28315 for x86"

By setting the Platform variable manually to Win32, the compilation process
continues until it stops at:
"Generating configuration headers ..."

Where the second error occurs:
unused defines: HAVE_STRUCT_CMSGCRED
USE_NAMED_POSI ... etc ...
ALIGNOF_DOUBLE USE_DEV_URANDOM at C:\dll\postgres\src\tools\msvc
/Mkvcbuild.pm line 842.

Question:
Will Postgres continue to support 32-bit clients?

regards,
Ranier Vilela


Re: Implementing Incremental View Maintenance

2020-02-07 Thread nuko yokohama
Hi.

UNION query problem (server crash).

When creating an INCREMENTAL MATERIALIZED VIEW,
the server process crashes if you specify a query with a UNION.

(commit id = 23151be7be8d8f8f9c35c2d0e4e5353aedf2b31e)

execute log.

```
[ec2-user@ip-10-0-1-10 ivm]$ psql testdb -e -f union_query_crash.sql
DROP TABLE IF EXISTS table_x CASCADE;
psql:union_query_crash.sql:6: NOTICE:  drop cascades to view xy_union_v
DROP TABLE
DROP TABLE IF EXISTS table_y CASCADE;
DROP TABLE
CREATE TABLE table_x (id int, data numeric);
CREATE TABLE
CREATE TABLE table_y (id int, data numeric);
CREATE TABLE
INSERT INTO table_x VALUES (generate_series(1, 3), random()::numeric);
INSERT 0 3
INSERT INTO table_y VALUES (generate_series(1, 3), random()::numeric);
INSERT 0 3
SELECT * FROM table_x;
 id |        data
----+--------------------
  1 |  0.950724735058774
  2 | 0.0222670808201144
  3 |  0.391258547114841
(3 rows)

SELECT * FROM table_y;
 id |        data
----+--------------------
  1 |  0.991717347778337
  2 | 0.0528458947672874
  3 |  0.965044982911163
(3 rows)

CREATE VIEW xy_union_v AS
SELECT 'table_x' AS name, * FROM table_x
UNION
SELECT 'table_y' AS name, * FROM table_y
;
CREATE VIEW
TABLE xy_union_v;
  name   | id |        data
---------+----+--------------------
 table_y |  2 | 0.0528458947672874
 table_x |  2 | 0.0222670808201144
 table_y |  3 |  0.965044982911163
 table_x |  1 |  0.950724735058774
 table_x |  3 |  0.391258547114841
 table_y |  1 |  0.991717347778337
(6 rows)

CREATE INCREMENTAL MATERIALIZED VIEW xy_imv AS
SELECT 'table_x' AS name, * FROM table_x
UNION
SELECT 'table_y' AS name, * FROM table_y
;
psql:union_query_crash.sql:28: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:union_query_crash.sql:28: fatal: connection to server was lost
```

On Thu, 27 Dec 2018 at 21:57, Yugo Nagata  wrote:

> Hi,
>
> I would like to implement Incremental View Maintenance (IVM) on
> PostgreSQL.
> IVM is a technique to maintain materialized views which computes and
> applies
> only the incremental changes to the materialized views rather than
> recomputing the contents as the current REFRESH command does.
>
> I had a presentation on our PoC implementation of IVM at PGConf.eu 2018
> [1].
> Our implementation uses row OIDs to compute deltas for materialized
> views.
> The basic idea is that if we have information about which rows in base
> tables
> are contributing to generate a certain row in a matview then we can
> identify
> the affected rows when a base table is updated. This is based on an idea of
> Dr. Masunaga [2] who is a member of our group and inspired from ID-based
> approach[3].
>
> In our implementation, the mapping of the row OIDs of the materialized view
> and the base tables are stored in "OID map". When a base relation is
> modified,
> AFTER trigger is executed and the delta is recorded in delta tables using
> the transition table feature. The actual update of the matview is triggered
> by the REFRESH command with the INCREMENTALLY option.
>
> However, we realize problems of our implementation. First, 

Re: Marking some contrib modules as trusted extensions

2020-02-07 Thread Tom Lane
After looking more closely at these modules, I'm kind of inclined
*not* to put the trusted marker on intagg.  That module is just
a backwards-compatibility wrapper around functionality that
exists in the core code nowadays.  So I think what we ought to be
doing with it is deprecating and eventually removing it, not
encouraging people to keep using it.

Given that and the other discussion in this thread, I think the
initial list of modules to trust is:

btree_gin
btree_gist
citext
cube
dict_int
earthdistance
fuzzystrmatch
hstore
hstore_plperl
intarray
isn
jsonb_plperl
lo
ltree
pg_trgm
pgcrypto
seg
tablefunc
tcn
tsm_system_rows
tsm_system_time
unaccent
uuid-ossp

So attached is a patch to do that.  The code changes are trivial; just
add "trusted = true" to each control file.  We don't need to bump the
module version numbers, since this doesn't change the contents of any
extension, just who can install it.  I do not think any regression
test changes are needed either.  (Note that commit 50fc694e4 already
added a test that trusted extensions behave as expected, see
src/pl/plperl/sql/plperl_setup.sql.)  So it seems like the only thing
that needs much discussion is the documentation changes.  I adjusted
contrib.sgml's discussion of how to install these modules in general,
and then labeled the individual modules if they are trusted.
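
(As a quick illustration of what the trusted marker changes, with
hypothetical database/role names:)

```
-- with "trusted = true" in hstore.control, a non-superuser who has CREATE
-- privilege on the database can install the extension without superuser help:
GRANT CREATE ON DATABASE mydb TO alice;
-- then, connected to mydb as alice:
CREATE EXTENSION hstore;
```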

regards, tom lane

diff --git a/contrib/btree_gin/btree_gin.control b/contrib/btree_gin/btree_gin.control
index d576da7..67d0c99 100644
--- a/contrib/btree_gin/btree_gin.control
+++ b/contrib/btree_gin/btree_gin.control
@@ -3,3 +3,4 @@ comment = 'support for indexing common datatypes in GIN'
 default_version = '1.3'
 module_pathname = '$libdir/btree_gin'
 relocatable = true
+trusted = true
diff --git a/contrib/btree_gist/btree_gist.control b/contrib/btree_gist/btree_gist.control
index 81c8509..cd2d7eb 100644
--- a/contrib/btree_gist/btree_gist.control
+++ b/contrib/btree_gist/btree_gist.control
@@ -3,3 +3,4 @@ comment = 'support for indexing common datatypes in GiST'
 default_version = '1.5'
 module_pathname = '$libdir/btree_gist'
 relocatable = true
+trusted = true
diff --git a/contrib/citext/citext.control b/contrib/citext/citext.control
index a872a3f..ccf4454 100644
--- a/contrib/citext/citext.control
+++ b/contrib/citext/citext.control
@@ -3,3 +3,4 @@ comment = 'data type for case-insensitive character strings'
 default_version = '1.6'
 module_pathname = '$libdir/citext'
 relocatable = true
+trusted = true
diff --git a/contrib/cube/cube.control b/contrib/cube/cube.control
index f39a838..3e238fc 100644
--- a/contrib/cube/cube.control
+++ b/contrib/cube/cube.control
@@ -3,3 +3,4 @@ comment = 'data type for multidimensional cubes'
 default_version = '1.4'
 module_pathname = '$libdir/cube'
 relocatable = true
+trusted = true
diff --git a/contrib/dict_int/dict_int.control b/contrib/dict_int/dict_int.control
index 6e2d2b3..ec04cce 100644
--- a/contrib/dict_int/dict_int.control
+++ b/contrib/dict_int/dict_int.control
@@ -3,3 +3,4 @@ comment = 'text search dictionary template for integers'
 default_version = '1.0'
 module_pathname = '$libdir/dict_int'
 relocatable = true
+trusted = true
diff --git a/contrib/earthdistance/earthdistance.control b/contrib/earthdistance/earthdistance.control
index 5816d22..3df666d 100644
--- a/contrib/earthdistance/earthdistance.control
+++ b/contrib/earthdistance/earthdistance.control
@@ -3,4 +3,5 @@ comment = 'calculate great-circle distances on the surface of the Earth'
 default_version = '1.1'
 module_pathname = '$libdir/earthdistance'
 relocatable = true
+trusted = true
 requires = 'cube'
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 6b2832a..3cd6660 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -3,3 +3,4 @@ comment = 'determine similarities and distance between strings'
 default_version = '1.1'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
+trusted = true
diff --git a/contrib/hstore/hstore.control b/contrib/hstore/hstore.control
index 93688cd..e0fbb8b 100644
--- a/contrib/hstore/hstore.control
+++ b/contrib/hstore/hstore.control
@@ -3,3 +3,4 @@ comment = 'data type for storing sets of (key, value) pairs'
 default_version = '1.6'
 module_pathname = '$libdir/hstore'
 relocatable = true
+trusted = true
diff --git a/contrib/hstore_plperl/hstore_plperl.control b/contrib/hstore_plperl/hstore_plperl.control
index 16277f6..4b9fd13 100644
--- a/contrib/hstore_plperl/hstore_plperl.control
+++ b/contrib/hstore_plperl/hstore_plperl.control
@@ -3,4 +3,5 @@ comment = 'transform between hstore and plperl'
 default_version = '1.0'
 module_pathname = '$libdir/hstore_plperl'
 relocatable = true
+trusted = true
 requires = 'hstore,plperl'
diff --git a/contrib/intarray/intarray.control b/contrib/intarray/intarray.control
index 7e50cc3..bf28804 100644
--- a/contrib/intarray/intarray.control
+++ 

Re: error context for vacuum to include block number

2020-02-07 Thread Justin Pryzby
On Tue, Feb 04, 2020 at 01:58:20PM +0900, Masahiko Sawada wrote:
> Here is the comment for v16 patch:
> 
> 2.
> I think we can set the error context for heap scan again after
> freespace map vacuum finishing, maybe after reporting the new phase.
> Otherwise the user will get confused if an error occurs during
> freespace map vacuum. And I think the comment is unclear, how about
> "Set the error context fro heap scan again"?

Good point

> 3.
> +   if (cbarg->blkno!=InvalidBlockNumber)
> +   errcontext(_("while scanning block %u of relation \"%s.%s\""),
> +   cbarg->blkno, cbarg->relnamespace, cbarg->relname);
> 
> We can use BlockNumberIsValid macro instead.

Thanks.  See attached, now squished together patches.

I added functions to initialize the callbacks, so error handling is out of the
way and minimally distracts from the rest of vacuum.
>From 95265412c56f3b308eed16531d7c83243e278f4f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 12 Dec 2019 20:54:37 -0600
Subject: [PATCH v17 1/2] vacuum errcontext to show block being processed

As requested here.
https://www.postgresql.org/message-id/20190807235154.erbmr4o4bo6vgnjv%40alap3.anarazel.de
---
 src/backend/access/heap/vacuumlazy.c | 117 +++
 1 file changed, 117 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a23cdef..9358ab4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -292,6 +292,14 @@ typedef struct LVRelStats
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+typedef struct
+{
+	char 		*relnamespace;
+	char		*relname;
+	char 		*indname; /* undefined while not processing index */
+	BlockNumber blkno;	/* undefined while not processing heap */
+	int			phase;	/* Reusing same enums as for progress reporting */
+} vacuum_error_callback_arg;
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -361,6 +369,7 @@ static void end_parallel_vacuum(Relation *Irel, IndexBulkDeleteResult **stats,
 LVParallelState *lps, int nindexes);
 static LVSharedIndStats *get_indstats(LVShared *lvshared, int n);
 static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared);
+static void vacuum_error_callback(void *arg);
 
 
 /*
@@ -724,6 +733,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	ErrorContextCallback errcallback;
+	vacuum_error_callback_arg errcbarg;
 
 	pg_rusage_init();
 
@@ -870,6 +881,17 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	else
 		skipping_blocks = false;
 
+	/* Setup error traceback support for ereport() */
+	errcbarg.relnamespace = get_namespace_name(RelationGetNamespace(onerel));
+	errcbarg.relname = relname;
+	errcbarg.blkno = InvalidBlockNumber; /* Not known yet */
+	errcbarg.phase = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
+
+	errcallback.callback = vacuum_error_callback;
+	errcallback.arg = (void *) &errcbarg;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
 		Buffer		buf;
@@ -891,6 +913,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
 
+		errcbarg.blkno = blkno;
+
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 		if (blkno == next_unskippable_block)
@@ -987,6 +1011,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 vmbuffer = InvalidBuffer;
 			}
 
+			/* Pop the error context stack */
+			error_context_stack = errcallback.previous;
+
 			/* Work on all the indexes, then the heap */
 			lazy_vacuum_all_indexes(onerel, Irel, indstats,
 	vacrelstats, lps, nindexes);
@@ -1011,6 +1038,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* Report that we are once again scanning the heap */
 			pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 		 PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+			/* Set the error context while continuing heap scan */
+			error_context_stack = &errcallback;
 		}
 
 		/*
@@ -1597,6 +1627,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
 	}
 
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
 	/* report that everything is scanned and vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
@@ -1772,11 +1805,26 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 	int			npages;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
+	ErrorContextCallback errcallback;
+	vacuum_error_callback_arg errcbarg;
 
 	/* Report that we are now vacuuming the heap */
 	

Re: Adding a test for speculative insert abort case

2020-02-07 Thread Andres Freund
Hi,

I'm currently fighting with a race I'm observing in about 1/4 of the
runs. I get:
@@ -361,16 +361,18 @@
 locktype   mode   granted

 speculative tokenShareLock  f
 speculative tokenExclusiveLock  t
 step controller_unlock_2_4: SELECT pg_advisory_unlock(2, 4);
 pg_advisory_unlock

 t
 s1: NOTICE:  blurt_and_lock_123() called for k1 in session 1
 s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock_123() called for k1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
 step s1_upsert: <... completed>
 step s2_upsert: <... completed>
 step controller_show: SELECT * FROM upserttest;
 keydata

 k1 inserted s2 with conflict update s1
(this is the last permutation)

The issue is basically that s1 goes through the
check_exclusion_or_unique_constraint() conflict check twice.

I added a bit of debugging information (using fprintf, with elog it was
much harder to hit):

Success:
2020-02-07 16:14:56.501 PST [1167003][5/1254:8465] CONTEXT:  PL/pgSQL function 
blurt_and_lock_123(text) line 7 at RAISE
1167003: acquiring xact lock
2020-02-07 16:14:56.512 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.522 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
1167002: releasing xact lock 2 3
1167004: acquired xact lock
1167004: xid 8466 acquiring 5
2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] LOG:  INSERT @ 0/9CC70F0:  - 
Heap/INSERT: off 2 flags 0x0C
2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] NOTICE:  
blurt_and_lock_123() called for k1 in session 2
2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] CONTEXT:  PL/pgSQL function 
blurt_and_lock_123(text) line 3 at RAISE
2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] NOTICE:  acquiring advisory 
lock on 2
2020-02-07 16:14:56.523 PST [1167004][6/1014:8466] CONTEXT:  PL/pgSQL function 
blurt_and_lock_123(text) line 7 at RAISE
1167004: acquiring xact lock
2020-02-07 16:14:56.533 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.544 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.555 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.565 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.576 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.587 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
1167002: releasing xact lock 2 2
1167004: acquired xact lock
2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] LOG:  INSERT @ 0/9CC7150:  - 
Btree/NEWROOT: lev 0
2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] LOG:  INSERT @ 0/9CC7190:  - 
Btree/INSERT_LEAF: off 1
2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] NOTICE:  blurt_and_lock_4() 
called for k1 in session 2
2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] CONTEXT:  PL/pgSQL function 
blurt_and_lock_4(text) line 3 at RAISE
2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] NOTICE:  acquiring advisory 
lock on 4
2020-02-07 16:14:56.588 PST [1167004][6/1014:8466] CONTEXT:  PL/pgSQL function 
blurt_and_lock_4(text) line 4 at RAISE
1167004: acquiring xact lock
2020-02-07 16:14:56.598 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.609 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.620 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
2020-02-07 16:14:56.630 PST [1167001][3/0:0] DEBUG:  bind  to 
isolationtester_waiting
1167002: releasing xact lock 1 2
1167003: acquired xact lock
2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] LOG:  INSERT @ 0/9CC71D0:  - 
Btree/INSERT_LEAF: off 1
2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] NOTICE:  blurt_and_lock_4() 
called for k1 in session 1
2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] CONTEXT:  PL/pgSQL function 
blurt_and_lock_4(text) line 3 at RAISE
2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] NOTICE:  acquiring advisory 
lock on 4
2020-02-07 16:14:56.631 PST [1167003][5/1254:8465] CONTEXT:  PL/pgSQL function 
blurt_and_lock_4(text) line 4 at RAISE
1167003: acquiring xact lock
1167003: acquired xact lock
2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] LOG:  INSERT @ 0/9CC7230:  - 
Btree/NEWROOT: lev 0
2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] LOG:  INSERT @ 0/9CC7270:  - 
Btree/INSERT_LEAF: off 1
2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] LOG:  INSERT @ 0/9CC72A8:  - 
Heap/DELETE: off 1 flags 0x08
1167003: xid 8465 releasing lock 5
1167003: retry due to conflict
2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] NOTICE:  
blurt_and_lock_123() called for k1 in session 1
2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] CONTEXT:  PL/pgSQL function 
blurt_and_lock_123(text) line 3 at RAISE
2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] NOTICE:  acquiring advisory 
lock on 2
2020-02-07 16:14:56.632 PST [1167003][5/1254:8465] CONTEXT:  PL/pgSQL function 

Re: Draft release notes are up for review

2020-02-07 Thread Tom Lane
Daniel Gustafsson  writes:
> My name is misspelled in master 7d0bcb047: s/Gustaffson/Gustafsson/.  Nothing else
> stood out from skimming the diff.

Ah, looks like I copied and pasted that from the commit log.  Sorry,
will fix in next draft.

regards, tom lane




Re: Dumping/restoring fails on inherited generated column

2020-02-07 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-02-03 20:32, Tom Lane wrote:
>> This is showing us at least two distinct problems.  Now as for
>> "gtest30_1", what we have is that in the parent table "gtest30", column b
>> exists but it has no default; the generated property is only added
>> at the child table gtest30_1.  So we need to emit ALTER COLUMN SET
>> GENERATED ALWAYS for gtest30_1.b.  HEAD is already doing the wrong
>> thing there (it's emitting the expression, but as a plain default
>> not GENERATED).  And this patch makes it emit nothing, even worse.
>> I think the key point here is that "attislocal" refers to whether the
>> column itself is locally defined, not to whether its default is.

> This is a bit of a mess.  Let me explain my thinking on generated 
> columns versus inheritance.

> If a parent table has a generated column, then any inherited column must 
> also be generated and use the same expression.  (Otherwise querying the 
> parent table would produce results that are inconsistent with the 
> generation expression if the rows come from the child table.)

Check.

> If a parent table has a column that is not generated, then I think it 
> would be semantically sound if a child table had that same column but 
> generated.  However, I think it would be very tricky to support this 
> correctly, and it doesn't seem useful enough, so I'd rather not do it.

So ... why is that so hard exactly?  AFAICS, the existing regression
test cases show that it works fine.  Except that pg_dump gets it wrong.
In general, we surely want to support child columns that have defaults
different from the parent column's default, so this doesn't seem quite
that huge a leap to me.

> That's what the gtest30_1 case above shows.  It's not even clear whether 
> it's possible to dump this correctly in all cases because the syntax 
> that you allude to "turn this existing column into a generated column" 
> does not exist.

I'm a little confused by that statement.  What is this doing, if not
that:

regression=# create table foo (f1 int not null);
CREATE TABLE
regression=# alter table foo alter column f1 add generated always as identity;
ALTER TABLE
regression=# \d foo
   Table "public.foo"
 Column |  Type   | Collation | Nullable |           Default
--------+---------+-----------+----------+------------------------------
 f1     | integer |           | not null | generated always as identity

If we didn't have things like ALTER ... SET GENERATED and
ALTER ... DROP EXPRESSION, I'd be a lot more content to accept
the position that generated-ness is an immutable column property.
But we do have those things, so the restriction you're proposing
seems mighty arbitrary.

regards, tom lane




Re: Draft release notes are up for review

2020-02-07 Thread Daniel Gustafsson
> On 7 Feb 2020, at 22:56, Tom Lane  wrote:

> Please send comments before Sunday.

My name is misspelled in master 7d0bcb047: s/Gustaffson/Gustafsson/.  Nothing else
stood out from skimming the diff.

cheers ./daniel



Draft release notes are up for review

2020-02-07 Thread Tom Lane
See

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=388d4351f78dfa6082922074127d496e6f525033

(Note: these cover through 9710d3d4a, but not the additional partitioning
fix I see Alvaro just pushed.)

Please send comments before Sunday.

regards, tom lane




Re: improve transparency of bitmap-only heap scans

2020-02-07 Thread Tomas Vondra

On Fri, Feb 07, 2020 at 03:22:12PM +, Alexey Bashtanov wrote:

Hello,


It took me a while to figure out what those names mean.  "unfetched",
as you call it on the code, may be more descriptive than "avoided" for
the new label.  However I think the other two are more confusing.  It
may be a good idea to change them together with this.

It'll be sad if this patch is forgotten only because of the choice of words.
I've changed it all to "unfetched", at least so as not to call the same
thing differently in the code and in the output, and also rebased it and
made it fit the 80-character line width limit.




I kinda suspect one of the reasons why this got so little attention is
that it was never added to any CF.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Reducing WaitEventSet syscall churn

2020-02-07 Thread Thomas Munro
On Sat, Feb 8, 2020 at 10:00 AM Thomas Munro  wrote:
> On Tue, Jan 21, 2020 at 1:45 PM Thomas Munro  wrote:
> > Here are some patches to get rid of frequent system calls.
>
> Here is one more case that I was sitting on because I wasn't sure how
> to do it: walreceiver.c.  To make that work, libpq needs to be able to
> tell you when the socket has changed, which I did with a counter that
> is exposed to client code in patch 0004.  The walreceiver change in
> 0005 works (trace the system calls on walreceiver to see the
> difference), but perhaps we can come up with a better way to code it
> so that eg logical/worker.c doesn't finish up duplicating the logic.
> Thoughts?

(To be clear: I know the 0005 patch doesn't clean up after itself in
various cases, it's for discussion only to see if others have ideas
about how to structure things to suit various potential users of
libpqwalreceiver.so.)




Re: pg_basebackup and snapshots

2020-02-07 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2020-02-07 14:56:47 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > Maybe that's looking too far into the future, but I'd like to see
> > > improvements to pg_basebackup that make it integrate with root requiring
> > > tooling, to do more efficient base backups. E.g. having pg_basebackup
> > > handle start/stop backup and WAL handling, but do the actual backup of
> > > the data via a snapshot mechanism (yes, one needs start/stop backup in
> > > the general case, for multiple FSs), would be nice.
> > 
> > The challenge with this approach is that you need to drop the 'backup
> > label' file into place as part of this operation, either by putting it
> > into the snapshot after it's been taken, or by putting it into the data
> > directory at restore time.  Of course, you have to keep track of WAL
> > anyway from the time the snapshots are taken until the restore is done,
> > so it's certainly possible, as with all of this, it's just somewhat
> > complicated.
> 
> It's not dead trivial, but also doesn't seem *that* hard to me compared
> to the other challenges of adding features like this?  How to best
> approach it I think depends somewhat on what exact type of backup
> (mainly whether to set up a new system or to make a PITR base backup)
> we'd want to focus on. And what kind of snapshotting system / what kind
> of target data store.

I'm also not sure that pg_basebackup is the right tool for this though,
really, given the complications and how it's somewhat beyond what
pg_basebackup's mandate is.  This isn't something you'd like do
remotely, for example, due to the need to take the snapshot, mount the
snapshot, etc.  I don't see this as really in line with "just another
option to -F", there'd be a fair bit of configuring, it seems, and a
good deal of what pg_basebackup would really be doing with this feature
is just running bits of code the user has given us, except for the
actual calls to PG to do start/stop backup.

> Plenty of snapshotting systems allow write access to the snapshot once
> it finished, so that's one way one can deal with that. I have a hard
> time believing that it'd be hard to have pg_basebackup delay writing the
> backup label in that case.  The WAL part would probably be harder, since
> there we want to start writing before the snapshot is done. And copying
> all the WAL at the end isn't enticing either.

pg_basebackup already delays writing out the backup label until the end.

But, yes, there's also timing issues to deal with, which are complicated
because there isn't just a syscall we can use to say "take a snapshot
for us" or to say "mount this snapshot over here" (at least, not in any
kind of portable way, even in places where such things do exist).  Maybe
we could have shell commands that a user provides for "take a snapshot"
and "mount this snapshot", but putting all of that on the user has its
own drawbacks (more on that below..).

> For the PITR base backup case it'd definitely be nice to support writing
> (potentially with callbacks instead of implementing all of them in core)
> into $cloud_provider's blob store, without having to transfer all data
> first through a replication connection and then again to the blob store
> (and without manually implementing non-exclusive base backup). Adding
> WAL after the fact to the same blob isn't really a thing for anything like
> that (obviously - even if one can hack it by storing tars etc).

We seem to be mixing things now..  You've moved into talking about 'blob
stores' which are rather different from snapshots, no?  I certainly agree
with the general idea of supporting blob stores (pgbackrest has
supported s3 for quite some time, with a nicely pluggable architecture
that we'll be using to write drivers for other blob storage, all in very
well tested C99 code, and it's all done directly, if you want, without
going over the network in some other way first..).

I don't really care for the idea of using callbacks for this, at least
if what you mean by "callback" is "something like archive_command".
There's a lot of potential failure cases and issues, writing to most s3
stores requires retries, and getting it all to work right when you're
going through a shell to run some other command to actually get the data
across safely and durably is, ultimately, a bit of a mess.  I feel like
we should be learning from the mess that is archive_command and avoiding
anything like that if at all possible when it comes to moving data
around that needs to be confirmed durably written.  Making users have to
piece together the bits to make it work just isn't a good idea either
(see, again, archive command, and our own documentation for why that's a
bad idea...).

> Wonder if the the WAL part in particular would actually be best solved
> by having recovery probe more than one WAL directory when looking for
> WAL segments (i.e. doing so before switching methods). Much 

Re: Reducing WaitEventSet syscall churn

2020-02-07 Thread Thomas Munro
On Tue, Jan 21, 2020 at 1:45 PM Thomas Munro  wrote:
> Here are some patches to get rid of frequent system calls.

Here is one more case that I was sitting on because I wasn't sure how
to do it: walreceiver.c.  To make that work, libpq needs to be able to
tell you when the socket has changed, which I did with a counter that
is exposed to client code in patch 0004.  The walreceiver change in
0005 works (trace the system calls on walreceiver to see the
difference), but perhaps we can come up with a better way to code it
so that eg logical/worker.c doesn't finish up duplicating the logic.
Thoughts?


0001-Add-WaitMyLatch-to-replace-many-WaitLatch-calls.patch
Description: Binary data


0002-Use-WaitMyLatch-for-condition-variables.patch
Description: Binary data


0003-Introduce-a-reusable-WaitEventSet-for-the-stats-coll.patch
Description: Binary data


0004-libpq-Add-PQsocketChangeCount-to-advertise-socket-ch.patch
Description: Binary data


0005-Reuse-a-WaitEventSet-in-walreceiver.patch
Description: Binary data


Re: pg_basebackup and snapshots

2020-02-07 Thread Andres Freund
Hi,

On 2020-02-07 14:56:47 -0500, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > Maybe that's looking too far into the future, but I'd like to see
> > improvements to pg_basebackup that make it integrate with root requiring
> > tooling, to do more efficient base backups. E.g. having pg_basebackup
> > handle start/stop backup and WAL handling, but do the actual backup of
> > the data via a snapshot mechanism (yes, one needs start/stop backup in
> > the general case, for multiple FSs), would be nice.
> 
> The challenge with this approach is that you need to drop the 'backup
> label' file into place as part of this operation, either by putting it
> into the snapshot after it's been taken, or by putting it into the data
> directory at restore time.  Of course, you have to keep track of WAL
> anyway from the time the snapshots are taken until the restore is done,
> so it's certainly possible, as with all of this, it's just somewhat
> complicated.

It's not dead trivial, but also doesn't seem *that* hard to me compared
to the other challenges of adding features like this?  How to best
approach it I think depends somewhat on what exact type of backup
(mainly whether to set up a new system or to make a PITR base backup)
we'd want to focus on. And what kind of snapshotting system / what kind
of target data store.

Plenty of snapshotting systems allow write access to the snapshot once
it finished, so that's one way one can deal with that. I have a hard
time believing that it'd be hard to have pg_basebackup delay writing the
backup label in that case.  The WAL part would probably be harder, since
there we want to start writing before the snapshot is done. And copying
all the WAL at the end isn't enticing either.

For the PITR base backup case it'd definitely be nice to support writing
(potentially with callbacks instead of implementing all of them in core)
into $cloud_provider's blob store, without having to transfer all data
first through a replication connection and then again to the blob store
(and without manually implementing non-exclusive base backup). Adding
WAL after the fact to the same blob isn't really a thing for anything like
that (obviously - even if one can hack it by storing tars etc).

Wonder if the the WAL part in particular would actually be best solved
by having recovery probe more than one WAL directory when looking for
WAL segments (i.e. doing so before switching methods). Much faster than
using restore_command, and what one really wants in a pretty decent
number of cases. And it'd allow to just restore the base backup
(e.g. mount [copy of] the snapshot) and the received WAL stream
separately, without needing more complicated orchestration.


Perhaps I am also answering something completely besides what you were
wondering about?

Greetings,

Andres Freund




Re: Does recovery write to backup_label ?

2020-02-07 Thread Chapman Flack
On 2/7/20 2:55 PM, Andres Freund wrote:

>> If the file needs to have 0600 permissions, should there be
>> a note in the nonexclusive-mode backup docs to say so?
> 
> I'm not convinced that that's useful. The default is that everything
> needs to be writable by postgres. The exceptions should be noted if
> anything, not the default.

Could this arguably be a special case, as most things in the datadir
are put there by postgres, but the backup_label is now to be put there
(and not even 'there' there, but added as a final step only to a
'backup-copy-of-there' there) by the poor schmuck who reads the
non-exclusive backup docs as saying "retrieve this content from
pg_stop_backup() and preserve in a file named backup_label" and can't
think of any obvious reason to put write permission on a file
that preserves immutable history in a backup?

Regards,
-Chap




pg_basebackup and snapshots

2020-02-07 Thread Stephen Frost
Greetings,

(Moving to -hackers, changing thread title)

* Andres Freund (and...@anarazel.de) wrote:
> Maybe that's looking too far into the future, but I'd like to see
> improvements to pg_basebackup that make it integrate with root requiring
> tooling, to do more efficient base backups. E.g. having pg_basebackup
> handle start/stop backup and WAL handling, but do the actual backup of
> the data via a snapshot mechanism (yes, one needs start/stop backup in
> the general case, for multiple FSs), would be nice.

The challenge with this approach is that you need to drop the 'backup
label' file into place as part of this operation, either by putting it
into the snapshot after it's been taken, or by putting it into the data
directory at restore time.  Of course, you have to keep track of WAL
anyway from the time the snapshots are taken until the restore is done,
so it's certainly possible, as with all of this, it's just somewhat
complicated.

Certainly open to ideas on how to improve this.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Does recovery write to backup_label ?

2020-02-07 Thread Andres Freund
Hi,

On 2020-02-07 11:08:48 -0500, Chapman Flack wrote:
> Just saw this in a PG 11.6 cluster starting a recovery:
> 
> 2020-02-07 10:45:40 EST FATAL:  42501: could not open file
> "backup_label": Permission denied
> 2020-02-07 10:45:40 EST LOCATION:  fsync_fname_ext, fd.c:3531

Well, we generally assume that files in the data directory are writable,
with a very few exceptions. And we do need to be able to rename
backup_label to backup_label.old. Which strictly speaking doesn't
require write permissions on the file - but I assume that's what
triggers the issue here. There are IIRC platforms that don't like fsyncing
files with an O_RDONLY fd, so if we want to rename safely (which entails
fsyncing beforehand), we don't have much choice.


> I had assumed the label file would be treated as readonly
> during recovery.

It is IIRC documented that it does get renamed...

> If the file needs to have 0600 permissions, should there be
> a note in the nonexclusive-mode backup docs to say so?

I'm not convinced that that's useful. The default is that everything
needs to be writable by postgres. The exceptions should be noted if
anything, not the default.

Greetings,

Andres Freund




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-07 Thread Andres Freund
Hi,

On 2020-02-07 20:02:01 +0100, Tomas Vondra wrote:
> On Fri, Feb 07, 2020 at 10:33:48AM -0800, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> > > And, the issue got reproduced with the same error:
> > > WARNING:  problem in Generation Tuples: number of free chunks 0 in
> > > block 0x7fe9e9e74010 exceeds 1018 allocated
> > 
> > That seems like a problem in generation.c - because this should be
> > unreachable, I think?

> That's rather strange. How could we print this message? The code looks
> like this
> 
>   if (block->nfree >= block->nchunks)
> elog(WARNING, "problem in Generation %s: number of free chunks %d in 
> block %p exceeds %d allocated",
>  name, block->nfree, block, block->nchunks);
> 
> so this says 0 >= 1018. Or am I missing something?

Indeed, it's pretty weird. I can't reproduce it either. Kuntal, which
exact git version did you repro this on? What precise settings /
platform?

Greetings,

Andres Freund




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-07 Thread Andres Freund
Hi,

On 2020-01-11 23:20:56 -0500, Tom Lane wrote:
> Tomas Vondra  writes:
> Nah, don't think I believe that: the test inserts a bunch of tuples,
> but they look like they will all be *exactly* the same size.
>
> CREATE TABLE decoding_test(x integer, y text);
> ...
>
> FOR i IN 1..10 LOOP
> BEGIN
> INSERT INTO decoding_test(x) SELECT generate_series(1,5000);
> EXCEPTION
> when division_by_zero then perform 'dummy';
> END;

I think the reason this triggers higher memory usage in older versions
is that, before

commit cec2edfa7859279f36d2374770ca920c59c73dd8
Author: Amit Kapila 
Date:   2019-11-16 17:49:33 +0530

Add logical_decoding_work_mem to limit ReorderBuffer memory usage.

we enforced how many changes to keep in memory (vs on disk)

/*
 * Maximum number of changes kept in memory, per transaction. After that,
 * changes are spooled to disk.
 *
 * The current value should be sufficient to decode the entire transaction
 * without hitting disk in OLTP workloads, while starting to spool to disk in
 * other workloads reasonably fast.
 *
 * At some point in the future it probably makes sense to have a more elaborate
 * resource management here, but it's not entirely clear what that would look
 * like.
 */
static const Size max_changes_in_memory = 4096;

on a per-transaction basis. And that subtransactions are *different*
transactions for that purpose (as they can be rolled back
separately). As the test generates loads of records for different
subtransactions, they each end up having quite a few changes (including
the tuples pointed to!) in memory at the same time.

Due to the way the limit of 4096 interacts with the 5000 rows inserted
above, we only hit the out of memory error when loading. That's because
when decoding (before the commit has been seen), we spill after 4096 changes:

2020-02-07 11:18:22.399 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
585 to disk
2020-02-07 11:18:22.419 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
586 to disk
2020-02-07 11:18:22.431 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
587 to disk
2020-02-07 11:18:22.443 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
588 to disk
2020-02-07 11:18:22.454 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
589 to disk
2020-02-07 11:18:22.465 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
590 to disk
2020-02-07 11:18:22.477 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
591 to disk
2020-02-07 11:18:22.488 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
592 to disk
2020-02-07 11:18:22.499 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
593 to disk
2020-02-07 11:18:22.511 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 
594 to disk

so there are 5000 - 4096 changes in memory for each, times 10. But when
actually calling the output plugin (at the commit record), we start with
loading changes back into memory from the start of each
subtransaction. That first entails spilling the tail of that transaction
to disk, and then loading the start:

2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  StartSubTransaction(1) name: 
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  StartSubTransaction(2) name: 
replay; blockState: SUB BEGIN; state: INPROGR, xid/subid/cid: 0/2/0
2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 585 
to disk
2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 
585 into memory
2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 586 
to disk
2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 
586 into memory
2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 587 
to disk
2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 
587 into memory
2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 588 
to disk
2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 
588 into memory
2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 589 
to disk
TopMemoryContext: 161440 total in 7 blocks; 80240 free (68 chunks); 81200 used
...

Because each subtransaction now has 4096 changes in memory, we actually need
more memory here than we did during the decoding phase, where all but
the "current" subtransaction had only 5000 - 4096 changes in memory.

If we instead change the test to insert 4096*2 - 1 tuples each, we run
out of memory earlier:
2020-02-07 11:23:20.540 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 
610 to disk
2020-02-07 11:23:20.565 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 
611 to disk
2020-02-07 11:23:20.587 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 
612 to disk
2020-02-07 11:23:20.608 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 
613 to disk
2020-02-07 11:23:20.630 PST 

Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-07 Thread Tomas Vondra

On Fri, Feb 07, 2020 at 10:33:48AM -0800, Andres Freund wrote:

Hi,

On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:

And, the issue got reproduced with the same error:
WARNING:  problem in Generation Tuples: number of free chunks 0 in
block 0x7fe9e9e74010 exceeds 1018 allocated


That seems like a problem in generation.c - because this should be
unreachable, I think?

Tomas?



That's rather strange. How could we print this message? The code looks
like this

  if (block->nfree >= block->nchunks)
      elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p exceeds %d allocated",
           name, block->nfree, block, block->nchunks);

so this says 0 >= 1018. Or am I missing something?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-07 Thread Mahendra Singh Thalor
On Thu, 6 Feb 2020 at 09:44, Amit Kapila  wrote:
>
> On Thu, Feb 6, 2020 at 1:57 AM Mahendra Singh Thalor 
wrote:
> >
> > On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada 
wrote:
> > >
> > > On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila 
wrote:
> > > >
> > > > On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <
sawada.m...@gmail.com> wrote:
> > > > >
> > > > > On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <
robertmh...@gmail.com> wrote:
> > > > > > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <
and...@anarazel.de> wrote:
> > > > > >>> I think the real question is whether the scenario is common
enough to
> > > > > >>> worry about.  In practice, you'd have to be extremely unlucky
to be
> > > > > >>> doing many bulk loads at the same time that all happened to
hash to
> > > > > >>> the same bucket.
> > > > > >>
> > > > > >> With a bunch of parallel bulkloads into partitioned tables
that really
> > > > > >> doesn't seem that unlikely?
> > > > > >
> > > > > > It increases the likelihood of collisions, but probably
decreases the
> > > > > > number of cases where the contention gets really bad.
> > > > > >
> > > > > > For example, suppose each table has 100 partitions and you are
> > > > > > bulk-loading 10 of them at a time.  It's virtually certain that
you
> > > > > > will have some collisions, but the amount of contention within
each
> > > > > > bucket will remain fairly low because each backend spends only
1% of
> > > > > > its time in the bucket corresponding to any given partition.
> > > > > >
> > > > >
> > > > > I share another result of performance evaluation between current
HEAD
> > > > > and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
> > > > >
> > > > > Type of table: normal table, unlogged table
> > > > > Number of child tables : 16, 64 (all tables are located on the
same tablespace)
> > > > > Number of clients : 32
> > > > > Number of trials : 100
> > > > > Duration: 180 seconds for each trials
> > > > >
> > > > > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores),
256GB
> > > > > RAM, NVMe SSD 1.5TB.
> > > > > Each clients load 10kB random data across all partitioned tables.
> > > > >
> > > > > Here is the result.
> > > > >
> > > > >  childs |   type   | target  |  avg_tps   | diff with HEAD
> > > > > +--+-++--
> > > > >  16 | normal   | HEAD|   1643.833 |
> > > > >  16 | normal   | Patched |  1619.5404 |  0.985222
> > > > >  16 | unlogged | HEAD|  9069.3543 |
> > > > >  16 | unlogged | Patched |  9368.0263 |  1.032932
> > > > >  64 | normal   | HEAD|   1598.698 |
> > > > >  64 | normal   | Patched |  1587.5906 |  0.993052
> > > > >  64 | unlogged | HEAD|  9629.7315 |
> > > > >  64 | unlogged | Patched | 10208.2196 |  1.060073
> > > > > (8 rows)
> > > > >
> > > > > For normal tables, loading tps decreased 1% ~ 2% with this patch
> > > > > whereas it increased 3% ~ 6% for unlogged tables. There were
> > > > > collisions at 0 ~ 5 relation extension lock slots between 2
relations
> > > > > in the 64 child tables case but it didn't seem to affect the tps.
> > > > >
> > > >
> > > > AFAIU, this resembles the workload that Andres was worried about.
I
> > > > think we should once run this test in a different environment, but
> > > > considering this to be correct and repeatable, where do we go with
> > > > this patch especially when we know it improves many workloads [1] as
> > > > well.  We know that on a pathological case constructed by Mithun
[2],
> > > > this causes regression as well.  I am not sure if the test done by
> > > > Mithun really mimics any real-world workload as he has tested by
> > > > making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
> > > >
> > > > Sawada-San, if you have a script or data for the test done by you,
> > > > then please share it so that others can also try to reproduce it.
> > >
> > > Unfortunately the environment I used for performance verification is
> > > no longer available.
> > >
> > > I agree to run this test in a different environment. I've attached the
> > > rebased version patch. I'm measuring the performance with/without
> > > patch, so will share the results.
> > >
> >
> > Thanks Sawada-san for patch.
> >
> > For the last few days, I have been reading this thread and reviewing the v13
patch.  To debug and test, I rebased the v13 patch and compared my
rebased patch with the v14 patch. I think the ordering of header files is not
alphabetical in the v14 patch. (I haven't reviewed the v14 patch fully because,
before reviewing, I wanted to test false sharing.)  While debugging, I didn't
notice any hang or lock-related issues.
> >
> > I did some testing to check for false sharing (bulk insert, COPY data, bulk
insert into partitioned tables).  Below is the testing summary.
> >
> > Test setup(Bulk insert into partition tables):
> > autovacuum=off
> > shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min
> >
> > Basically, I created a table with 13 

Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-07 Thread Andres Freund
Hi,

On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> I performed the same test in pg11 and reproduced the issue on the
> commit prior to a4ccc1cef5a04 (Generational memory allocator).
> 
> ulimit -s 1024
> ulimit -v 30
> 
> wal_level = logical
> max_replication_slots = 4
> 
> [...]

> After that, I applied the "Generational memory allocator" patch and
> that solved the issue. From the error message, it is evident that the
> underlying code is trying to allocate a MaxTupleSize memory for each
> tuple. So, I re-introduced the following lines (which are removed by
> a4ccc1cef5a04) on top of the patch:

> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size 
> tuple_len)
> 
> alloc_len = tuple_len + SizeofHeapTupleHeader;
> 
> +   if (alloc_len < MaxHeapTupleSize)
> +   alloc_len = MaxHeapTupleSize;

Maybe I'm being slow here - but what does this actually prove? Before
the generation contexts were introduced we avoided fragmentation (which
would make things unusably slow) using a brute force method (namely
forcing all tuple allocations to be of the same/maximum size).

Which means that yes, we'll need more memory than necessary. Do you
think you see anything but that here?

It's good that the situation is better now, but I don't think this means
we need to necessarily backpatch something nontrivial?

Greetings,

Andres Freund




Re: [Proposal] Global temporary tables

2020-02-07 Thread Pavel Stehule
On Fri, Feb 7, 2020 at 18:28, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 07.02.2020 18:15, Robert Haas wrote:
> > On Wed, Feb 5, 2020 at 10:48 AM Konstantin Knizhnik
> >  wrote:
> > My answer is - yes.
> >> Just because:
> >> - Such behavior is compatible with regular tables. So it will not
> >> confuse users and doesn't require some complex explanations.
> >> - It is compatible with Oracle.
> >> - It is what DBA usually want when creating index.
> >> -
> >> There are several arguments against such behavior:
> >> - Concurrent building of index in multiple sessions can consume a lot of
> >> memory
> >> - Building index can increase query execution time (which can be not
> >> expected by clients)
> > I think those are good arguments, especially the second one. There's
> > no limit on how long building a new index might take, and it could be
> > several minutes. A user who was running a query that could have
> > completed in a few seconds or even milliseconds will be unhappy to
> > suddenly wait a long time for a new index to be built. And that is an
> > entirely realistic scenario, because the new index might be better,
> > but only marginally.
> Yes, I agree that these arguments are important.
> But IMHO they are less important than incompatible behavior (Pavel doesn't agree
> with the word "incompatible" in this context,
> since the semantics of temp tables are in any case different from the semantics of
> regular tables).
>
> Just want to note that if we have a huge GTT (so that creation of an index
> takes a significant amount of time),
> a sequential scan of this table will not be fast either.
>
> But in any case, if we agree that we can control this behavior using a GUC
> or an index property,
> then it is ok for me.
>
>
>
> >
> > Also, an important point to which I've already alluded a few times is
> > that creating an index can fail. Now, one way it can fail is that
> > there could be some problem writing to disk, or you could run out of
> > memory, or whatever. However, it can also fail because the new index
> > is UNIQUE and the data this backend has in the table doesn't conform
> > to the associated constraint. It will be confusing if all access to a
> > table suddenly starts complaining about uniqueness violations.
>
> Yes, building an index can fail (as can any other operation on the database).
> What's wrong with that?
> If it is a fatal error, then the backend is terminated and the content of its temp
> table disappears.
> If it is a non-fatal error, then the current transaction is aborted:
>
>
> Session1:
> postgres=# create global temp table gtt(x integer);
> CREATE TABLE
> postgres=# insert into gtt values (generate_series(1,10));
> INSERT 0 10
>
> Session2:
> postgres=# insert into gtt values (generate_series(1,10));
> INSERT 0 10
> postgres=# insert into gtt values (1);
> INSERT 0 1
>

What happens when session 2 has an active transaction? To be correct, you
should then wait until the end of that transaction before creating the index.


> Session1:
> postgres=# create unique index on gtt(x);
> CREATE INDEX
>
> Sessin2:
> postgres=# explain select * from gtt where x=1;
> ERROR:  could not create unique index "gtt_x_idx"
> DETAIL:  Key (x)=(1) is duplicated.
>

This is a little bit unexpected behavior (probably nobody expects a SELECT to
fail with the error "could not create index"). I understand the reason and the
context exactly, but this side effect is something I am afraid of.


>
> > I don't believe that the feature you are proposing can be correctly
> > implemented in 10 lines of code. I would be pleasantly surprised if it
> > can be done in 1000.
> >
> Right now I do not see any sources of extra complexity.
> I will be pleased if you can point them out to me.
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-07 Thread Andres Freund
Hi,

On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> And, the issue got reproduced with the same error:
> WARNING:  problem in Generation Tuples: number of free chunks 0 in
> block 0x7fe9e9e74010 exceeds 1018 allocated

That seems like a problem in generation.c - because this should be
unreachable, I think?

Tomas?

Regards,

Andres




Re: Assumptions about the number of parallel workers

2020-02-07 Thread Andres Freund
Hi,

On 2020-02-07 09:44:34 +0100, Antonin Houska wrote:
> Those Gather nodes still have non-zero num_workers, see this part of
> standard_planner:
> 
> if (force_parallel_mode != FORCE_PARALLEL_OFF && top_plan->parallel_safe)
> {
> ...
> gather->num_workers = 1;
>   gather->single_copy = true;

Ick. Looks like you might be right...


> Also, if num_workers were zero for any reason, my patch would probably make
> the regression tests fail. However, I haven't noticed any assertion failures.

That, however, is not at all guaranteed. The regression tests don't run
(or at least not much) with force_parallel_mode set. You can try
yourself though, with something like

PGOPTIONS='-c force_parallel_mode=regress' make check

Greetings,

Andres Freund




Re: Internal key management system

2020-02-07 Thread Andres Freund
Hi,

On 2020-02-07 20:44:31 +0900, Masahiko Sawada wrote:
> Yeah I'm not going to use pgcrypto for transparent data encryption.
> The KMS patch includes the new basic infrastructure for cryptographic
> functions (mainly AES-CBC). I'm thinking we can expand that
> infrastructure so that we can also use it for TDE purpose by
> supporting new cryptographic functions such as AES-CTR. Anyway, I
> agree to not have it depend on pgcrypto.

I thought for a minute, before checking the patch, that you were saying
above that the KMS patch includes its *own* implementation of
cryptographic functions.  I think it's pretty crucial that it continues
not to do that...

Greetings,

Andres Freund




Re: In PG12, query with float calculations is slower than PG11

2020-02-07 Thread Andres Freund
Hi,

On 2020-02-07 17:17:21 +0900, Amit Langote wrote:
> I did some tests using two relatively recent compilers: gcc 8 and
> clang-7 and here are the results:

Hm, these very much look like they've been done in an unoptimized build?

> 40.62%  postgres  postgres   [.] ExecInterpExpr
>  9.74%  postgres  postgres   [.] float8_accum
>  6.12%  postgres  libc-2.17.so   [.] __isinf
>  5.96%  postgres  postgres   [.] float8mul
>  5.33%  postgres  postgres   [.] dsqrt
>  3.90%  postgres  postgres   [.] ftod
>  3.53%  postgres  postgres   [.] Float8GetDatum
>  2.34%  postgres  postgres   [.] DatumGetFloat8
>  2.15%  postgres  postgres   [.] AggCheckCallContext
>  2.03%  postgres  postgres   [.] slot_deform_tuple
>  1.95%  postgres  libm-2.17.so   [.] __sqrt
>  1.19%  postgres  postgres   [.] check_float8_array

> HEAD
> 
> latency average = 549.071 ms
> 
> 31.74%  postgres  postgres   [.] ExecInterpExpr
> 11.02%  postgres  libc-2.17.so   [.] __isinf
> 10.58%  postgres  postgres   [.] float8_accum
>  4.84%  postgres  postgres   [.] check_float8_val
>  4.66%  postgres  postgres   [.] dsqrt
>  3.91%  postgres  postgres   [.] float8mul
>  3.56%  postgres  postgres   [.] ftod
>  3.26%  postgres  postgres   [.] Float8GetDatum
>  2.91%  postgres  postgres   [.] float8_mul
>  2.30%  postgres  postgres   [.] DatumGetFloat8
>  2.19%  postgres  postgres   [.] slot_deform_heap_tuple
>  1.81%  postgres  postgres   [.] AggCheckCallContext
>  1.31%  postgres  libm-2.17.so   [.] __sqrt
>  1.25%  postgres  postgres   [.] check_float8_array

Because DatumGetFloat8, Float8GetDatum, etc aren't functions that
normally stay separate.

Greetings,

Andres Freund




Re: In PG12, query with float calculations is slower than PG11

2020-02-07 Thread Tels

Moin,

On 2020-02-07 15:42, Emre Hasegeli wrote:

> The patch looks unduly invasive to me, but I think that it might be
> right that we should go back to a macro-based implementation, because
> otherwise we don't have a good way to be certain that the function
> parameter won't get evaluated first.

I'd first like to see some actual evidence of this being a problem,
rather than just the order of the checks.


There seem to be enough evidence of this being the problem.  We are
better off going back to the macro-based implementation.  I polished
Keisuke Kuroda's patch commenting about the performance issue, removed
the check_float*_val() functions completely, and added unlikely() as
Tom Lane suggested.  It is attached.  I confirmed with different
compilers that the macro, and unlikely() makes this noticeably faster.


Hm, the diff has the macro tests as:

 +  if (unlikely(isinf(val) && !(inf_is_valid)))
 ...
 +  if (unlikely((val) == 0.0 && !(zero_is_valid)))

But the comment does not explain that this test has to be in that
order, or with non-constant arguments the zero_is_valid/inf_is_valid
expression would get evaluated first. E.g., if I understand this correctly:

 +  if (!(zero_is_valid) && unlikely((val) == 0.0))

would have the same problem of evaluating "zero_is_valid" (which
might be an isinf(arg1) || isinf(arg2)) first, and so be the same thing
we try to avoid with the macro? Maybe adding this bit of info to the
comment would make it clearer?
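
For illustration, here is a minimal standalone sketch of the difference the
macro is meant to preserve -- the validity expression only being evaluated on
the error path (names made up here, not the actual float.h code):

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Function version: the inf_is_valid argument is computed by the caller
 * before the call, even in the common case where val is finite (unless
 * the compiler manages to inline everything and optimize it away). */
static inline void
check_val_fn(double val, bool inf_is_valid)
{
	if (isinf(val) && !inf_is_valid)
		printf("overflow\n");
}

/* Macro version: && short-circuits, so the possibly expensive
 * inf_is_valid expression is only evaluated when val really is Inf. */
#define CHECK_VAL(val, inf_is_valid) \
	do { \
		if (isinf(val) && !(inf_is_valid)) \
			printf("overflow\n"); \
	} while (0)

int
main(void)
{
	double		arg1 = 2.0;
	double		arg2 = 3.0;
	double		result = arg1 * arg2;

	/* both isinf(argN) calls are evaluated to build the argument */
	check_val_fn(result, isinf(arg1) || isinf(arg2));

	/* isinf(argN) is skipped here, because result is finite */
	CHECK_VAL(result, isinf(arg1) || isinf(arg2));

	return 0;
}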

Also, a few places use the macro as:

 +  CHECKFLOATVAL(result, true, true);

which evaluates to a complete NOP in both cases. IMHO this could be
replaced with a comment like:

 +  // No CHECKFLOATVAL() needed, as both inf and 0.0 are valid

(or something along the lines of "no error can occur"), as otherwise
CHECKFLOATVAL() implies to the casual reader that there are some checks
done, while in reality no real checks are done at all (and hopefully
the compiler optimizes everything away, which might not be true for
debug builds).

--
Best regards,

Tels

From e869373ad093e668872f08833de2c5c614aab673 Mon Sep 17 00:00:00 2001
From: Emre Hasegeli 
Date: Fri, 7 Feb 2020 10:27:25 +
Subject: [PATCH] Bring back CHECKFLOATVAL() macro

The inline functions added by 6bf0bc842b caused the conditions of
overflow/underflow checks to be evaluated when no overflow/underflow
happen.  This slowed down floating point operations.  This commit brings
back the macro that was in use before 6bf0bc842b to fix the performance
regression.

Reported-by: Keisuke Kuroda 
Author: Keisuke Kuroda 
Discussion: https://www.postgresql.org/message-id/CANDwggLe1Gc1OrRqvPfGE%3DkM9K0FSfia0hbeFCEmwabhLz95AA%40mail.gmail.com
---
 src/backend/utils/adt/float.c   | 66 ++---
 src/backend/utils/adt/geo_ops.c |  2 +-
 src/include/utils/float.h   | 75 ++---
 3 files changed, 66 insertions(+), 77 deletions(-)

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index a90d4db215..5885719850 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -1184,21 +1184,21 @@ ftod(PG_FUNCTION_ARGS)
 
 
 /*
  *		dtof			- converts a float8 number to a float4 number
  */
 Datum
 dtof(PG_FUNCTION_ARGS)
 {
 	float8		num = PG_GETARG_FLOAT8(0);
 
-	check_float4_val((float4) num, isinf(num), num == 0);
+	CHECKFLOATVAL((float4) num, isinf(num), num == 0);
 
 	PG_RETURN_FLOAT4((float4) num);
 }
 
 
 /*
  *		dtoi4			- converts a float8 number to an int4 number
  */
 Datum
 dtoi4(PG_FUNCTION_ARGS)
@@ -1438,36 +1438,36 @@ dsqrt(PG_FUNCTION_ARGS)
 	float8		arg1 = PG_GETARG_FLOAT8(0);
 	float8		result;
 
 	if (arg1 < 0)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION),
  errmsg("cannot take square root of a negative number")));
 
 	result = sqrt(arg1);
 
-	check_float8_val(result, isinf(arg1), arg1 == 0);
+	CHECKFLOATVAL(result, isinf(arg1), arg1 == 0);
 	PG_RETURN_FLOAT8(result);
 }
 
 
 /*
  *		dcbrt			- returns cube root of arg1
  */
 Datum
 dcbrt(PG_FUNCTION_ARGS)
 {
 	float8		arg1 = PG_GETARG_FLOAT8(0);
 	float8		result;
 
 	result = cbrt(arg1);
-	check_float8_val(result, isinf(arg1), arg1 == 0);
+	CHECKFLOATVAL(result, isinf(arg1), arg1 == 0);
 	PG_RETURN_FLOAT8(result);
 }
 
 
 /*
  *		dpow			- returns pow(arg1,arg2)
  */
 Datum
 dpow(PG_FUNCTION_ARGS)
 {
@@ -1525,40 +1525,40 @@ dpow(PG_FUNCTION_ARGS)
 			/* The sign of Inf is not significant in this case. */
 			result = get_float8_infinity();
 		else if (fabs(arg1) != 1)
 			result = 0;
 		else
 			result = 1;
 	}
 	else if (errno == ERANGE && result != 0 && !isinf(result))
 		result = get_float8_infinity();
 
-	check_float8_val(result, isinf(arg1) || isinf(arg2), arg1 == 0);
+	CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2), arg1 == 0);
 	PG_RETURN_FLOAT8(result);
 }
 
 
 /*
  *		dexp			- returns the exponential function of arg1
  */
 Datum
 dexp(PG_FUNCTION_ARGS)
 {
 	float8		arg1 = PG_GETARG_FLOAT8(0);
 	float8		result;
 
 	errno = 0;
 	

Re: [Proposal] Global temporary tables

2020-02-07 Thread Robert Haas
On Fri, Feb 7, 2020 at 12:28 PM Konstantin Knizhnik
 wrote:
> But in any case, if we agree that we can control this behavior using a GUC
> or an index property,
> then it is ok for me.

Nope, I am not going to agree to that, and I don't believe that any
other committer will, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Global temporary tables

2020-02-07 Thread Konstantin Knizhnik

Fix GTT index initialization.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 33e2d28..93059ef 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -178,7 +178,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		for (block = first_block; block <= last_block; ++block)
 		{
 			CHECK_FOR_INTERRUPTS();
-			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer.data);
+			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer.data, false);
 			++blocks_done;
 		}
 	}
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2..39baddc 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,19 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	/*
+	 * For global temp table only
+	 * use AccessExclusiveLock for ensure safety
+	 */
+	{
+		{
+			"on_commit_delete_rows",
+			"global temp table on commit options",
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
+			ShareUpdateExclusiveLock
+		},
+		false
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1486,6 +1499,8 @@ bytea *
 default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 {
 	static const relopt_parse_elt tab[] = {
+		{"on_commit_delete_rows", RELOPT_TYPE_BOOL,
+		offsetof(StdRdOptions, on_commit_delete_rows)},
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)},
 		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
@@ -1586,13 +1601,17 @@ build_reloptions(Datum reloptions, bool validate,
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+	static const relopt_parse_elt tab[] = {
+		{"on_commit_delete_rows", RELOPT_TYPE_BOOL,
+		offsetof(StdRdOptions, on_commit_delete_rows)}
+	};
 	/*
 	 * There are no options for partitioned tables yet, but this is able to do
 	 * some validation.
 	 */
 	return (bytea *) build_reloptions(reloptions, validate,
 	  RELOPT_KIND_PARTITIONED,
-	  0, NULL, 0);
+	  sizeof(StdRdOptions), tab, lengthof(tab));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3fa4b76..a86de50 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -670,6 +670,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 			 * init fork of an unlogged relation.
 			 */
 			if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ||
 (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
  forkNum == INIT_FORKNUM))
 log_smgrcreate(newrnode, forkNum);
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 7d6acae..7c48e5c 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -396,6 +396,9 @@ GetNewRelFileNode(Oid reltablespace, Relation pg_class, char relpersistence)
 		case RELPERSISTENCE_TEMP:
 			backend = BackendIdForTempRelations();
 			break;
+		case RELPERSISTENCE_SESSION:
+			backend = BackendIdForSessionRelations();
+			break;
 		case RELPERSISTENCE_UNLOGGED:
 		case RELPERSISTENCE_PERMANENT:
 			backend = InvalidBackendId;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8880586..22ce895 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3707,7 +3707,7 @@ reindex_relation(Oid relid, int flags, int options)
 		if (flags & REINDEX_REL_FORCE_INDEXES_UNLOGGED)
 			persistence = RELPERSISTENCE_UNLOGGED;
 		else if (flags & REINDEX_REL_FORCE_INDEXES_PERMANENT)
-			persistence = RELPERSISTENCE_PERMANENT;
+			persistence = rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ? RELPERSISTENCE_SESSION : RELPERSISTENCE_PERMANENT;
 		else
 			persistence = rel->rd_rel->relpersistence;
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1..9747835 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -92,6 +92,10 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 			backend = InvalidBackendId;
 			needs_wal = false;
 			break;
+		case RELPERSISTENCE_SESSION:
+			backend = BackendIdForSessionRelations();
+			needs_wal = false;
+			break;
 		case RELPERSISTENCE_PERMANENT:
 			backend = InvalidBackendId;
 			needs_wal = true;
@@ -367,7 +371,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();
 
-		smgrread(src, forkNum, blkno, buf.data);
+		smgrread(src, forkNum, blkno, buf.data, false);
 
 		if (!PageIsVerified(page, blkno))
 			ereport(ERROR,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c9e6060..1f5e52b 

allow frontend use of the backend's core hashing functions

2020-02-07 Thread Robert Haas
Late last year, I did some work to make it possible to use simplehash
in frontend code.[1] However, a hash table is not much good unless one
also has some hash functions that one can use to hash the keys that
need to be inserted into that hash table. I initially thought that
solving this problem was going to be pretty annoying, but when I
looked at it again today I found what I think is a pretty simple way
to adapt things so that the hashing routines we use in the backend are
easily available to frontend code.

Here are some patches for that. It may make sense to combine some of
these in terms of actually getting this committed, but I have
separated them here so that it is, hopefully, easy to see what I did
and why I did it. There are three basic problems which are solved by
the three preparatory patches:

0001 - hashfn.c has a couple of routines that depend on bitmapsets,
and bitmapset.c is currently backend-only. Fix by moving this code
near related code in bitmapset.c.

0002 - some of the prototypes for functions in hashfn.c are in
hsearch.h, mixed with the dynahash stuff, and others are in
hashutils.h. Fix by making hashutils.h the one true header for
hashfn.c.

0003 - Some of hashfn.c's routines return Datum, but that's
backend-only. Fix by renaming the functions and changing the return
types to uint32 and uint64, and adding static inline wrappers with the
old names that convert to Datum. Just changing the return types of the
existing functions seemed like it would've required a lot more code
churn, and also seems like it could cause silent breakage in the
future.
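
As a rough sketch of the 0003 pattern (hypothetical names, not necessarily
what the patches actually use; assumes postgres.h for uint32, Datum, and
UInt32GetDatum):

/* in the common, frontend-usable header: plain integer return type */
extern uint32 example_hash_bytes(const unsigned char *k, int keylen);

/* backend-only static inline wrapper keeping a Datum-returning name */
static inline Datum
example_hash_bytes_datum(const unsigned char *k, int keylen)
{
	return UInt32GetDatum(example_hash_bytes(k, keylen));
}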

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] 
http://postgr.es/m/CA+Tgmob8oyh02NrZW=xCScB+5GyJ-jVowE3+TWTUmPF=fsg...@mail.gmail.com


0001-Move-bitmap_hash-and-bitmap_match-to-bitmapset.c.patch
Description: Binary data


0004-Move-src-backend-utils-hash-hashfn.c-to-src-common.patch
Description: Binary data


0003-Adapt-hashfn.c-and-hashutils.h-for-frontend-use.patch
Description: Binary data


0002-Put-all-the-prototypes-for-hashfn.c-into-the-same-he.patch
Description: Binary data


Re: [Proposal] Global temporary tables

2020-02-07 Thread Konstantin Knizhnik




On 07.02.2020 18:15, Robert Haas wrote:

On Wed, Feb 5, 2020 at 10:48 AM Konstantin Knizhnik
 wrote:
My answer is - yes.

Just because:
- Such behavior is compatible with regular tables. So it will not
confuse users and doesn't require some complex explanations.
- It is compatible with Oracle.
- It is what DBA usually want when creating index.
-
There are several arguments against such behavior:
- Concurrent building of index in multiple sessions can consume a lot of
memory
- Building index can increase query execution time (which can be not
expected by clients)

I think those are good arguments, especially the second one. There's
no limit on how long building a new index might take, and it could be
several minutes. A user who was running a query that could have
completed in a few seconds or even milliseconds will be unhappy to
suddenly wait a long time for a new index to be built. And that is an
entirely realistic scenario, because the new index might be better,
but only marginally.

Yes, I agree that these arguments are important.
But IMHO they are less important than incompatible behavior (Pavel doesn't agree
with the word "incompatible" in this context,
since the semantics of temp tables are in any case different from the semantics of
regular tables).


Just want to note that if we have a huge GTT (so that creation of an index
takes a significant amount of time),
a sequential scan of this table will not be fast either.

But in any case, if we agree that we can control this behavior using a GUC
or an index property,
then it is ok for me.





Also, an important point to which I've already alluded a few times is
that creating an index can fail. Now, one way it can fail is that
there could be some problem writing to disk, or you could run out of
memory, or whatever. However, it can also fail because the new index
is UNIQUE and the data this backend has in the table doesn't conform
to the associated constraint. It will be confusing if all access to a
table suddenly starts complaining about uniqueness violations.


Yes, building an index can fail (as can any other operation on the database).
What's wrong with that?
If it is a fatal error, then the backend is terminated and the content of its temp
table disappears.
If it is a non-fatal error, then the current transaction is aborted:


Session1:
postgres=# create global temp table gtt(x integer);
CREATE TABLE
postgres=# insert into gtt values (generate_series(1,10));
INSERT 0 10

Session2:
postgres=# insert into gtt values (generate_series(1,10));
INSERT 0 10
postgres=# insert into gtt values (1);
INSERT 0 1

Session1:
postgres=# create unique index on gtt(x);
CREATE INDEX

Sessin2:
postgres=# explain select * from gtt where x=1;
ERROR:  could not create unique index "gtt_x_idx"
DETAIL:  Key (x)=(1) is duplicated.


I don't believe that the feature you are proposing can be correctly
implemented in 10 lines of code. I would be pleasantly surprised if it
can be done in 1000.


Right now I do not see any sources of extra complexity.
I will be pleased if you can point them out to me.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] emergency outage requiring database restart

2020-02-07 Thread Merlin Moncure
On Tue, Jan 3, 2017 at 1:05 PM Peter Eisentraut
 wrote:
>
> On 11/7/16 5:31 PM, Merlin Moncure wrote:
> > Regardless, it seems like you might be on to something, and I'm
> > inclined to patch your change, test it, and roll it out to production.
> > If it helps or at least narrows the problem down, we ought to give it
> > consideration for inclusion (unless someone else can think of a good
> > reason not to do that, heh!).
>
> Any results yet?

Not yet.  But I do have some interesting findings.  At this point I
do not think the problem is within pl/sh itself, but rather that when a
process invoked from pl/sh misbehaves, that misbehavior can
penetrate into the database processes.  I also believe that this
problem is fd related, so the 'close on exec' flag might reasonably
fix it.  All cases of database damage I have observed remain
completely mitigated by enabling database checksums.

Recently, a sqsh process kicked off via pl/sh crashed with signal 11,
but the database process was otherwise intact and fine.  This is
strong supporting evidence for my points above, I think.  I've also
turned up a fairly reliable reproduction case from some unrelated
application changes.  If I can demonstrate that the close-on-exec flag
works and prevents these occurrences, we can close the book on this.
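
For reference, the kind of change I have in mind is just marking the relevant
descriptors close-on-exec, roughly like this (a sketch of the idea, not an
actual pl/sh patch):

#include <fcntl.h>

/* Keep fd from being inherited by programs exec()ed from child processes. */
static int
set_cloexec(int fd)
{
	int			flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}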

merlin




Re: Index Skip Scan

2020-02-07 Thread Dmitry Dolgov
> On Thu, Feb 06, 2020 at 09:22:20PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthali...@gmail.com> 
> wrote in 
> > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote:
> > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthali...@gmail.com> 
> > > wrote in
> > > We could add an additional parameter "in_cursor" to
> > > ExecSupportBackwardScan and let skip scan return false if in_cursor is
> > > true, but I'm not sure it's acceptable.
> > 
> > I also was thinking about whether it's possible to use
> > ExecSupportBackwardScan here, but skip scan is just a mode of an
> > index/indexonly scan. Which means that ExecSupportBackwardScan also need
> > to know somehow if this mode is being used, and then, since this
> > function is called after it's already decided to use skip scan in the
> > resulting plan, somehow correct the plan (exclude skipping and try to
> > find next best path?) - do I understand your suggestion correct?
> 
> I hadn't thought about it that hard, but a bit of checking told me that
> IndexSupportsBackwardScan returns a fixed flag for each AM.  It seems that
> things are not that simple.

Yes, I've mentioned that already in one of the previous emails :) The
simplest way I see to achieve what we want is to do something like the
attached modified version with a new hasDeclaredCursor field. It's not a
final version, but it is posted just for discussion, so feel free to
suggest any improvements or alternatives.
>From 22e6b4ccd5f79ca069bd5cd90ba3696dd97f76ea Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 06:44:57 -0400
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c   |  14 +++
 src/backend/nodes/print.c  |  39 +++
 src/backend/optimizer/path/Makefile|   3 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/indxpath.c  |  41 +++
 src/backend/optimizer/path/pathkeys.c  |  71 ++--
 src/backend/optimizer/path/uniquekey.c | 147 +
 src/backend/optimizer/plan/planagg.c   |   1 +
 src/backend/optimizer/plan/planmain.c  |   1 +
 src/backend/optimizer/plan/planner.c   |  17 ++-
 src/backend/optimizer/util/pathnode.c  |  12 ++
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  18 +++
 src/include/nodes/print.h  |   1 +
 src/include/optimizer/pathnode.h   |   1 +
 src/include/optimizer/paths.h  |  11 ++
 16 files changed, 373 insertions(+), 13 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj)
 			case T_PathKey:
 _outPathKey(str, obj);
 break;
+			case T_UniqueKey:
+_outUniqueKey(str, obj);
+break;
 			case T_PathTarget:
 _outPathTarget(str, obj);
 break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 42476724d8..d286b34544 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }
 
+/*
+ * print_uniquekeys -
+ *	  uniquekeys list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while 

Re: proposal: schema variables

2020-02-07 Thread Pavel Stehule
Hi

rebase

Regards

Pavel


0002-transactional-variables-20200207.patch.gz
Description: application/gzip


0001-schema-variables-20200207.patch.gz
Description: application/gzip


Does recovery write to backup_label ?

2020-02-07 Thread Chapman Flack
Just saw this in a PG 11.6 cluster starting a recovery:

2020-02-07 10:45:40 EST FATAL:  42501: could not open file
"backup_label": Permission denied
2020-02-07 10:45:40 EST LOCATION:  fsync_fname_ext, fd.c:3531

The label file was written with mode 0400 by a script that got
the contents from pg_stop_backup(boolean,boolean).

But during recovery, it is being poked at by fsync_fname_ext
which wants to open it O_RDWR.

I had assumed the label file would be treated as readonly
during recovery.

If the file needs to have 0600 permissions, should there be
a note in the nonexclusive-mode backup docs to say so?

Regards,
-Chap





Re: [PATCH] Erase the distinctClause if the result is unique by definition

2020-02-07 Thread Ashutosh Bapat
Hi Andy,
What might help is to add more description to your email message like
giving examples to explain your idea.

Anyway, I looked at the testcases you added for examples.
+create table select_distinct_a(a int, b char(20),  c char(20) not null,  d
int, e int, primary key(a, b));
+set enable_mergejoin to off;
+set enable_hashjoin to off;
+-- no node for distinct.
+explain (costs off) select distinct * from select_distinct_a;
+  QUERY PLAN
+---
+ Seq Scan on select_distinct_a
+(1 row)

From this example, it seems that the distinct operation can be dropped
because (a, b) is a primary key. Is my understanding correct?

I like the idea since it eliminates one expensive operation.

However the patch as presented has some problems
1. What happens if the primary key constraint or NOT NULL constraint gets
dropped between a prepare and execute? The plan will no more be valid and
thus execution may produce non-distinct results. PostgreSQL has similar
concept of allowing non-grouping expression as part of targetlist when
those expressions can be proved to be functionally dependent on the GROUP
BY clause. See check_functional_grouping() and its caller. I think,
DISTINCT elimination should work on similar lines.
2. For the same reason described in check_functional_grouping(), using
unique indexes for eliminating DISTINCT should be discouraged.
3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
as well
4. The patch works only at the query level, but that functionality can be
expanded generally to other places which add Unique/HashAggregate/Group
nodes if the underlying relation can be proved to produce distinct rows.
But that's probably more work since we will have to label paths with unique
keys similar to pathkeys.
5. Have you tested this with OUTER joins, which can render the inner side nullable?

On Thu, Feb 6, 2020 at 11:31 AM Andy Fan  wrote:

> updated the patch to take semi/anti joins into consideration.
>
> Can anyone help review this patch?
>
> Thanks
>
>
> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan  wrote:
>
>> Hi:
>>
>> I wrote a patch to erase the distinctClause if the result is unique by
>> definition.  I found this because a user switched some code from Oracle
>> to PG and found the performance was bad due to this, so I adapted PG for
>> this as well.
>>
>> This patch doesn't help a well-written SQL query, but some drawbacks
>> of a query may not be very obvious; since the cost of the check is pretty
>> low as well, I think it would be ok to add it.
>>
>> Please see the patch for details.
>>
>> Thank you.
>>
>

-- 
--
Best Wishes,
Ashutosh Bapat


Re: improve transparency of bitmap-only heap scans

2020-02-07 Thread Alexey Bashtanov

Hello,


It took me a while to figure out what those names mean.  "unfetched",
as you call it in the code, may be more descriptive than "avoided" for
the new label.  However, I think the other two are more confusing.  It
may be a good idea to change them together with this.

It would be sad if this patch were forgotten only because of the choice of words.
I've changed it all to "unfetched", so that at least the same thing is not called
differently in the code and in the output, and I have also rebased it and made it
fit within the 80-column line width limit.
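
With that, the text-format EXPLAIN ANALYZE line comes out roughly like this
(numbers invented for illustration):

  Heap Blocks: unfetched=12 exact=3 lossy=1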


Best, Alex
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..c7200d2a21 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2777,6 +2777,8 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 {
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
+		ExplainPropertyInteger("Unfetched Heap Blocks", NULL,
+			   planstate->unfetched_pages, es);
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
 			   planstate->exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
@@ -2784,10 +2786,14 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (planstate->unfetched_pages > 0 || planstate->exact_pages > 0
+		   || planstate->lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
+			if (planstate->unfetched_pages > 0)
+appendStringInfo(es->str, " unfetched=%ld",
+	planstate->unfetched_pages);
 			if (planstate->exact_pages > 0)
 appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
 			if (planstate->lossy_pages > 0)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index ae8a11da30..0456f8592b 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -231,17 +231,20 @@ BitmapHeapNext(BitmapHeapScanState *node)
  * node->return_empty_tuples.
  */
 node->return_empty_tuples = tbmres->ntuples;
+node->unfetched_pages++;
 			}
 			else if (!table_scan_bitmap_next_block(scan, tbmres))
 			{
 /* AM doesn't think this block is valid, skip */
 continue;
 			}
-
-			if (tbmres->ntuples >= 0)
-node->exact_pages++;
 			else
-node->lossy_pages++;
+			{
+if (tbmres->ntuples >= 0)
+	node->exact_pages++;
+else
+	node->lossy_pages++;
+			}
 
 			/* Adjust the prefetch target */
 			BitmapAdjustPrefetchTarget(node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d5b38b879..0bd75c329c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1564,6 +1564,7 @@ typedef struct ParallelBitmapHeapState
  *		pvmbuffer		   ditto, for prefetched pages
  *		exact_pages		   total number of exact pages retrieved
  *		lossy_pages		   total number of lossy pages retrieved
+ *		unfetched_pages	   total number of pages not retrieved due to vm
  *		prefetch_iterator  iterator for prefetching ahead of current page
  *		prefetch_pages	   # pages prefetch iterator is ahead of current
  *		prefetch_targetcurrent target prefetch distance
@@ -1588,6 +1589,7 @@ typedef struct BitmapHeapScanState
 	Buffer		pvmbuffer;
 	long		exact_pages;
 	long		lossy_pages;
+	long		unfetched_pages;
 	TBMIterator *prefetch_iterator;
 	int			prefetch_pages;
 	int			prefetch_target;


Re: Assumptions about the number of parallel workers

2020-02-07 Thread Robert Haas
On Wed, Feb 5, 2020 at 4:49 AM Antonin Houska  wrote:
> I can't figure out why ExecGather/ExecGatherMerge do check whether num_workers
> is non-zero. I think the code would be a bit clearer if these tests were
> replaced with Assert() statements, as the attached patch does.

Hmm. There are some cases where we plan on using a Gather node but
then can't actually fire up parallelism because we run out of DSM
segments or we run out of background workers. But the Gather is just
part of the plan, so it would still have num_workers > 0 in those
cases. This might just have been a thinko on my part, but I'm not
totally sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Proposal] Global temporary tables

2020-02-07 Thread Robert Haas
On Wed, Feb 5, 2020 at 10:48 AM Konstantin Knizhnik
 wrote:
> > I don't understand. A global temporary table, as I understand it, is a
> > table for which each session sees separate contents. So you would
> > never need to populate it with existing data.
> Session 1:
> create global temp table gtt(x integer);
> insert into gtt values (generate_series(1,10));
>
> Session 2:
> insert into gtt values (generate_series(1,20));
>
> Session1:
> create index on gtt(x);
> explain select * from gtt where x = 1;
>
> Session2:
> explain select * from gtt where x = 1;
> ??? Should we use index here?

OK, I see where you're coming from now.

> My answer is - yes.
> Just because:
> - Such behavior is compatible with regular tables. So it will not
> confuse users and doesn't require some complex explanations.
> - It is compatible with Oracle.
> - It is what DBA usually want when creating index.
> -
> There are several arguments against such behavior:
> - Concurrent building of index in multiple sessions can consume a lot of
> memory
> - Building index can increase query execution time (which can be not
> expected by clients)

I think those are good arguments, especially the second one. There's
no limit on how long building a new index might take, and it could be
several minutes. A user who was running a query that could have
completed in a few seconds or even milliseconds will be unhappy to
suddenly wait a long time for a new index to be built. And that is an
entirely realistic scenario, because the new index might be better,
but only marginally.

Also, an important point to which I've already alluded a few times is
that creating an index can fail. Now, one way it can fail is that
there could be some problem writing to disk, or you could run out of
memory, or whatever. However, it can also fail because the new index
is UNIQUE and the data this backend has in the table doesn't conform
to the associated constraint. It will be confusing if all access to a
table suddenly starts complaining about uniqueness violations.

> That is all - just 10 line of code.

I don't believe that the feature you are proposing can be correctly
implemented in 10 lines of code. I would be pleasantly surprised if it
can be done in 1000.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: In PG12, query with float calculations is slower than PG11

2020-02-07 Thread Emre Hasegeli
> > The patch looks unduly invasive to me, but I think that it might be
> > right that we should go back to a macro-based implementation, because
> > otherwise we don't have a good way to be certain that the function
> > parameter won't get evaluated first.
>
> I'd first like to see some actual evidence of this being a problem,
> rather than just the order of the checks.

There seem to be enough evidence of this being the problem.  We are
better off going back to the macro-based implementation.  I polished
Keisuke Kuroda's patch commenting about the performance issue, removed
the check_float*_val() functions completely, and added unlikely() as
Tom Lane suggested.  It is attached.  I confirmed with different
compilers that the macro, and unlikely() makes this noticeably faster.
From e869373ad093e668872f08833de2c5c614aab673 Mon Sep 17 00:00:00 2001
From: Emre Hasegeli 
Date: Fri, 7 Feb 2020 10:27:25 +
Subject: [PATCH] Bring back CHECKFLOATVAL() macro

The inline functions added by 6bf0bc842b caused the conditions of
overflow/underflow checks to be evaluated when no overflow/underflow
happen.  This slowed down floating point operations.  This commit brings
back the macro that was in use before 6bf0bc842b to fix the performance
regression.

Reported-by: Keisuke Kuroda 
Author: Keisuke Kuroda 
Discussion: https://www.postgresql.org/message-id/CANDwggLe1Gc1OrRqvPfGE%3DkM9K0FSfia0hbeFCEmwabhLz95AA%40mail.gmail.com
---
 src/backend/utils/adt/float.c   | 66 ++---
 src/backend/utils/adt/geo_ops.c |  2 +-
 src/include/utils/float.h   | 75 ++---
 3 files changed, 66 insertions(+), 77 deletions(-)

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index a90d4db215..5885719850 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -1184,21 +1184,21 @@ ftod(PG_FUNCTION_ARGS)
 
 
 /*
  *		dtof			- converts a float8 number to a float4 number
  */
 Datum
 dtof(PG_FUNCTION_ARGS)
 {
 	float8		num = PG_GETARG_FLOAT8(0);
 
-	check_float4_val((float4) num, isinf(num), num == 0);
+	CHECKFLOATVAL((float4) num, isinf(num), num == 0);
 
 	PG_RETURN_FLOAT4((float4) num);
 }
 
 
 /*
  *		dtoi4			- converts a float8 number to an int4 number
  */
 Datum
 dtoi4(PG_FUNCTION_ARGS)
@@ -1438,36 +1438,36 @@ dsqrt(PG_FUNCTION_ARGS)
 	float8		arg1 = PG_GETARG_FLOAT8(0);
 	float8		result;
 
 	if (arg1 < 0)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION),
  errmsg("cannot take square root of a negative number")));
 
 	result = sqrt(arg1);
 
-	check_float8_val(result, isinf(arg1), arg1 == 0);
+	CHECKFLOATVAL(result, isinf(arg1), arg1 == 0);
 	PG_RETURN_FLOAT8(result);
 }
 
 
 /*
  *		dcbrt			- returns cube root of arg1
  */
 Datum
 dcbrt(PG_FUNCTION_ARGS)
 {
 	float8		arg1 = PG_GETARG_FLOAT8(0);
 	float8		result;
 
 	result = cbrt(arg1);
-	check_float8_val(result, isinf(arg1), arg1 == 0);
+	CHECKFLOATVAL(result, isinf(arg1), arg1 == 0);
 	PG_RETURN_FLOAT8(result);
 }
 
 
 /*
  *		dpow			- returns pow(arg1,arg2)
  */
 Datum
 dpow(PG_FUNCTION_ARGS)
 {
@@ -1525,40 +1525,40 @@ dpow(PG_FUNCTION_ARGS)
 			/* The sign of Inf is not significant in this case. */
 			result = get_float8_infinity();
 		else if (fabs(arg1) != 1)
 			result = 0;
 		else
 			result = 1;
 	}
 	else if (errno == ERANGE && result != 0 && !isinf(result))
 		result = get_float8_infinity();
 
-	check_float8_val(result, isinf(arg1) || isinf(arg2), arg1 == 0);
+	CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2), arg1 == 0);
 	PG_RETURN_FLOAT8(result);
 }
 
 
 /*
  *		dexp			- returns the exponential function of arg1
  */
 Datum
 dexp(PG_FUNCTION_ARGS)
 {
 	float8		arg1 = PG_GETARG_FLOAT8(0);
 	float8		result;
 
 	errno = 0;
 	result = exp(arg1);
 	if (errno == ERANGE && result != 0 && !isinf(result))
 		result = get_float8_infinity();
 
-	check_float8_val(result, isinf(arg1), false);
+	CHECKFLOATVAL(result, isinf(arg1), false);
 	PG_RETURN_FLOAT8(result);
 }
 
 
 /*
  *		dlog1			- returns the natural logarithm of arg1
  */
 Datum
 dlog1(PG_FUNCTION_ARGS)
 {
@@ -1573,21 +1573,21 @@ dlog1(PG_FUNCTION_ARGS)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_ARGUMENT_FOR_LOG),
  errmsg("cannot take logarithm of zero")));
 	if (arg1 < 0)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_ARGUMENT_FOR_LOG),
  errmsg("cannot take logarithm of a negative number")));
 
 	result = log(arg1);
 
-	check_float8_val(result, isinf(arg1), arg1 == 1);
+	CHECKFLOATVAL(result, isinf(arg1), arg1 == 1);
 	PG_RETURN_FLOAT8(result);
 }
 
 
 /*
  *		dlog10			- returns the base 10 logarithm of arg1
  */
 Datum
 dlog10(PG_FUNCTION_ARGS)
 {
@@ -1603,21 +1603,21 @@ dlog10(PG_FUNCTION_ARGS)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_ARGUMENT_FOR_LOG),
  errmsg("cannot take logarithm of zero")));
 	if (arg1 < 0)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_ARGUMENT_FOR_LOG),
  errmsg("cannot take logarithm of a negative number")));
 
 	result = 

Re: ALTER tbl rewrite loses CLUSTER ON index (consider moving indisclustered to pg_class)

2020-02-07 Thread Justin Pryzby
On Thu, Feb 06, 2020 at 02:24:47PM -0300, Alvaro Herrera wrote:
> On 2020-Feb-06, Justin Pryzby wrote:
> 
> > I wondered if it wouldn't be better if CLUSTER ON was stored in pg_class as 
> > the
> > Oid of a clustered index, rather than a boolean in pg_index.
> 
> Maybe.  Do you want to try a patch?

I think the attached is 80% complete (I didn't touch pg_dump).

One objection to this change would be that all relations (including indices)
end up with relclustered fields, and pg_index already has a number of bools, so
it's not like this one bool is wasting a byte.

I think relisclustered was a clever way of avoiding that overhead (c0ad5953).
So I would be -0.5 on moving it to pg_class..

But I think 0001 and 0002 are worthy.  Maybe the test in 0002 should live
somewhere else.
>From 7eea0a17e495fe13379ffd589b551f2f145f5672 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 6 Feb 2020 21:48:13 -0600
Subject: [PATCH v1 1/3] Update comment obsolete since b9b8831a

---
 src/backend/commands/cluster.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e9d7a7f..3adcbeb 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1539,9 +1539,9 @@ get_tables_to_cluster(MemoryContext cluster_context)
 
 	/*
 	 * Get all indexes that have indisclustered set and are owned by
-	 * appropriate user. System relations or nailed-in relations cannot ever
-	 * have indisclustered set, because CLUSTER will refuse to set it when
-	 * called with one of them as argument.
+	 * appropriate user. Shared relations cannot ever have indisclustered
+	 * set, because CLUSTER will refuse to set it when called with one as
+	 * an argument.
 	 */
 	indRelation = table_open(IndexRelationId, AccessShareLock);
 	ScanKeyInit(,
-- 
2.7.4

>From 4777be522a7aa8b8c77b13f765cbd02043438f2a Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 7 Feb 2020 08:12:50 -0600
Subject: [PATCH v1 2/3] Give developer a helpful kick in the pants if they
 change natts in one place but not another

---
 src/backend/bootstrap/bootstrap.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index bfc629c..d5e1888 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -25,7 +25,9 @@
 #include "access/xlog_internal.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
+#include "catalog/pg_class.h"
 #include "catalog/pg_collation.h"
+#include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
 #include "common/link-canary.h"
 #include "libpq/pqsignal.h"
@@ -49,6 +51,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/relmapper.h"
+#include "utils/syscache.h"
 
 uint32		bootstrap_data_checksum_version = 0;	/* No checksum */
 
@@ -602,6 +605,26 @@ boot_openrel(char *relname)
 	TableScanDesc scan;
 	HeapTuple	tup;
 
+	/* Check that pg_class data is consistent now, rather than failing obscurely later */
+	struct { Oid oid; int natts; }
+		checknatts[] = {
+		{RelationRelationId, Natts_pg_class,},
+		{TypeRelationId, Natts_pg_type,},
+		{AttributeRelationId, Natts_pg_attribute,},
+		{ProcedureRelationId, Natts_pg_proc,},
+	};
+
+	for (int i=0; irelnatts);
+		ReleaseSysCache(tuple);
+	}
+
 	if (strlen(relname) >= NAMEDATALEN)
 		relname[NAMEDATALEN - 1] = '\0';
 
-- 
2.7.4
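
(The loop body in the hunk above did not survive archiving intact; purely as a hedged sketch of what such a consistency check plausibly looks like -- not the author's exact code, and assuming the headers added earlier in this patch -- it would be along these lines:)

	for (int i = 0; i < lengthof(checknatts); i++)
	{
		HeapTuple	tuple;
		Form_pg_class classForm;

		/* look up the relation's pg_class row and compare attribute counts */
		tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(checknatts[i].oid));
		if (!HeapTupleIsValid(tuple))
			elog(ERROR, "cache lookup failed for relation %u", checknatts[i].oid);
		classForm = (Form_pg_class) GETSTRUCT(tuple);
		Assert(checknatts[i].natts == classForm->relnatts);
		ReleaseSysCache(tuple);
	}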

>From ed886f8202486dea8069b719d35a5d0db7f3277c Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 6 Feb 2020 12:56:34 -0600
Subject: [PATCH v1 3/3] Make cluster a property of table in pg_class..

..rather than of indexes in pg_index.

The only issue with this is that it makes pg_class larger, and the new column
applies not only to tables, but to indices.
---
 doc/src/sgml/catalogs.sgml |  14 +--
 src/backend/catalog/heap.c |   1 +
 src/backend/catalog/index.c|   6 --
 src/backend/commands/cluster.c | 172 +
 src/backend/commands/tablecmds.c   |   5 +-
 src/backend/utils/cache/relcache.c |   1 -
 src/bin/psql/describe.c|   4 +-
 src/include/catalog/pg_class.dat   |   2 +-
 src/include/catalog/pg_class.h |   3 +
 src/include/catalog/pg_index.h |   1 -
 10 files changed, 93 insertions(+), 116 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index a10b665..8efeaff 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1752,6 +1752,13 @@ SCRAM-SHA-256$iteration count:
  
 
  
+  relclustered
+  oid
+  
+  The OID of the index last clustered, or zero
+ 
+
+ 
   relpages
   int4
   
@@ -3808,13 +3815,6 @@ SCRAM-SHA-256$iteration count:
  
 
  
-  indisclustered
-  bool
-  
-  If true, the table was last clustered on this index
- 
-
- 
   indisvalid
   bool
   
diff --git 

Re: Is custom MemoryContext prohibited?

2020-02-07 Thread Robert Haas
On Wed, Feb 5, 2020 at 10:23 PM Andres Freund  wrote:
> I still don't get what reason there is to not use T_MemoryContext as the
> magic number, instead of something randomly new. It's really not
> problematic to expose those numerical values. And it means that the
> first bit of visual inspection is going to be the same as it always has
> been, and the same as it works for most other types one regularly
> inspects in postgres.

Without trying to say that my thought process is necessarily correct,
I can explain my thought process.

Many years ago, I had Tom slap down a patch of mine for making
something participate in the Node system unnecessarily. He pointed
out, then and at other times, that there was no reason for everything
to be part of the giant enum just because many things need to be. At
the time, I was a bit perplexed by his feedback, but over time I've
come to see the point. We've got lots of "enum" fields all over the
backend whose purpose it is to decide whether a particular object is
of one sub-type or another. We've also got NodeTag, which is that same
thing at a very large scale. I used to think that the reason why we
had jammed everything into NodeTag was just programming laziness or
some kind of desire for consistency, but now I think that the real
point is that making something a node allows it to participate in
readfuncs.c, outfuncs.c, copyfuncs.c, equalfuncs.c; so that a complex
data structure made entirely of nodes can be easily copied,
serialized, etc. The MemoryContext stuff participates in none of that
machinery, and it would be difficult to make it do so, and therefore
does not really need to be part of the Node system at all. The fact
that it *is* a part of the node system is just a historical accident,
or so I think. Sure, it's not an inconvenient thing to see a NodeTag
on random stuff that you're inspecting with a debugger, but if we took
that argument to its logical conclusion we would, I think, end up
needing to add node tags to a lot of stuff that doesn't have them now
- like TupleTableSlots, for example.

Also, as a general rule, every Node of a given type is expected to
have the same structure, which wouldn't be true here, because there
are multiple types of memory contexts that can exist, and
T_MemoryContext would identify them all. It's true that there are some
other weird exceptions, but it doesn't seem like a great idea to
create more.

Between those concerns, and those I wrote about in my last post, it
seemed to me that it made more sense to try to break the dependency
between palloc.h and nodes.h rather than to embrace it.
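
(To illustrate the direction being argued for here, this is only a hypothetical sketch with made-up names and values -- not the actual proposal or any committed design:)

/* Hypothetical: tag memory contexts with a private magic number so that
 * palloc.h/memutils.h need not depend on nodes.h at all. */
#define MEMORYCONTEXT_MAGIC		0x4D435458	/* arbitrary made-up constant */

typedef struct MemoryContextData
{
	uint32		magic;			/* replaces the NodeTag field */
	/* ... methods, parent/child links, context name, etc. ... */
} MemoryContextData;

#define MemoryContextIsValid(context) \
	((context) != NULL && \
	 ((const MemoryContextData *) (context))->magic == MEMORYCONTEXT_MAGIC)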

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: In PG12, query with float calculations is slower than PG11

2020-02-07 Thread Emre Hasegeli
> Fwiw, also tried the patch that Kuroda-san had posted yesterday.

I ran the same test case too:

clang version 7.0.0:

HEAD 2548.119 ms
with patch 2320.974 ms

clang version 8.0.0:

HEAD 2431.766 ms
with patch 2419.439 ms

clang version 9.0.0:

HEAD 2477.493 ms
with patch 2365.509 ms

gcc version 7.4.0:

HEAD 2451.261 ms
with patch 2343.393 ms

gcc version 8.3.0:

HEAD 2540.626 ms
with patch 2299.653 ms




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-07 Thread Kasahara Tatsuhito
On Fri, Feb 7, 2020 at 10:09 PM Fujii Masao  wrote:
> Pushed! Thanks!
Thanks Fujii.


--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: Is custom MemoryContext prohibited?

2020-02-07 Thread Robert Haas
On Wed, Feb 5, 2020 at 10:09 PM Andres Freund  wrote:
> I wasn't advocating for making plannodes.h etc frontend usable. I think
> that's a fairly different discussion than making enum NodeTag,
> pg_list.h, memutils.h available.  I don't see them having access to the
> numerical value of node tag for backend structs as something actually
> problematic (I'm pretty sure you can do that today already if you really
> want to - but why would you?).
>
> I don't buy that having a separate magic number for various types that
> we may want to use both frontend and backend is better than largely just
> having one set of such magic type identifiers.

To be honest, and I realize this is probably going to blow your mind
and/or make you think that I'm completely insane, one concern that I
have here is that I have seen multiple people fail to understand that
the frontend and backend are, ah, not the same process. And they try
to write code in frontend environments that makes no sense whatsoever
there. The fact that nodes.h could hypothetically be included in
frontend code doesn't really contribute to confusion in this area, but
I'm concerned that including it in every file might, because it means
that a whole lot of backend-only stuff suddenly becomes visible in any
code that anyone writes anywhere. And as things stand that would be the
effect of adding #include "utils/palloc.h" to "postgres_fe.h". Perhaps
I'm worrying too much.

On a broader level, I am not convinced that having one "enum" to rule
them all is a good design. If we go that direction, then it means that
frontend code that wants to add its own node types (and why
shouldn't it want to do that?) would have to have them be visible to
the backend and to all other frontend processes. That doesn't seem
like a disaster, but I don't think it's great. I also don't really
like the fact that we have one central registry of node types that has
to be edited to add more node types, because it means that extensions
are not free to do so. I know we're some distance from allowing any
real extensibility around new node types and perhaps we never will,
but on principle a centralized registry sucks, and I'd prefer a more
decentralized solution if we could find one that would be acceptable.
I don't know what that would be, though. Even though I'm not as
trenchant about debuggability as you and Tom, having a magic number at
the beginning of every type of node in lieu of an enum would drive me
nuts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




control max length of parameter values logged

2020-02-07 Thread Alexey Bashtanov

Hello

Patch ba79cb5 (for the full discussion see [1]) introduces a feature to log
bind parameter values on error, which greatly helps to reproduce errors
artificially when one has only the server log -- thanks to everyone who
reviewed and improved it!

However, it truncates the values, as some of the reviewers were worried the
log could fill up too quickly.  This applies both to the new case of logging
parameter values and to the existing ones triggered by
log_min_duration_statement or log_statement.

This is a backwards-incompatible change, and it also ruins the idea of
reproducible errors -- sorry Tom, I failed to second this idea [2] in time,
before the change was pushed.

I personally don't think that we necessarily need to truncate the values; we
can rely on users being careful when using this feature -- in the same way we
have trusted them to use the similarly dangerous log_min_duration_statement
and especially log_statement for ages.  At least that is better than having
no option to disable the truncation.  Alvaro's opinion was different [3].
What do you think of adding a parameter to limit the maximum logged parameter
length?  See the attached patch.


Best, Alex

[1] https://postgr.es/m/0146a67b-a22a-0519-9082-bc29756b9...@imap.cc
[2] https://postgr.es/m/11425.1575927321%40sss.pgh.pa.us
[3] https://postgr.es/m/20191209200531.GA19848@alvherre.pgsql
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec..0f40246c2d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6645,6 +6645,28 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
   
  
 
+ 
+  log_parameter_max_length (integer)
+  
+   log_parameter_max_length configuration parameter
+  
+  
+  
+   
+Controls whether to log parameters in full or cut them to a certain length.
+Bind parameters can appear in the log as a result of 
+,
+ or
+
+settings.
+Zero (the default) disables cutting.
+If this value is specified without units, it is taken as bytes.
+Due to multibyte character and ellipsis, truncated values may be slightly shorter.
+Only superusers can change this setting.
+   
+  
+ 
+
  
   log_statement (enum)
   
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0a6f80963b..acc31899d6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1839,13 +1839,22 @@ exec_bind_message(StringInfo input_message)
 		if (knownTextValues == NULL)
 			knownTextValues =
 palloc0(numParams * sizeof(char *));
-		/*
-		 * Note: must copy at least two more full characters
-		 * than BuildParamLogString wants to see; otherwise it
-		 * might fail to include the ellipsis.
-		 */
-		knownTextValues[paramno] =
-			pnstrdup(pstring, 64 + 2 * MAX_MULTIBYTE_CHAR_LEN);
+
+		if (log_parameter_max_length != 0)
+		{
+			/*
+			* Note: must copy at least two more full characters
+			* than BuildParamLogString wants to see;
+			* otherwise it might fail to include the ellipsis.
+			*/
+			knownTextValues[paramno] =
+pnstrdup(pstring,
+		 log_parameter_max_length
+			+ 2 * MAX_MULTIBYTE_CHAR_LEN);
+		}
+		else
+			knownTextValues[paramno] = pstrdup(pstring);
+
 		MemoryContextSwitchTo(oldcxt);
 	}
 	if (pstring != pbuf.data)
@@ -1908,7 +1917,9 @@ exec_bind_message(StringInfo input_message)
 		 */
 		if (log_parameters_on_error)
 			params->paramValuesStr =
-BuildParamLogString(params, knownTextValues, 64);
+BuildParamLogString(params,
+	knownTextValues,
+	log_parameter_max_length);
 	}
 	else
 		params = NULL;
@@ -2397,7 +2408,7 @@ errdetail_params(ParamListInfo params)
 	{
 		char   *str;
 
-		str = BuildParamLogString(params, NULL, 0);
+		str = BuildParamLogString(params, NULL, log_parameter_max_length);
 		if (str && str[0] != '\0')
 			errdetail("parameters: %s", str);
 	}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8228e1f390..f27efc6c24 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -543,6 +543,7 @@ int			log_min_messages = WARNING;
 int			client_min_messages = NOTICE;
 int			log_min_duration_sample = -1;
 int			log_min_duration_statement = -1;
+int			log_parameter_max_length = 0;
 int			log_temp_files = -1;
 double		log_statement_sample_rate = 1.0;
 double		log_xact_sample_rate = 0;
@@ -2834,6 +2835,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"log_parameter_max_length", PGC_SUSET, LOGGING_WHAT,
+			gettext_noop("When logging a parameter value, only log first N bytes."),
+			gettext_noop("Zero to print values in full."),
+			GUC_UNIT_BYTE
+		},
+		_parameter_max_length,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"bgwriter_delay", PGC_SIGHUP, RESOURCES_BGWRITER,
 			gettext_noop("Background writer sleep time between rounds."),
diff 

Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-07 Thread Fujii Masao




On 2020/02/07 17:28, Kasahara Tatsuhito wrote:

Hi,

On Fri, Feb 7, 2020 at 5:02 PM Fujii Masao  wrote:

BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
and currtid_byrelname() so that they also call table_beginscan().
I'm not sure what those functions are, but probably we should fix them
so that table_beginscan_tid() is called instead. Thought?

+1, sorry, I overlooked it.

Both functions are used to check whether a tid is valid or not, given a
relation name (or oid),
and both perform a tid scan internally.
So, these functions should call table_beginscan_tid().

Though perhaps unnecessary, I will attach a patch.


Pushed! Thanks!

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-02-07 Thread Etsuro Fujita
On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger  wrote:
> The patches apply and pass all tests.  A review of the patch vs. master looks 
> reasonable.

Thanks for the review!

> The partition_join.sql test has multiple levels of partitioning, but when 
> your patch extends that test with “advanced partition-wise join”, none of the 
> tables for the new section have multiple levels.  I spent a little while 
> reviewing the code and inventing multiple level partitioning tests for 
> advanced partition-wise join and did not encounter any problems.  I don’t 
> care whether you use this particular example, but do you want to have 
> multiple level partitioning in the new test section?

Yes, I do.

> CREATE TABLE alpha (a double precision, b double precision) PARTITION BY 
> RANGE (a);
> CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM ('-Infinity') TO 
> (0) PARTITION BY RANGE (b);
> CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO ('Infinity') 
> PARTITION BY RANGE (b);
> CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity') TO 
> ('NaN');
> CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM 
> ('-Infinity') TO (0);
> CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0) TO 
> ('Infinity');
> CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM 
> ('Infinity') TO ('NaN');
> CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM 
> ('-Infinity') TO (0);
> CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0) TO 
> ('Infinity');
> CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM 
> ('Infinity') TO ('NaN');
> INSERT INTO alpha (a, b)
> (SELECT * FROM
> (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), 
> ('Infinity'::float8)) a,
> (VALUES (-1.0::float8), (0.0::float8), (1.0::float8), 
> ('Infinity'::float8)) b
> );
> ANALYZE alpha;
> ANALYZE alpha_neg;
> ANALYZE alpha_pos;
> ANALYZE alpha_nan;
> ANALYZE alpha_neg_neg;
> ANALYZE alpha_neg_pos;
> ANALYZE alpha_neg_nan;
> ANALYZE alpha_pos_neg;
> ANALYZE alpha_pos_pos;
> ANALYZE alpha_pos_nan;
> CREATE TABLE beta (a double precision, b double precision) PARTITION BY RANGE 
> (a, b);
> CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0, 0);
> CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0, 5);
> CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5, 5);
> INSERT INTO beta (a, b)
> (SELECT * FROM
> (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
> (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
> );
> ANALYZE beta;
> ANALYZE beta_lo;
> ANALYZE beta_me;
> ANALYZE beta_hi;
> EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND alpha.b 
> = beta.b) WHERE alpha.a = 1 AND beta.b = 1;
>   QUERY PLAN
> ---
>  Nested Loop  (cost=0.00..2.11 rows=1 width=32)
>->  Seq Scan on alpha_pos_pos alpha  (cost=0.00..1.06 rows=1 width=16)
>  Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
>->  Seq Scan on beta_hi beta  (cost=0.00..1.04 rows=1 width=16)
>  Filter: ((b = '1'::double precision) AND (a = '1'::double precision))
> (5 rows)

Hmm, I'm not sure this is a good test case for that, because this
result would be due to partition pruning applied to each side of the
join before considering partition-wise join; you could get the same
result even with enable_partitionwise_join=off.  I think it's
important that the partition-wise join logic doesn't break this query,
though.

Best regards,
Etsuro Fujita




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-02-07 Thread Etsuro Fujita
On Thu, Feb 6, 2020 at 2:03 AM Mark Dilger  wrote:
> > On Feb 5, 2020, at 4:51 AM, Etsuro Fujita  wrote:

> >> On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger
> >>  wrote:
> >>> I have added tests checking correctness and showing some partition 
> >>> pruning limitations.  Find three patches, attached.
> >>>
> >>> The v31-0001-… patch merely applies your patches as a starting point for 
> >>> the next two patches.  It is your work, not mine.
> >>>
> >>> The v31-0002-… patch adds more regression tests for range partitioning.  
> >>> The commit message contains my comments about that.
> >>>
> >>> The v31-0003-… patch adds more regression tests for list partitioning, 
> >>> and again, the commit message contains my comments about that.

> > I tested the new test patches.  The patches are applied to the
> > partition_join.sql regression test cleanly, but the new tests failed
> > in my environment (even with make check LANG=C).  I think I should set
> > the right locale for the testing.  Which one did you use?
>
> I did not set a locale in the tests.  My environment has:
>
> LANG="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_CTYPE="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_ALL=

Thanks for the information!

> >  Maybe I'm
> > missing something, but let me comment on the new tests.  This is the
> > one proposed in the v31-0002 patch:
> >
> > +EXPLAIN (COSTS OFF)
> > +SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
> > WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
> > +QUERY PLAN
> > +--
> > + Hash Join
> > +   Hash Cond: (t2.a = t1.a)
> > +   ->  Append
> > + ->  Seq Scan on beta_a t2_1
> > + ->  Seq Scan on beta_b t2_2
> > + ->  Seq Scan on beta_c t2_3
> > + ->  Seq Scan on beta_d t2_4
> > + ->  Seq Scan on beta_e t2_5
> > + ->  Seq Scan on beta_f t2_6
> > + ->  Seq Scan on beta_g t2_7
> > + ->  Seq Scan on beta_h t2_8
> > + ->  Seq Scan on beta_default t2_9
> > +   ->  Hash
> > + ->  Append
> > +   ->  Seq Scan on alpha_e t1_1
> > + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
> > +   ->  Seq Scan on alpha_default t1_2
> > + Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
> > +(18 rows)
> >
> > The commit message says:
> >
> >When joining with
> >
> >alpha.a = beta.a and alpha.a IN ('äbç', 'ὀδυσσεύς')
> >
> >the planner does the right thing for one side of the query, but
> >hits all partitions for the other side, which it doesn't need to
> >do.
> >
> > Yeah, I agree that the resulting plan is not efficient.  The reason
> > for that would be that the planner doesn't generate a qual on the
> > outer side matching the ScalarArrayOpExpr qual "a = ANY
> > ('{äbç,ὀδυσσεύς}'::text[])" on the inner side, which I think would be
> > a restriction caused by the equivalence machinery not by the
> > partitionwise join logic IIUC.
>
> It’s fine if this is beyond the scope of the patch.
>
> >  I think the critique would be useful,
> > so I don't object to adding this test case, but the critique would be
> > more about query planning that is actually not related to the
> > partitionwise join logic, so I'm not sure that the partition_join.sql
> > regression test is the best place to add it.  I guess that there would
> > be much better places than partition_join.sql.
>
> You don’t need to add the test anywhere.  It’s enough for me that you looked 
> at it and considered whether the partition-wise join patch should do anything 
> differently in this case.  Again, it sounds like this is beyond the scope of 
> the patch.

OK

> > (This is nitpicking;
> > but another thing I noticed about this test case is that the join
> > query contains only a single join condition "t1.a = t2.a", but the
> > queried tables alpha and beta are range-partitioned by multiple
> > columns a and b, so the query should have a join condition for each of
> > the columns like "t1.a = t2.a AND t1.b = t2.b" if adding this as a
> > test case for partitionwise join.)
>
> Well, it is important that partition-wise join does not break such queries.  
> I added the column ‘b’ to the partitioning logic to verify that did not 
> confuse the code in your patch.

OK, thanks for the testing!

Best regards,
Etsuro Fujita




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-07 Thread Kuntal Ghosh
On Tue, Feb 4, 2020 at 2:40 PM Amit Kapila  wrote:
>
> I don't think we can just back-patch that part of code as it is linked
> to the way we are maintaining a cache (~8MB) for frequently allocated
> objects.  See the comments around the definition of
> max_cached_tuplebufs.  But probably, we can do something once we reach
> such a limit, basically, once we know that we have already allocated
> max_cached_tuplebufs number of tuples of size MaxHeapTupleSize, we
> don't need to allocate more of that size.  Does this make sense?
>

Yeah, this makes sense. I've attached a patch that implements the
same. It solves the problem reported earlier. This solution will at
least slow down the process of going OOM even for very small sized
tuples.
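
(A hedged sketch of that idea, not the attached patch itself -- roughly, in
ReorderBufferGetTupleBuf(), stop rounding small requests up to
MaxHeapTupleSize once the cache has been filled:)

	alloc_len = tuple_len + SizeofHeapTupleHeader;

	/* (sketch) only over-allocate to MaxHeapTupleSize while that still feeds
	 * the reuse cache; beyond max_cached_tuplebufs, allocate just what is
	 * needed so small tuples no longer pin ~8kB each */
	if (alloc_len < MaxHeapTupleSize &&
		rb->nr_cached_tuplebufs < max_cached_tuplebufs)
		alloc_len = MaxHeapTupleSize;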



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


0001-Restrict-memory-allocation-in-reorderbuffer-context.patch
Description: Binary data


Re: Internal key management system

2020-02-07 Thread Masahiko Sawada
On Fri, 7 Feb 2020 at 11:36, Andres Freund  wrote:
>
> Hi,
>
> On 2020-02-07 11:18:29 +0900, Masahiko Sawada wrote:
> > Another idea we discussed is to internally integrate pgcrypto with the
> > key management system.
>
> Perhaps this has already been discussed (I only briefly looked): I'd
> strongly advise against having any new infrastructure depend on
> pgcrypto. Its code quality imo is well below our standards and contains
> serious red flags like very outdated copies of cryptography algorithm
> implementations.  I think we should consider deprecating and removing
> it, not expanding its use.  It certainly shouldn't be involved in any
> potential disk encryption system at a later stage.

Thank you for the advice.

Yeah I'm not going to use pgcrypto for transparent data encryption.
The KMS patch includes the new basic infrastructure for cryptographic
functions (mainly AES-CBC). I'm thinking we can expand that
infrastructure so that we can also use it for TDE purpose by
supporting new cryptographic functions such as AES-CTR. Anyway, I
agree to not have it depend on pgcrypto.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: typedef SegmentNumber

2020-02-07 Thread Julien Rouhaud
On Fri, Feb 07, 2020 at 01:00:00PM +1300, Thomas Munro wrote:
> Hello,
>
> Here's a rebase of a refactoring patch that got lost behind a filing
> cabinet on another thread even though there seemed to be some
> agreement that we probably want something like this[1].  It introduces
> a new type SegmentNumber, instead of using BlockNumber to represent
> segment numbers.

+1, and looks good to me!
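
(The patch itself is not quoted above; presumably the new type is little more
than the following, mirroring BlockNumber -- an assumption, not a quote from
the patch:)

typedef uint32 SegmentNumber;	/* segment number within a relation fork */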




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-02-07 Thread Dilip Kumar
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila  wrote:
>
> On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar  wrote:
> >
> > Fixed in the latest version sent upthread.
> >
>
> Okay, thanks.  I haven't looked at the latest version of patch series
> as I was reviewing the previous version and I think all of these
> comments are in the patch which is not modified.  Here are my
> comments:
>
> I think we don't need to maintain
> v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> discussion in one of the above emails [1] as its usage is not clear.
>
> v8-0008-Add-support-for-streaming-to-built-in-replication
> 1.
> -  information.  The allowed options are slot_name and
> -  synchronous_commit
> +  information.  The allowed options are slot_name,
> +  synchronous_commit, work_mem
> +  and streaming.
>
> As per the discussion above [1], I don't think we need work_mem here.
> You might want to remove the other usage from the patch as well.

After putting more thought into this, it appears that there could be some
use cases for setting the work_mem from the subscription.  Assume a
case where data are coming from two different origins; based on the
origin ids, different slots might collect different types of changes.
So isn't it good to have a different work_mem for different slots?  I am
not saying that the current way of implementing it is the best one, but
we can improve it.  First, we need to decide whether we have a use
case for this or not.  Please let me know your thoughts on the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-02-07 Thread Julien Rouhaud
On Thu, Feb 06, 2020 at 02:59:09PM -0500, Robert Haas wrote:
> On Wed, Feb 5, 2020 at 9:32 AM Julien Rouhaud  wrote:
> > There's also the possibility to reserve 1 bit of the hash to know if
> > this is a utility command or not, although I don't recall right now
> > all the possible issues with utility commands and some special
> > handling of them.  I'll work on it before the next commitfest.
>
> FWIW, I don't really see why it would be bad to have 0 mean that
> "there's no query ID for some reason" without caring whether that's
> because the current statement is a utility statement or because
> there's no statement in progress at all or whatever else. The user
> probably doesn't need our help to distinguish between "no statement"
> and "utility statement", right?

Sure, but if we don't fix that it means that we also won't expose any queryid
for utility statements, even if pg_stat_statements is configured to track those
(with very poor queryid handling, but still).

While looking at this again, I realized that pg_stat_statements doesn't compute
a queryid during the post parse analysis hook just to make sure that no query
identifier will be set during executorStart and the rest of executor functions.

AFAICT, that can't happen anyway since pg_plan_queries() will discard any
computed queryid for utility statements.  This seems to be an oversight due to
original pg_stat_statements implementation, so I fixed this.

Then, as processUtility is called between parse analysis and executor, I think
that we can simply work around this by computing utility statements query
identifier during parse analysis, removing it in pgss_ProcessUtility and
keeping a copy of it for the pgss_store calls in that function, as done in the
attached v5.

This fixes everything except EXECUTE statements, which has to get the
underlying query's queryid.  The problem is that EXECUTE won't get through
parse analysis, so while it's correctly handled for execution and pgss_store,
it's not being exposed in pg_stat_activity and log_line_prefix.  To fix it, I
added an extra call to pgstat_report_queryid in executorStart.  As this
function is a no-op if a queryid is already exposed, this shouldn't cause any
harm, and it also fixes any other cases of query execution that don't go
through parse analysis.
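
(As a hedged illustration of that last point -- pgstat_report_queryid() is the
function the patch proposes, its one-argument call here is an assumption, and
the call site is paraphrased rather than quoted:)

	/* in standard_ExecutorStart(), roughly: re-report the plan's query ID so
	 * that statements skipping parse analysis (e.g. EXECUTE) are covered;
	 * assumed to be a no-op when a query ID is already exposed */
	pgstat_report_queryid(queryDesc->plannedstmt->queryId);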

Finally, DEALLOCATE is entirely ignored by pg_stat_statements, so those
statements will always be reported with a NULL/0 queryid, but this is
consistent as it's also not present in pg_stat_statements() SRF.
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c 
b/contrib/pg_stat_statements/pg_stat_statements.c
index 6f82a671ee..2da810ade6 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -112,6 +112,14 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM 
/ 100;
 
 #define JUMBLE_SIZE1024/* query serialization 
buffer size */
 
+/*
+ * Utility statements that pgss_ProcessUtility and pgss_post_parse_analyze
+ * ignores.
+ */
+#define PGSS_HANDLED_UTILITY(n)(!IsA(n, ExecuteStmt) && \
+   !IsA(n, 
PrepareStmt) && \
+   !IsA(n, 
DeallocateStmt))
+
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -308,7 +316,8 @@ static void pgss_ProcessUtility(PlannedStmt *pstmt, const 
char *queryString,

ProcessUtilityContext context, ParamListInfo params,

QueryEnvironment *queryEnv,
DestReceiver 
*dest, char *completionTag);
-static uint64 pgss_hash_string(const char *str, int len);
+static const char *pgss_clean_querytext(const char *query, int *location, int 
*len);
+static uint64 pgss_compute_utility_queryid(const char *query, int query_len);
 static void pgss_store(const char *query, uint64 queryId,
   int query_location, int query_len,
   double total_time, uint64 rows,
@@ -792,16 +801,34 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query)
return;
 
/*
-* Utility statements get queryId zero.  We do this even in cases where
-* the statement contains an optimizable statement for which a queryId
-* could be derived (such as EXPLAIN or DECLARE CURSOR).  For such 
cases,
-* runtime control will first go through ProcessUtility and then the
-* executor, and we don't want the executor hooks to do anything, since 
we
-* are already measuring the statement's costs at the utility level.
+* We compute a queryId now so that it can get exported in out
+* PgBackendStatus.  pgss_ProcessUtility will later discard it to 

Fix comment for max_cached_tuplebufs definition

2020-02-07 Thread Kuntal Ghosh
Hello Hackers,

While working on some issue in logical decoding, I found some
inconsistencies in the comment for defining max_cached_tuplebufs in
reorderbuffer.c. It only exists till PG10 because after that the
definition got removed by the generational memory allocator patch. The
variable is defined as follows in reorderbuffer.c:
static const Size max_cached_tuplebufs = 4096 * 2;  /* ~8MB */

And it gets compared with rb->nr_cached_tuplebufs in
ReorderBufferReturnTupleBuf as follows:
if (tuple->alloc_tuple_size == MaxHeapTupleSize &&
    rb->nr_cached_tuplebufs < max_cached_tuplebufs)
{
    rb->nr_cached_tuplebufs++;
}

So, what this variable actually allows is 4096 * 2 buffers of
MaxHeapTupleSize each, which is approximately 64MB of memory. I've
attached a patch to modify the comment.
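
(Back-of-the-envelope, assuming the default 8kB block size:)

	/* MaxHeapTupleSize ~ BLCKSZ - page header ~ 8160 bytes              */
	/* 4096 * 2 cached buffers * ~8160 bytes/buffer ~ 64MB, not ~8MB     */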

But I'm not sure whether the intention was to keep only an 8MB cache. If
so, I can come up with another patch.

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


0001-Fix-comment-for-max_cached_tuplebufs-definition.patch
Description: Binary data


Re: [HACKERS] WIP: Aggregation push-down

2020-02-07 Thread Antonin Houska
Richard Guo  wrote:

> Hi,
> 
> I've been looking at the 'make_join_rel()' part of the patch, and I'm
> aware that, if we are joining A and B, a 'grouped join rel (AB)' would
> be created besides the 'plain join rel (AB)', and paths are added by 1)
> applying partial aggregate to the paths of the 'plain join rel (AB)', or
> 2) joining grouped A to plain B or joining plain A to grouped B.
> 
> This is a smart idea. One issue I can see is that some logic would have
> to be repeated several times. For example, the validity check for the
> same proposed join would be performed at most three times by
> join_is_legal().
> 
> I'm thinking of another idea that, instead of using a separate
> RelOptInfo for the grouped rel, we add in RelOptInfo a
> 'grouped_pathlist' for the grouped paths, just like how we implement
> parallel query via adding 'partial_pathlist'.
> 
> For base rel, if we decide it can produce grouped paths, we create the
> grouped paths by applying partial aggregation to the paths in 'pathlist'
> and add them into 'grouped_pathlist'.
> 
> For join rel (AB), we can create the grouped paths for it by:
> 1) joining a grouped path from 'A->grouped_pathlist' to a plain path
> from 'B->pathlist', or vice versa;
> 2) applying partial aggregation to the paths in '(AB)->pathlist'.
> 
> This is basically the same idea, just moves the grouped paths from the
> grouped rel to a 'grouped_pathlist'. With it we should not need to make
> any code changes to 'make_join_rel()'. Most code changes would happen in
> 'add_paths_to_joinrel()'.
> 
> Will this idea work? Is it better or worse?

I tried such an approach in an earlier version of the patch [1], and I think the
reason was also to avoid repeated calls of functions like
join_is_legal(). However, there were objections against such an approach,
e.g. [2], and I admit that it was more invasive than what the current patch
version does.

Perhaps we can cache the result of join_is_legal() that we get for the plain
relation, and use it for the group relation. I'll consider that. Thanks.

[1] https://www.postgresql.org/message-id/18007.1513957437%40localhost
[2] 
https://www.postgresql.org/message-id/CA%2BTgmob8og%2B9HzMg1vM%2B3LwDm2f_bHUi9%2Bg1bqLDTjqpw5s%2BnQ%40mail.gmail.com

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: In PG12, query with float calculations is slower than PG11

2020-02-07 Thread Amit Langote
Fwiw, also tried the patch that Kuroda-san had posted yesterday.

On Fri, Feb 7, 2020 at 5:17 PM Amit Langote  wrote:
> Latency and profiling results:
>
> gcc 8 (gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3))
> 
>
> 11.6
>
> latency average = 463.968 ms
>
> 40.62%  postgres  postgres   [.] ExecInterpExpr
>  9.74%  postgres  postgres   [.] float8_accum
>  6.12%  postgres  libc-2.17.so   [.] __isinf
>  5.96%  postgres  postgres   [.] float8mul
>  5.33%  postgres  postgres   [.] dsqrt
>  3.90%  postgres  postgres   [.] ftod
>  3.53%  postgres  postgres   [.] Float8GetDatum
>  2.34%  postgres  postgres   [.] DatumGetFloat8
>  2.15%  postgres  postgres   [.] AggCheckCallContext
>  2.03%  postgres  postgres   [.] slot_deform_tuple
>  1.95%  postgres  libm-2.17.so   [.] __sqrt
>  1.19%  postgres  postgres   [.] check_float8_array
>
> HEAD
>
> latency average = 549.071 ms
>
> 31.74%  postgres  postgres   [.] ExecInterpExpr
> 11.02%  postgres  libc-2.17.so   [.] __isinf
> 10.58%  postgres  postgres   [.] float8_accum
>  4.84%  postgres  postgres   [.] check_float8_val
>  4.66%  postgres  postgres   [.] dsqrt
>  3.91%  postgres  postgres   [.] float8mul
>  3.56%  postgres  postgres   [.] ftod
>  3.26%  postgres  postgres   [.] Float8GetDatum
>  2.91%  postgres  postgres   [.] float8_mul
>  2.30%  postgres  postgres   [.] DatumGetFloat8
>  2.19%  postgres  postgres   [.] slot_deform_heap_tuple
>  1.81%  postgres  postgres   [.] AggCheckCallContext
>  1.31%  postgres  libm-2.17.so   [.] __sqrt
>  1.25%  postgres  postgres   [.] check_float8_array
>
> HEAD + patch
>
> latency average = 546.624 ms
>
> 33.51%  postgres  postgres   [.] ExecInterpExpr
> 10.35%  postgres  postgres   [.] float8_accum
> 10.06%  postgres  libc-2.17.so   [.] __isinf
>  4.58%  postgres  postgres   [.] dsqrt
>  4.14%  postgres  postgres   [.] check_float8_val
>  4.03%  postgres  postgres   [.] ftod
>  3.54%  postgres  postgres   [.] float8mul
>  2.96%  postgres  postgres   [.] Float8GetDatum
>  2.38%  postgres  postgres   [.] slot_deform_heap_tuple
>  2.23%  postgres  postgres   [.] DatumGetFloat8
>  2.09%  postgres  postgres   [.] float8_mul
>  1.88%  postgres  postgres   [.] AggCheckCallContext
>  1.65%  postgres  libm-2.17.so   [.] __sqrt
>  1.22%  postgres  postgres   [.] check_float8_array

HEAD + Kuroda-san's patch (compiled with gcc 8)

latency average = 484.604 ms

37.41%  postgres  postgres[.] ExecInterpExpr
10.83%  postgres  postgres[.] float8_accum
 5.62%  postgres  postgres[.] dsqrt
 4.23%  postgres  libc-2.17.so[.] __isinf
 4.05%  postgres  postgres[.] float8mul
 3.85%  postgres  postgres[.] ftod
 3.18%  postgres  postgres[.] Float8GetDatum
 2.81%  postgres  postgres[.] slot_deform_heap_tuple
 2.63%  postgres  postgres[.] DatumGetFloat8
 2.46%  postgres  postgres[.] float8_mul
 1.91%  postgres  libm-2.17.so[.] __sqrt

> clang-7 (clang version 7.0.1 (tags/RELEASE_701/final))
> =
>
> 11.6
>
> latency average = 419.014 ms
>
> 47.57%  postgres  postgres   [.] ExecInterpExpr
>  7.99%  postgres  postgres   [.] float8_accum
>  5.96%  postgres  postgres   [.] dsqrt
>  4.88%  postgres  postgres   [.] float8mul
>  4.23%  postgres  postgres   [.] ftod
>  3.30%  postgres  postgres   [.] slot_deform_tuple
>  3.19%  postgres  postgres   [.] DatumGetFloat8
>  1.92%  postgres  libm-2.17.so   [.] __sqrt
>  1.72%  postgres  postgres   [.] check_float8_array
>
> HEAD
>
> latency average = 452.958 ms
>
> 40.55%  postgres  postgres   [.] ExecInterpExpr
> 10.61%  postgres  postgres   [.] float8_accum
>  4.58%  postgres  postgres   [.] dsqrt
>  3.59%  postgres  postgres   [.] slot_deform_heap_tuple
>  3.54%  postgres  postgres   [.] check_float8_val
>  3.48%  postgres  postgres   [.] ftod
>  3.42%  postgres  postgres   [.] float8mul
>  3.22%  postgres  postgres   [.] DatumGetFloat8
>  2.69%  postgres  postgres   [.] Float8GetDatum
>  2.46%  postgres  postgres   [.] float8_mul
>  2.29%  postgres  libm-2.17.so   [.] __sqrt
>  1.47%  postgres  postgres   [.] check_float8_array
>
> HEAD + patch
>
> latency average = 452.533 ms
>
> 41.05%  postgres  postgres   [.] ExecInterpExpr
> 10.15%  

Re: Assumptions about the number of parallel workers

2020-02-07 Thread Antonin Houska
Andres Freund  wrote:

> Hi,
> 
> On 2020-02-05 10:50:05 +0100, Antonin Houska wrote:
> > I can't figure out why ExecGather/ExecGatherMerge do check whether 
> > num_workers
> > is non-zero. I think the code would be a bit clearer if these tests were
> > replaced with Assert() statements, as the attached patch does.
> 
> It's probably related to force_parallel_mode. With that we'll IIRC
> generate gather nodes even if num_workers == 0.

Those Gather nodes still have non-zero num_workers, see this part of
standard_planner:

if (force_parallel_mode != FORCE_PARALLEL_OFF && top_plan->parallel_safe)
{
...
gather->num_workers = 1;
gather->single_copy = true;

Also, if num_workers were zero for any reason, my patch would probably make
the regression tests fail. However, I haven't noticed any assertion failure.
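
(For reference, the proposed simplification amounts to something like the
following in ExecGather() -- a paraphrase of the patch's intent, not its text;
ExecGatherMerge() would be analogous, and "node" is the GatherState:)

	Gather	   *gather = (Gather *) node->ps.plan;

	/* replaces the run-time "if (gather->num_workers > 0 && ...)" test */
	Assert(gather->num_workers > 0);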

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: ALTER tbl rewrite loses CLUSTER ON index

2020-02-07 Thread Amit Langote
On Fri, Feb 7, 2020 at 2:24 AM Alvaro Herrera  wrote:
> On 2020-Feb-06, Justin Pryzby wrote:
> > I wondered if it wouldn't be better if CLUSTER ON was stored in pg_class as 
> > the
> > Oid of a clustered index, rather than a boolean in pg_index.
>
> Maybe.  Do you want to try a patch?

+1

Thanks,
Amit




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-07 Thread Kyotaro Horiguchi
At Fri, 7 Feb 2020 17:01:59 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/02/07 15:07, Kasahara Tatsuhito wrote:
> > Hi,
> > On Fri, Feb 7, 2020 at 1:27 PM Kyotaro Horiguchi
> >  wrote:
> >> It seems that nkeys and key are useless. Since every table_beginscan_*
> >> functions have distinct parameter sets, don't we remove them from
> >> table_beginscan_tid?
> > Yeah, actually, when calling table_beginscan_tid(), nkeys is set to 0
> > and * key is set to NULL,
> > so these are not used at the moment.
> > I removed unnecessary arguments from table_beginscan_tid().
> > Attached the v5 patch.
> 
> Thanks for updating the patch! The patch looks good to me.
> So barring any objection, I will push it and back-patch to v12 *soon*
> so that the upcoming minor version can contain it.
> 
> BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
> and currtid_byrelname() so that they also call table_beginscan().
> I'm not sure what those functions are, but probably we should fix them
> so that table_beginscan_tid() is called instead. Thought?

At least they don't seem to need table_beginscan(), to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-07 Thread Kasahara Tatsuhito
Hi,

On Fri, Feb 7, 2020 at 5:02 PM Fujii Masao  wrote:
> BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
> and currtid_byrelname() so that they also call table_beginscan().
> I'm not sure what those functions are, but probably we should fix them
> so that table_beginscan_tid() is called instead. Thought?
+1, sorry, I overlooked it.

Both functions are used to check whether a tid is valid or not, given a
relation name (or oid),
and both perform a tid scan internally.
So, these functions should call table_beginscan_tid().

Though perhaps unnecessary, I will attach a patch.
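
(The attached patch is not reproduced here; as a hedged sketch, the change in
currtid_byreloid()/currtid_byrelname() presumably boils down to something like
this, using the two-argument form of table_beginscan_tid() from the v5 patch
above -- the variable names are illustrative:)

	scan = table_beginscan_tid(rel, snapshot);	/* instead of table_beginscan(rel, snapshot, 0, NULL) */
	/* ... fetch the tuple by TID as before ... */
	table_endscan(scan);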

Best regards,
-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com


fix_tidscan_issues_v6.patch
Description: Binary data


Re: setLastTid() and currtid()

2020-02-07 Thread Fujii Masao
On Fri, Apr 12, 2019 at 1:44 PM Michael Paquier  wrote:
>
> On Thu, Apr 11, 2019 at 02:06:13PM -0400, Tom Lane wrote:
> > Yeah, if we could simplify the tableam API, that would be sufficient
> > reason to remove the stuff in v12, IMO.  But if it is outside of that
> > API, I'd counsel waiting till v13.
>
> Yes, I agree that simplifying the table AM interface would be a reason
> fine enough to delete this code for v12.  If not, v13 sounds better at
> this stage.

Now that we are in the development cycle for v13, is it time to rip the functions out?

Regards,

-- 
Fujii Masao




Re: In PG12, query with float calculations is slower than PG11

2020-02-07 Thread Amit Langote
On Fri, Feb 7, 2020 at 4:54 PM Andres Freund  wrote:
> On February 6, 2020 11:42:30 PM PST, keisuke kuroda 
>  wrote:
> >Hi,
> >
> >I have been testing with newer compiler (clang-7)
> >and result is a bit different at least with clang-7.
> >Compiling PG 12.1 (even without patch) with clang-7
> >results in __isinf() no longer being a bottleneck,
> >that is, you don't see it in profiler at all.
>
> I don't think that's necessarily the right conclusion. What's quite possibly 
> happening is that you do not see the external isinf function anymore, because 
> it is implemented as an intrinsic,  but that there still are more 
> computations being done. Due to the changed order of the isinf checks. You'd 
> have to compare with 11 using the same compiler.

I did some tests using two relatively recent compilers: gcc 8 and
clang-7 and here are the results:

Setup:

create table realtest (a real, b real, c real, d real, e real);
insert into realtest select i, i, i, i, i from generate_series(1, 100) i;

Test query:

/tmp/query.sql
select avg(2*dsqrt(a)), avg(2*dsqrt(b)), avg(2*dsqrt(c)),
avg(2*dsqrt(d)), avg(2*dsqrt(e)) from realtest;

pgbench -n -T 60 -f /tmp/query.sql

Latency and profiling results:

gcc 8 (gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3))


11.6

latency average = 463.968 ms

40.62%  postgres  postgres   [.] ExecInterpExpr
 9.74%  postgres  postgres   [.] float8_accum
 6.12%  postgres  libc-2.17.so   [.] __isinf
 5.96%  postgres  postgres   [.] float8mul
 5.33%  postgres  postgres   [.] dsqrt
 3.90%  postgres  postgres   [.] ftod
 3.53%  postgres  postgres   [.] Float8GetDatum
 2.34%  postgres  postgres   [.] DatumGetFloat8
 2.15%  postgres  postgres   [.] AggCheckCallContext
 2.03%  postgres  postgres   [.] slot_deform_tuple
 1.95%  postgres  libm-2.17.so   [.] __sqrt
 1.19%  postgres  postgres   [.] check_float8_array

HEAD

latency average = 549.071 ms

31.74%  postgres  postgres   [.] ExecInterpExpr
11.02%  postgres  libc-2.17.so   [.] __isinf
10.58%  postgres  postgres   [.] float8_accum
 4.84%  postgres  postgres   [.] check_float8_val
 4.66%  postgres  postgres   [.] dsqrt
 3.91%  postgres  postgres   [.] float8mul
 3.56%  postgres  postgres   [.] ftod
 3.26%  postgres  postgres   [.] Float8GetDatum
 2.91%  postgres  postgres   [.] float8_mul
 2.30%  postgres  postgres   [.] DatumGetFloat8
 2.19%  postgres  postgres   [.] slot_deform_heap_tuple
 1.81%  postgres  postgres   [.] AggCheckCallContext
 1.31%  postgres  libm-2.17.so   [.] __sqrt
 1.25%  postgres  postgres   [.] check_float8_array

HEAD + patch

latency average = 546.624 ms

33.51%  postgres  postgres   [.] ExecInterpExpr
10.35%  postgres  postgres   [.] float8_accum
10.06%  postgres  libc-2.17.so   [.] __isinf
 4.58%  postgres  postgres   [.] dsqrt
 4.14%  postgres  postgres   [.] check_float8_val
 4.03%  postgres  postgres   [.] ftod
 3.54%  postgres  postgres   [.] float8mul
 2.96%  postgres  postgres   [.] Float8GetDatum
 2.38%  postgres  postgres   [.] slot_deform_heap_tuple
 2.23%  postgres  postgres   [.] DatumGetFloat8
 2.09%  postgres  postgres   [.] float8_mul
 1.88%  postgres  postgres   [.] AggCheckCallContext
 1.65%  postgres  libm-2.17.so   [.] __sqrt
 1.22%  postgres  postgres   [.] check_float8_array


clang-7 (clang version 7.0.1 (tags/RELEASE_701/final))
=

11.6

latency average = 419.014 ms

47.57%  postgres  postgres   [.] ExecInterpExpr
 7.99%  postgres  postgres   [.] float8_accum
 5.96%  postgres  postgres   [.] dsqrt
 4.88%  postgres  postgres   [.] float8mul
 4.23%  postgres  postgres   [.] ftod
 3.30%  postgres  postgres   [.] slot_deform_tuple
 3.19%  postgres  postgres   [.] DatumGetFloat8
 1.92%  postgres  libm-2.17.so   [.] __sqrt
 1.72%  postgres  postgres   [.] check_float8_array

HEAD

latency average = 452.958 ms

40.55%  postgres  postgres   [.] ExecInterpExpr
10.61%  postgres  postgres   [.] float8_accum
 4.58%  postgres  postgres   [.] dsqrt
 3.59%  postgres  postgres   [.] slot_deform_heap_tuple
 3.54%  postgres  postgres   [.] check_float8_val
 3.48%  postgres  postgres   [.] ftod
 3.42%  postgres  postgres   [.] float8mul
 3.22%  postgres  postgres   [.] DatumGetFloat8
 2.69%  postgres  postgres   [.] Float8GetDatum
 2.46%  postgres  postgres   [.] float8_mul
 2.29%  postgres  libm-2.17.so   [.] __sqrt
 

Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-07 Thread Fujii Masao




On 2020/02/07 15:07, Kasahara Tatsuhito wrote:

Hi,

On Fri, Feb 7, 2020 at 1:27 PM Kyotaro Horiguchi
 wrote:

It seems that nkeys and key are useless. Since every table_beginscan_*
functions have distinct parameter sets, don't we remove them from
table_beginscan_tid?

Yeah, actually, when calling table_beginscan_tid(), nkeys is set to 0
and * key is set to NULL,
so these are not used at the moment.

I removed unnecessary arguments from table_beginscan_tid().

Attached the v5 patch.


Thanks for updating the patch! The patch looks good to me.
So barring any objection, I will push it and back-patch to v12 *soon*
so that the upcoming minor version can contain it.

BTW, commit 147e3722f7 causing the issue changed currtid_byreloid()
and currtid_byrelname() so that they also call table_beginscan().
I'm not sure what those functions are, but probably we should fix them
so that table_beginscan_tid() is called instead. Thought?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters