procedures and plpgsql PERFORM

2017-12-13 Thread Ashutosh Bapat
Hi,
We allow a function to be invoked as part of a PERFORM statement in plpgsql:
do $$
begin perform pg_relation_size('t1'); end; $$ language plpgsql;
DO

But we do not allow a procedure to be invoked this way
 create procedure dummy_proc(a int) as $$
begin null; end;
$$ language plpgsql;
CREATE PROCEDURE

do $$
begin perform dummy_proc(1); end; $$ language plpgsql;
ERROR:  dummy_proc(integer) is a procedure
LINE 1: SELECT dummy_proc(1)
   ^
HINT:  To call a procedure, use CALL.
QUERY:  SELECT dummy_proc(1)
CONTEXT:  PL/pgSQL function inline_code_block line 2 at PERFORM

The documentation of PERFORM [1] says
"For any SQL command that does not return rows, for example INSERT
without a RETURNING clause, you can execute the command within a
PL/pgSQL function just by writing the command."

Procedures fit that category, and like functions, I think we should
allow them to be invoked directly, without any quoting or CALL
decoration.

[1] 
https://www.postgresql.org/docs/devel/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-NORESULT
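
For illustration, at the SQL level the procedure can currently be invoked
only via CALL, which is exactly what PERFORM's implicit SELECT runs into
(a minimal sketch using the dummy_proc created above):

call dummy_proc(1);    -- works
select dummy_proc(1);  -- ERROR:  dummy_proc(integer) is a procedure
                       -- HINT:  To call a procedure, use CALL.
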
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Walsender timeouts and large transactions

2017-12-13 Thread Craig Ringer
On 7 December 2017 at 01:22, Petr Jelinek 
wrote:

> On 05/12/17 21:07, Robert Haas wrote:
> >
> > Generally we write if (a && b) { ... } not if (a) { if (b) .. }
> >
>
> It's rather ugly with && because one of the conditions is two lines, but
> okay, here you go. I am keeping the brackets even though we normally don't
> for one-liners, because it's completely unreadable without them IMHO.
>
>

Yeah, that's why I passed on that, FWIW. Sometimes breaking up a condition
is nice. Personally, I intensely dislike the convention of


if (big_condition
    && big_condition)
    one_liner_do_something;


finding it awfully unreadable, but I guess code convention means you live
with things you don't like.


Anyway, I've just hit this bug in the wild for the umpteenth time this
year, and I'd like to know what I can do to help progress it to
commit+backport.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL

2017-12-13 Thread Justin Pryzby
On Thu, Dec 14, 2017 at 08:51:06AM +0700, Ali Akbar wrote:
> 2017-12-13 15:37 GMT+07:00 Amit Langote :
> 
> > On 2017/12/13 15:59, Ali Akbar wrote:
> > >
> > > Thanks for the link to those thread.
> > >
> > > Judging from the discussion there, it will be a long way to prevent DROP
> > > NOT NULL.
> >
> > Yeah, I remembered that discussion when writing my email, but was for some
> > reason convinced that everything's fine even without the elaborate
> > book-keeping of inheritance information for NOT NULL constraints.  Thanks
> > Michael for reminding.
> >
> 
> Patch for adding a check in pg_upgrade. Going through the git history, the
> check for inconsistency in NOT NULL constraints has been there for a long
> time. In this patch the check is applied to all old cluster versions. I'm
> not sure in which version table inheritance was introduced.

Here are some spelling and grammar fixes to that patch:

but nullabe in its children: nullable
child column is not market: marked
with adding: by adding
and restart: and restarting
the problem columns: the problematic columns
9.5, 9.6, 10: 9.5, 9.6, and 10
restore, that will cause error.: restore phase of pg_upgrade, that would cause an error.

Justin



pg_(total_)relation_size and partitioned tables

2017-12-13 Thread Amit Langote
Hi.

You may have guessed from $subject that the two don't work together.

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1, 2) partition by list (a);
create table p11 partition of p1 for values in (1);
create table p12 partition of p1 for values in (2);
insert into p select i % 2 + 1 from generate_series(1, 1000) i;

On HEAD:

select pg_relation_size('p');
 pg_relation_size
------------------
                0
(1 row)

select pg_total_relation_size('p');
 pg_total_relation_size
-------------------------
                       0
(1 row)

After applying the attached patch:

select pg_relation_size('p');
 pg_relation_size
------------------
            49152
(1 row)

select pg_total_relation_size('p');
 pg_total_relation_size
-------------------------
                   98304
(1 row)

The patch basically accumulates and returns the sizes of all leaf
partitions when passed a partitioned table.  Will add it to next CF.
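
In the meantime, a comparable figure can be computed manually by walking
pg_inherits and summing the leaf sizes -- a sketch only, not part of the
patch:

with recursive parts (relid) as (
    select 'p'::regclass::oid
  union all
    select i.inhrelid
    from pg_inherits i
    join parts pt on i.inhparent = pt.relid
)
select sum(pg_relation_size(relid)) from parts;

With the example tables above this should report the same 49152 as the
patched pg_relation_size('p').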

Thanks,
Amit
From 4a5db12ad0155e043500d87caf2cc9438eb87e23 Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 14 Dec 2017 13:56:08 +0900
Subject: [PATCH] Teach dbsize.c functions about partitioned tables

---
 src/backend/utils/adt/dbsize.c | 101 ++---
 1 file changed, 85 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 58c0b01bdc..145f511c6c 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -17,6 +17,7 @@
 #include "access/htup_details.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_tablespace.h"
 #include "commands/dbcommands.h"
@@ -307,13 +308,17 @@ calculate_relation_size(RelFileNode *rfn, BackendId 
backend, ForkNumber forknum)
return totalsize;
 }
 
-Datum
-pg_relation_size(PG_FUNCTION_ARGS)
+/*
+ * Computes and stores in *size the file size of the specified fork of the
+ * relation with OID relOid.
+ *
+ * Returns true if size was successfully calculated and stored in *size, false
+ * otherwise.
+ */
+static bool
+pg_relation_size_internal(Oid relOid, char *forkName, int64 *size)
 {
-   Oid relOid = PG_GETARG_OID(0);
-   text   *forkName = PG_GETARG_TEXT_PP(1);
Relationrel;
-   int64   size;
 
rel = try_relation_open(relOid, AccessShareLock);
 
@@ -321,17 +326,52 @@ pg_relation_size(PG_FUNCTION_ARGS)
 * Before 9.2, we used to throw an error if the relation didn't exist, 
but
 * that makes queries like "SELECT pg_relation_size(oid) FROM pg_class"
 * less robust, because while we scan pg_class with an MVCC snapshot,
-* someone else might drop the table. It's better to return NULL for
-* already-dropped tables than throw an error and abort the whole query.
+* someone else might drop the table. It's better to return that we
+* couldn't get the size for already-dropped tables than throw an error
+* and abort the whole query.
 */
if (rel == NULL)
-   PG_RETURN_NULL();
+   return false;
+
+   /*
+* For a partitioned table, get the size by accumulating that of its
+* leaf partitions.
+*/
+   if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+   {
+   PartitionDesc   partdesc = rel->rd_partdesc;
+   int i;
 
-   size = calculate_relation_size(&(rel->rd_node), rel->rd_backend,
-  
forkname_to_number(text_to_cstring(forkName)));
+   *size = 0;
+   for (i = 0; i < partdesc->nparts; i++)
+   {
+   int64   partsize;
+
+   if (pg_relation_size_internal(partdesc->oids[i], forkName,
+ &partsize))
+   *size += partsize;
+   }
+   }
+   else
+   *size = calculate_relation_size(&(rel->rd_node), 
rel->rd_backend,
+   
forkname_to_number(forkName));
 
relation_close(rel, AccessShareLock);
 
+   return true;
+}
+
+Datum
+pg_relation_size(PG_FUNCTION_ARGS)
+{
+   Oid relOid = PG_GETARG_OID(0);
+   text   *forkName = PG_GETARG_TEXT_PP(1);
+   int64   size;
+
+   if (!pg_relation_size_internal(relOid, text_to_cstring(forkName),
+  &size))
+   PG_RETURN_NULL();
+
PG_RETURN_INT64(size);
 }
 
@@ -508,22 +548,51 @@ calculate_total_relation_size(Relation rel)
return size;
 }
 
-Datum
-pg_total_relation_size(PG_FUNCTION_ARGS)
+static int64
+pg_total_relation_size_internal(Oid 

incorrect error message, while dropping PROCEDURE

2017-12-13 Thread Rushabh Lathia
Hi,

Currently, if someone tries to drop a PROCEDURE without having the
required privilege or without being its owner, the error message still
refers to the object as a FUNCTION.

Example:

postgres@37737=#drop procedure pro;
ERROR:  must be owner of function pro
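
The same mislabeling shows up on the EXECUTE-permission path that the
patch also touches (a sketch mirroring the regression test it updates;
the unprivileged role regress_user1 is just an example):

create procedure ptest1(x text) as $$
begin null; end;
$$ language plpgsql;
revoke execute on procedure ptest1(text) from public;
set role regress_user1;
call ptest1('a');
ERROR:  permission denied for function ptest1   -- should say "procedure"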

This doesn't look correct, especially now that we have a separate
object type, OBJECT_PROCEDURE.  It seems we need to introduce a new
AclObjectKind for PROCEDURE and make the necessary changes to pass
the correct AclObjectKind.

PFA a patch which introduces a new AclObjectKind (ACL_KIND_PROCEDURE),
adds a message for the new AclObjectKind, and passes it through at the
appropriate places.

It also updates the necessary "make check" expected output.

Regards,

Thanks,
Rushabh Lathia
www.EnterpriseDB.com
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index e481cf3..680ef18 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -257,6 +257,7 @@ restrict_and_check_grant(bool is_grant, AclMode avail_goptions, bool all_privs,
 			whole_mask = ACL_ALL_RIGHTS_DATABASE;
 			break;
 		case ACL_KIND_PROC:
+		case ACL_KIND_PROCEDURE:
 			whole_mask = ACL_ALL_RIGHTS_FUNCTION;
 			break;
 		case ACL_KIND_LANGUAGE:
@@ -2565,7 +2566,8 @@ ExecGrant_Function(InternalGrant *istmt)
 		this_privileges =
 			restrict_and_check_grant(istmt->is_grant, avail_goptions,
 	 istmt->all_privs, istmt->privileges,
-	 funcId, grantorId, ACL_KIND_PROC,
+	 funcId, grantorId,
+	 (istmt->objtype == ACL_OBJECT_PROCEDURE) ? ACL_KIND_PROCEDURE : ACL_KIND_PROC,
 	 NameStr(pg_proc_tuple->proname),
 	 0, NULL);
 
@@ -3360,6 +3362,8 @@ static const char *const no_priv_msg[MAX_ACL_KIND] =
 	gettext_noop("permission denied for database %s"),
 	/* ACL_KIND_PROC */
 	gettext_noop("permission denied for function %s"),
+	/* ACL_KIND_PROCEDURE */
+	gettext_noop("permission denied for procedure %s"),
 	/* ACL_KIND_OPER */
 	gettext_noop("permission denied for operator %s"),
 	/* ACL_KIND_TYPE */
@@ -3412,6 +3416,8 @@ static const char *const not_owner_msg[MAX_ACL_KIND] =
 	gettext_noop("must be owner of database %s"),
 	/* ACL_KIND_PROC */
 	gettext_noop("must be owner of function %s"),
+	/* ACL_KIND_PROCEDURE */
+	gettext_noop("must be owner of procedure %s"),
 	/* ACL_KIND_OPER */
 	gettext_noop("must be owner of operator %s"),
 	/* ACL_KIND_TYPE */
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 9553675..d67a68b 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -2261,7 +2261,7 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address,
 		case OBJECT_PROCEDURE:
 		case OBJECT_ROUTINE:
 			if (!pg_proc_ownercheck(address.objectId, roleid))
-aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_PROC,
+aclcheck_error(ACLCHECK_NOT_OWNER, objtype == OBJECT_PROCEDURE ? ACL_KIND_PROCEDURE : ACL_KIND_PROC,
 			   NameListToString((castNode(ObjectWithArgs, object))->objname));
 			break;
 		case OBJECT_OPERATOR:
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index 2a9c901..7541146 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -2270,7 +2270,7 @@ ExecuteCallStmt(ParseState *pstate, CallStmt *stmt)
 
 	aclresult = pg_proc_aclcheck(fexpr->funcid, GetUserId(), ACL_EXECUTE);
 	if (aclresult != ACLCHECK_OK)
-		aclcheck_error(aclresult, ACL_KIND_PROC, get_func_name(fexpr->funcid));
+		aclcheck_error(aclresult, ACL_KIND_PROCEDURE, get_func_name(fexpr->funcid));
 	InvokeFunctionExecuteHook(fexpr->funcid);
 
 	nargs = list_length(fexpr->args);
diff --git a/src/include/utils/acl.h b/src/include/utils/acl.h
index 254a811..44037a0 100644
--- a/src/include/utils/acl.h
+++ b/src/include/utils/acl.h
@@ -191,6 +191,7 @@ typedef enum AclObjectKind
 	ACL_KIND_SEQUENCE,			/* pg_sequence */
 	ACL_KIND_DATABASE,			/* pg_database */
 	ACL_KIND_PROC,/* pg_proc */
+	ACL_KIND_PROCEDURE,			/* pg_proc procedure */
 	ACL_KIND_OPER,/* pg_operator */
 	ACL_KIND_TYPE,/* pg_type */
 	ACL_KIND_LANGUAGE,			/* pg_language */
diff --git a/src/test/regress/expected/create_procedure.out b/src/test/regress/expected/create_procedure.out
index 5538ef2..9a5be38 100644
--- a/src/test/regress/expected/create_procedure.out
+++ b/src/test/regress/expected/create_procedure.out
@@ -73,7 +73,7 @@ GRANT INSERT ON cp_test TO regress_user1;
 REVOKE EXECUTE ON PROCEDURE ptest1(text) FROM PUBLIC;
 SET ROLE regress_user1;
 CALL ptest1('a');  -- error
-ERROR:  permission denied for function ptest1
+ERROR:  permission denied for procedure ptest1
 RESET ROLE;
 GRANT EXECUTE ON PROCEDURE ptest1(text) TO regress_user1;
 SET ROLE regress_user1;
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index e6994f0..c92993d 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -696,7 +696,7 @@ ERROR:  

Re: Protect syscache from bloating with negative cache entries

2017-12-13 Thread Kyotaro HORIGUCHI
At Fri, 1 Dec 2017 14:12:20 -0800, Andres Freund  wrote in 
<20171201221220.z5e6wtlpl264w...@alap3.anarazel.de>
> On 2017-12-01 17:03:28 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
> > >> I have no faith in either of these proposals, because they both assume
> > >> that the problem only arises over the course of many minutes.  In the
> > >> recent complaint about pg_dump causing relcache bloat, it probably does
> > >> not take nearly that long for the bloat to occur.
> > 
> > > To me that's a bit of a different problem than what I was discussing
> > > here.  It also actually doesn't seem that hard - if your caches are
> > > growing fast, you'll continually get hash-resizing of the
> > > various. Adding cache-pruning to the resizing code doesn't seem hard,
> > > and wouldn't add meaningful overhead.
> > 
> > That's an interesting way to think about it, as well, though I'm not
> > sure it's quite that simple.  If you tie this to cache resizing then
> > the cache will have to grow up to the newly increased size before
> > you'll prune it again.  That doesn't sound like it will lead to nice
> > steady-state behavior.
> 
> Yea, it's not perfect - but if we do pruning both at resize *and* on
> regular intervals, like once-a-minute as I was suggesting, I don't think
> it's that bad. The steady state won't be reached within seconds, true,
> but the negative consequences of only attempting to shrink the cache
> upon resizing when the cache size is growing fast anyway doesn't seem
> that large.
> 
> I don't think we need to be super accurate here, there just needs to be
> *some* backpressure.
> 
> I've had cases in the past where just occasionally blasting the cache
> away would've been good enough.

Thank you very much for the valuable suggestions. I still would
like to solve this problem, and a counter freely running at minute
(or several-second) resolution, combined with pruning of
too-long-unaccessed entries on resizing, seems to me to work well
enough for at least several known bloat cases. This still has the
defect that it does not cope with very quick bloating. I'll keep
thinking about that remaining issue.

If no one has immediate objection to the direction, I'll come up
with an implementation.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] UPDATE of partition key

2017-12-13 Thread Amit Langote
Thanks for the updated patches, Amit.

Some review comments.

Forgot to remove the description of update_rri and num_update_rri in the
header comment of ExecSetupPartitionTupleRouting().

-
+extern void pull_child_partition_columns(Relation rel,
+ Relation parent,
+ Bitmapset **partcols);

It seems you forgot to remove this declaration in partition.h, because I
don't find it defined or used anywhere.

I think some of the changes that are currently part of the main patch are
better taken out into their own patches, because having those diffs appear
in the main patch is kind of distracting.  Just like you now have a patch
that introduces a PartitionTupleRouting structure.  I know that leads to
too many patches, but it helps to easily tell less substantial changes
from the substantial ones.

1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps.

2. Patch that introduces has_partition_attrs() in place of
   is_partition_attr()

3. Patch to change the names of map_partition_varattnos() arguments

4. Patch that does the refactoring involving ExecConstraints(),
   ExecPartitionCheck(), and the introduction of
   ExecPartitionCheckEmitError()


Regarding ExecSetupChildParentMap(), it seems to me that it could simply
be declared as

static void ExecSetupChildParentMap(ModifyTableState *mtstate);

Looking at the places from where it's called, it seems that you're just
extracting information from mtstate and passing the same for the rest of
its arguments.

mt_is_tupconv_perpart seems like it's unnecessary.  Its function could be
fulfilled by inspecting the state of some other fields of
ModifyTableState.  For example, in the case of an update (operation ==
CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always
assume that mt_childparent_tupconv_maps has entries for all partitions.
If it's NULL, then there would be only entries for partitions that have
sub-plans.

tupconv_map_for_subplan() looks like it could be done as a macro.

Thanks,
Amit




Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-12-13 Thread Amit Kapila
On Thu, Dec 14, 2017 at 2:32 AM, Robert Haas  wrote:
> On Tue, Dec 12, 2017 at 9:37 PM, Amit Kapila  wrote:
>>> Uh, should I just revert that commit entirely first, and then we can
>>> commit the new fix afterward?
>>
>> Yes. I have already extracted the test case of that commit to the new
>> patch which is what we need from that commit.
>
> OK, done.
>

Thanks.  I think now we can proceed with
fix_accum_instr_parallel_workers_v8.patch posted above which will fix
the original issue and the problem we have found in sort and hash
nodes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD

2017-12-13 Thread Amit Kapila
On Thu, Dec 14, 2017 at 2:51 AM, Robert Haas  wrote:
> On Wed, Dec 13, 2017 at 12:35 AM, Kuntal Ghosh
>  wrote:
>> I've also verified the backward scan case with the query provided by
>> Thomas. In standby,
>> 2. explain analyze select * from t1 where a+1>a order by a desc; and
>> the parallel workers hang.
>> The patch fixes the issue.
>
> Committed and back-patched to v10.
>

Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL

2017-12-13 Thread Ali Akbar
2017-12-13 15:37 GMT+07:00 Amit Langote :

> On 2017/12/13 15:59, Ali Akbar wrote:
> >
> > Thanks for the link to those thread.
> >
> > Judging from the discussion there, it will be a long way to prevent DROP
> > NOT NULL.
>
> Yeah, I remembered that discussion when writing my email, but was for some
> reason convinced that everything's fine even without the elaborate
> book-keeping of inheritance information for NOT NULL constraints.  Thanks
> Michael for reminding.
>

Patch for adding a check in pg_upgrade. Going through the git history, the
check for inconsistency in NOT NULL constraints has been there for a long
time. In this patch the check is applied to all old cluster versions. I'm
not sure in which version table inheritance was introduced.
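
For reference, the inconsistency being checked for can arise like this on
9.5/9.6/10 (a sketch; the failure is the one in $subject):

create table parent (a int not null);
create table child () inherits (parent);
alter table child alter column a drop not null;   -- allowed on these versions

pg_dump then emits child.a without NOT NULL, and pg_upgrade's schema
restore fails with
ERROR:  column "a" in child table must be marked NOT NULL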

Thanks,
Ali Akbar
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1b9083597c..29bafdff74 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -26,6 +26,7 @@ static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
+static void check_for_not_null_inheritance(ClusterInfo *cluster);
 static char *get_canonical_locale_name(int category, const char *locale);
 
 
@@ -99,6 +100,7 @@ check_and_dump_old_cluster(bool live_check)
 	check_for_prepared_transactions(&old_cluster);
 	check_for_reg_data_type_usage(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);
+	check_for_not_null_inheritance(&old_cluster);
 
 	/*
 	 * Pre-PG 10 allowed tables with 'unknown' type columns and non WAL logged
@@ -1096,6 +1098,105 @@ check_for_pg_role_prefix(ClusterInfo *cluster)
 	check_ok();
 }
 
+/*
+ * check_for_not_null_inheritance()
+ *
+ *  Currently pg_upgrade will fail if there are any inconsistencies in NOT NULL
+ *  constraint inheritance: In Postgres version 9.5, 9.6, 10 we can have a column
+ *  that is NOT NULL in parent, but nullabe in its children. But during schema
+ *  restore, that will cause error.
+ */
+static void
+check_for_not_null_inheritance(ClusterInfo *cluster)
+{
+	int			dbnum;
+	FILE	   *script = NULL;
+	bool		found = false;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for NOT NULL inconsistencies in inheritance");
+
+	snprintf(output_path, sizeof(output_path), "not_null_inconsistent_columns.txt");
+
+	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+	{
+		PGresult   *res;
+		bool		db_used = false;
+		int			ntups;
+		int			rowno;
+		int			i_nspname,
+	i_relname,
+	i_attname;
+		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+		PGconn	   *conn = connectToServer(cluster, active_db->db_name);
+
+		res = executeQueryOrDie(conn,
+"WITH RECURSIVE parents AS ( "
+"	SELECT	i.inhrelid, i.inhparent "
+"	FROM	pg_catalog.pg_inherits i "
+"	UNION ALL "
+"	SELECT	p.inhrelid, i.inhparent "
+"   FROM	parents p "
+"	JOIN	pg_catalog.pg_inherits i "
+"		ON	i.inhrelid = p.inhparent "
+") "
+"SELECT	n.nspname, c.relname, ac.attname  "
+"FROM	parents p, "
+"		pg_catalog.pg_attribute ac, "
+"		pg_catalog.pg_attribute ap, "
+"		pg_catalog.pg_class c, "
+"		pg_catalog.pg_namespace n "
+"WHERE	NOT ac.attnotnull AND "
+"		ac.attrelid = p.inhrelid AND "
+"		ap.attrelid = p.inhparent AND "
+"		ac.attname = ap.attname AND "
+"		ap.attnotnull AND "
+"		c.oid = ac.attrelid AND "
+"		c.relnamespace = n.oid");
+
+		ntups = PQntuples(res);
+		i_nspname = PQfnumber(res, "nspname");
+		i_relname = PQfnumber(res, "relname");
+		i_attname = PQfnumber(res, "attname");
+		for (rowno = 0; rowno < ntups; rowno++)
+		{
+			found = true;
+			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+pg_fatal("could not open file \"%s\": %s\n", output_path,
+		 strerror(errno));
+			if (!db_used)
+			{
+fprintf(script, "Database: %s\n", active_db->db_name);
+db_used = true;
+			}
+			fprintf(script, "  %s.%s.%s\n",
+	PQgetvalue(res, rowno, i_nspname),
+	PQgetvalue(res, rowno, i_relname),
+	PQgetvalue(res, rowno, i_attname));
+		}
+
+		PQclear(res);
+
+		PQfinish(conn);
+	}
+
+	if (script)
+		fclose(script);
+
+	if (found)
+	{
+		pg_log(PG_REPORT, "fatal\n");
+		pg_fatal("Your installation contains has inconsistencies in NOT NULL\n"
+ "constraint inheritance: child column is not market NOT NULL\n"
+ "while parent column has NOT NULL constraint. You can fix the\n"
+ "inconsistency with adding NOT NULL constraint and restart the\n"
+ "upgrade.\n"
+ "A list of the problem columns is in the file:\n"
+ "%s\n\n", output_path);
+	}
+	else
+		check_ok();
+}
 
 /*
  * get_canonical_locale_name


access/parallel.h lacks PGDLLIMPORT

2017-12-13 Thread Thomas Munro
Hi hackers,

I assume that extensions are supposed to be allowed to use the
facilities in access/parallel.h.  I noticed in passing, when I wrote a
throwaway test harness, that my Windows build drone complained:

test_sharedtuplestore.obj : error LNK2001: unresolved external symbol
ParallelWorkerNumber
[C:\projects\postgres\test_sharedtuplestore.vcxproj]
.\Release\test_sharedtuplestore\test_sharedtuplestore.dll : fatal
error LNK1120: 1 unresolved externals
[C:\projects\postgres\test_sharedtuplestore.vcxproj]

I suppose that all three of these might need PGDLLIMPORT, if they're
part of the API for parallel worker management:

extern volatile bool ParallelMessagePending;
extern int  ParallelWorkerNumber;
extern bool InitializingParallelWorker;

I'm less sure about the other two but at least ParallelWorkerNumber is
essential for anything that needs to coordinate access to input/output
arrays or similar.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

2017-12-13 Thread Amit Langote
Hi David.

On 2017/12/13 18:48, David Rowley wrote:
> On 12 December 2017 at 22:13, Amit Langote
>  wrote:
>> Attached updated patches.
> 
> Thanks for sending the updated patches.
> 
> I don't have a complete review at the moment, but the following code
> in set_append_rel_pathlist() should be removed.
> 
> /* append_rel_list contains all append rels; ignore others */
> if (appinfo->parent_relid != parentRTindex)
> continue;

Will fix that right away, thanks.

Thanks,
Amit




Re: [HACKERS] Parallel Hash take II

2017-12-13 Thread Andres Freund
Hi,

Looking at the latest version of the tuplestore patch:


diff --git a/src/backend/utils/sort/sharedtuplestore.c 
b/src/backend/utils/sort/sharedtuplestore.c
new file mode 100644
index 000..d1233221a58
--- /dev/null
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -0,0 +1,583 @@
+/*-
+ *
+ * sharedtuplestore.c
+ *   Simple mechanism for sharing tuples between backends.
+ *
+ * This module provides a shared temporary tuple storage mechanism, providing
+ * a parallel-aware subset of the features of tuplestore.c.  Multiple backends
+ * can write to a SharedTuplestore, and then multiple backends can later scan
+ * the stored tuples.  Currently, the only scan type supported is a parallel
+ * scan where each backend reads an arbitrary subset of the tuples that were
+ * written.

Cool.


+/* Chunk written to disk. */
+typedef struct SharedTuplestoreChunk
+{
+   int     ntuples;    /* Number of tuples in this chunk. */
+   bool    overflow;   /* Continuation of previous chunk? */
+   char    data[FLEXIBLE_ARRAY_MEMBER];
+} SharedTuplestoreChunk;

Ah. I was thinking we could have the 'overflow' variable be an int,
indicating the remaining length of the oversized tuple. That'd allow us
to skip ahead to the end of the oversized tuple in concurrent processes
after hitting it.



+/*
+ * Write a tuple.  If a meta-data size was provided to sts_initialize, then a
+ * pointer to meta data of that size must be provided.
+ */
+void
+sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
+MinimalTuple tuple)
+{
+   size_t  size;
+
+   /* Do we have our own file yet? */
+   if (accessor->write_file == NULL)
+   {
+   SharedTuplestoreParticipant *participant;
+   char    name[MAXPGPATH];
+
+   /* Create one.  Only this backend will write into it. */
+   sts_filename(name, accessor, accessor->participant);
+   accessor->write_file = BufFileCreateShared(accessor->fileset, 
name);
+
+   /* Set up the shared state for this backend's file. */
+   participant = &accessor->sts->participants[accessor->participant];
+   participant->writing = true;/* for assertions only */
+   }
+
+   /* Do we have space? */
+   size = accessor->sts->meta_data_size + tuple->t_len;
+   if (accessor->write_pointer + size >= accessor->write_end)
+   {
+   if (accessor->write_chunk == NULL)
+   {
+   /* First time through.  Allocate chunk. */
+   accessor->write_chunk = (SharedTuplestoreChunk *)
+   MemoryContextAllocZero(accessor->context,
+  
STS_CHUNK_PAGES * BLCKSZ);
+   accessor->write_chunk->ntuples = 0;
+   accessor->write_pointer = &accessor->write_chunk->data[0];
+   accessor->write_end = (char *)
+   accessor->write_chunk + STS_CHUNK_PAGES * 
BLCKSZ;
+   }
+   else
+   {
+   /* See if flushing helps. */
+   sts_flush_chunk(accessor);
+   }
+
+   /* It may still not be enough in the case of a gigantic tuple. 
*/
+   if (accessor->write_pointer + size >= accessor->write_end)
+   {
+   size_t  written;
+
+   /*
+* We'll write the beginning of the oversized tuple, 
and then
+* write the rest in some number of 'overflow' chunks.
+*/
+   if (accessor->write_pointer + 
accessor->sts->meta_data_size >=
+   accessor->write_end)
+   elog(ERROR, "meta-data too long");

That seems more like an Assert than a proper elog? Given that we're
calculating size just a few lines above...


+   if (accessor->sts->meta_data_size > 0)
+   memcpy(accessor->write_pointer, meta_data,
+  accessor->sts->meta_data_size);
+   written = accessor->write_end - accessor->write_pointer 
-
+   accessor->sts->meta_data_size;
+   memcpy(accessor->write_pointer + 
accessor->sts->meta_data_size,
+  tuple, written);

Also, shouldn't the same Assert() be here as well if you have it above?

+static MinimalTuple
+sts_read_tuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{
+   MinimalTuple tuple;
+   uint32  size;
+   size_t  remaining_size;
+   size_t  this_chunk_size;
+  

Re: Top-N sorts verses parallelism

2017-12-13 Thread Jeff Janes
On Tue, Dec 12, 2017 at 10:46 PM, Thomas Munro <
thomas.mu...@enterprisedb.com> wrote:

> Hi hackers,
>
> The start-up cost of bounded (top-N) sorts is sensitive at the small
> end of N, and the
> comparison_cost * tuples * LOG2(2.0 * output_tuples) curve doesn't
> seem to correspond to reality.



Do we want the cost-estimate to be accurate for the worst case (which is
not terribly unlikely to happen, input order is opposite of desired output
order), or for the average case?

Cheers,

Jeff


Re: Top-N sorts verses parallelism

2017-12-13 Thread Robert Haas
On Wed, Dec 13, 2017 at 1:46 AM, Thomas Munro
 wrote:
> Hi hackers,
>
> The start-up cost of bounded (top-N) sorts is sensitive at the small
> end of N, and the
> comparison_cost * tuples * LOG2(2.0 * output_tuples) curve doesn't
> seem to correspond to reality.  Here's a contrived example that comes
> from a benchmark:
>
>   create table foo as
> select generate_series(1, 1000000)::int a, (random() * 1000)::int b;
>   analyze foo;
>   select * from foo order by b limit N;
>
> But the actual time doesn't really change on this random development
> system: it's around 300ms.  I stopped at limit 133 because for this
> particular query, at 134 a Gather Merge plan kicked in and I got
> degree 3 parallel sorting and I got my answer just shy of three times
> faster.  The speed up definitely paid for the parallel_setup_cost and
> only one tuple was sent from each worker to the leader because the
> limit was pushed down.

I get:

rhaas=# explain analyze select * from foo order by b limit 1;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=19425.00..19425.00 rows=1 width=8) (actual time=183.129..183.129 rows=1 loops=1)
   ->  Sort  (cost=19425.00..21925.00 rows=1000000 width=8) (actual time=183.128..183.128 rows=1 loops=1)
         Sort Key: b
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on foo  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.046..89.440 rows=1000000 loops=1)
 Planning time: 0.083 ms
 Execution time: 183.171 ms
(7 rows)
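
As a quick cross-check (rough arithmetic, assuming the default
cpu_operator_cost of 0.0025 and the usual comparison_cost of
2 x cpu_operator_cost for sorts), the formula Thomas quoted does
reproduce the planner's startup figure here:

$$14425.00 + (2 \times 0.0025) \times 10^{6} \times \log_2(2 \times 1) = 14425.00 + 5000.00 = 19425.00$$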

If we assume the costing of the Seq Scan node is right, then the
startup cost of the Sort node is too low.  For the Seq Scan node, each
1ms of execution time corresponds to 161.28 cost units.  If that were
accurate, then the 183.128 - 89.440 = 93.688 ms of startup time for the
Sort ought to correspond to a startup cost of 14425.00 + (93.688 *
161.28) = 29535.00, but the actual cost is only 19425.00.  In other
words, the ratio of the startup cost of the sort to the total cost of
the seq scan ought to be about 2:1, to match the ratio of the execution
times, but it's actually more like 4:3.

I also see that it switches to Gather Merge at LIMIT 134, but if I
lower random_page_cost and seq_page_cost from the default values to
0.1, then for me it switches earlier, at 63.  Of course, at that point
the sort cost is now 3.5x the Seq Scan cost, so now things have gone
too far in the other direction and we're still not getting the plan
right, but I am out of time to investigate this for the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

2017-12-13 Thread Peter Geoghegan
On Wed, Oct 18, 2017 at 12:45 PM, Peter Geoghegan  wrote:
> Bringing it back to the concrete freeze-the-dead issue, and the
> question of an XID-cutoff for safely interrogating CLOG: I guess it
> will be possible to assess a HOT chain as a whole. We can probably do
> this safely while holding a super-exclusive lock on the buffer. I can
> probably find a way to ensure this only needs to happen in a rare slow
> path, when it looks like the invariant might be violated but we need
> to make sure (I'm already following this pattern in a couple of
> places). Realistically, there will be some amount of "try it and see"
> here.

I would like to point out for the record/archives that I now believe
that Andres' pending do-over fix for the "Freeze the dead" bug [1]
will leave things in *much* better shape when it comes to
verification. Andres' patch neatly addresses *all* of the concerns
that I raised on this thread. The high-level idea of relfrozenxid as an
unambiguous cut-off point at which it must be safe to interrogate the
CLOG is restored.

Off hand, I'd say that the only interlock amcheck verification now
needs when verifying heap pages against the CLOG is a VACUUM-style
SHARE UPDATE EXCLUSIVE lock on the heap relation being verified. Every
heap tuple must either be observed to be frozen, or must only have
hint bits that are observably in agreement with CLOG. The only
complicated part is the comment that explains why this is
comprehensive and correct (i.e. does not risk false positives or false
negatives). We end up with something that is a bit like a "correct by
construction" design.

The fact that Andres also proposes to add a bunch of new defensive
"can't happen" hard elog()s (mostly by promoting assertions) should
validate the design of tuple + multixact freezing, in the same way
that I hope amcheck can.

[1] https://postgr.es/m/20171114030341.movhteyakqeqx...@alap3.anarazel.de
-- 
Peter Geoghegan



Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-13 Thread Robert Haas
On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila  wrote:
> I have tried to reproduce this situation by adding an error case in the
> worker (just before the worker returns success (the 'X' message)) and by
> having a breakpoint in WaitForParallelWorkersToFinish.  What I noticed
> is that the worker is not allowed to exit till it receives some ack from
> the master (basically the worker waits at SendProcSignal after sending
> the message) and the error is reported properly.

SendProcSignal doesn't look to me like it can do anything that can
block, so I don't understand this.

> So, if I understand correctly, this means that it will return an error
> even if there is a problem in one worker (exited or not started) even
> though other workers would have easily finished the work. Won't such
> an error lead to situations where, after running the query for one
> hour the user might see some failure just because one of the many
> workers is not started (or some similar failure) even though the query
> would have been completed without that?  If so, that doesn't sound
> like a sensible behavior.

I think it's the *only* sensible behavior.  parallel.c does not know
that the query would have completed successfully without that worker,
or at least it doesn't know that it would have returned the right
answer.  It *might* be the case that the caller wasn't depending on
that worker to do anything, but parallel.c doesn't know that.

> This also doesn't appear to be completely safe.  If we add
> proc_exit(1) after attaching to error queue (say after
> pq_set_parallel_master) in the worker, then it will lead to *hang* as
> anyone_alive will still be false and as it will find that the sender
> is set for the error queue, it won't return any failure.  Now, I think
> even if we check worker status (which will be stopped) and break after
> the new error condition, it won't work as it will still return zero
> rows in the case reported by you above.

Hmm, there might still be a problem there.  I was thinking that once
the leader attaches to the queue, we can rely on the leader reaching
"ERROR: lost connection to parallel worker" in HandleParallelMessages.
However, that might not work because nothing sets
ParallelMessagePending in that case.  The worker will have detached
the queue via shm_mq_detach_callback, but the leader will only
discover the detach if it actually tries to read from that queue.

> I think if, before assuming we don't need to scan locally, we add
> a check to see whether workers were started, then my patch should work
> fine and we don't need to add any additional checks.  Also, it won't
> create any performance issue as we will perform such a check only when
> force_parallel_mode is not off or parallel leader participation is off
> which I think are not common cases.

My instinct is that we're going to have a lot of subtle bugs if
clients of parallel.c can't count on nworkers_launched to be accurate.
If clients didn't need that value, we could just have nworkers and let
the callers figure it out, but nobody wants that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-12-13 Thread Robert Haas
On Tue, Dec 12, 2017 at 9:37 PM, Amit Kapila  wrote:
>> Uh, should I just revert that commit entirely first, and then we can
>> commit the new fix afterward?
>
> Yes. I have already extracted the test case of that commit to the new
> patch which is what we need from that commit.

OK, done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] A design for amcheck heapam verification

2017-12-13 Thread Peter Geoghegan
On Mon, Dec 11, 2017 at 9:35 PM, Michael Paquier
 wrote:
> +   /*
> +* Generate random seed, or use caller's.  Seed should always be a 
> positive
> +* value less than or equal to PG_INT32_MAX, to ensure that any random 
> seed
> +* can be recreated through callerseed if the need arises.  (Don't assume
> +* that RAND_MAX cannot exceed PG_INT32_MAX.)
> +*/
> +   seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
> This could use pg_backend_random() instead.

I don't see the need for a cryptographically secure source of
randomness for any current Bloom filter caller, including the test
harness. There are many uses of random() just like this throughout the
backend, and far more random() calls than pg_backend_random() calls.
srandom() seeds random per backend, ensuring non-deterministic
behavior across backends.

> +--
> +-- These tests don't produce any interesting output, unless they fail.  For 
> an
> +-- explanation of the arguments, and the values used here, see README.
> +--
> +SELECT test_bloomfilter(power => 23,
> +nelements => 838861,
> +seed => -1,
> +tests => 1);
> This could also test the reproducibility of the tests with a fixed
> seed number and at least two rounds, a low number of elements could be
> more appropriate to limit the run time.

The runtime is already dominated by pg_regress overhead. As it says in
the README, using a fixed seed in the test harness is pointless,
because it won't behave in a fixed way across platforms. As long as we
cannot ensure deterministic behavior, we may as well fully embrace
non-determinism.

> I would vote for simplifying the initialization routine and just
> directly let the caller decide it. Are there implementation
> complications if this is not a power of two? I cannot guess one by
> looking at the code.

Yes, there are. It's easier to reason about the implementation with
these restrictions.

> So I would expect people using this API
> are smart enough to do proper sizing. Note that some callers may
> accept even a larger false positive rate. In short, this basically
> brings us back to Thomas' point upthread with a size estimation
> routine, and an extra routine doing the initialization:
> https://www.postgresql.org/message-id/CAEepm=0dy53x1hg5dmyzmpv_kn99crxzqbto8gmiosxnyrx...@mail.gmail.com
> Did you consider it? Splitting the size estimation and the actual
> initialization has a lot of benefits in my opinion.

Callers don't actually need to worry about these details. I think it's
the wrong call to complicate the interface to simplify the
implementation.

Thomas was not arguing for the caller being able to specify a false
positive rate to work backwards from -- that's not an unreasonable
thing, but it seems unlikely to be of use to amcheck, or parallel hash
join. Thomas was arguing for making the Bloom filter suitable for use
with DSM. I ended up incorporating most of his feedback. The only
things that were not added were a DSM-oriented routine for getting the
amount of memory required and a separate DSM-oriented routine for
initializing the caller's shared memory buffer. That would likely be
inconvenient for most callers, and could be restrictive if and when
Bloom filters use resources other than memory, such as temp files.
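
For background on the sizing trade-off under discussion, the textbook
Bloom filter relationships (standard formulas, not specific to this
implementation) are:

$$p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\mathrm{opt}} = \frac{m}{n}\ln 2$$

With the roughly 10 bits per element used by the regression test above
(2^23 bits for 838861 elements), that works out to k around 7 and a
false positive rate on the order of 1%, assuming the implementation uses
close to the optimal number of hash functions.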

> +/*
> + * What proportion of bits are currently set?
> + *
> + * Returns proportion, expressed as a multiplier of filter size.
> + *
> [...]
> + */
> +double
> +bloom_prop_bits_set(bloom_filter *filter)
> Instead of that, having a function that prints direct information
> about the filter's state would be enough with the real number of
> elements and the number of bits set in the filter. I am not sure that
> using rates is a good idea, sometimes rounding can cause confusion.

The bloom filter doesn't track the number of elements itself. Callers
that care can do it themselves. bloom_prop_bits_set() isn't very
important, even compared to other types of instrumentation. It's
moderately useful as a generic indicator of whether or not the Bloom
filter was totally overwhelmed. That's about it.

-- 
Peter Geoghegan



Re: [HACKERS] Custom compression methods

2017-12-13 Thread Tomas Vondra


On 12/13/2017 05:55 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> On 12/13/2017 01:54 AM, Robert Haas wrote:
> 
>>> 3. Compression is only applied to large-ish values.  If you are just
>>> making the data type representation more compact, you probably want to
>>> apply the new representation to all values.  If you are compressing in
>>> the sense that the original data gets smaller but harder to interpret,
>>> then you probably only want to apply the technique where the value is
>>> already pretty wide, and maybe respect the user's configured storage
>>> attributes.  TOAST knows about some of that kind of stuff.
>>
>> Good point. One such parameter that I really miss is compression level.
>> I can imagine tuning it through CREATE COMPRESSION METHOD, but it does
>> not seem quite possible with compression happening in a datatype.
> 
> Hmm, actually isn't that the sort of thing that you would tweak using a
> column-level option instead of a compression method?
>   ALTER TABLE ALTER COLUMN SET (compression_level=123)
> The only thing we need for this is to make tuptoaster.c aware of the
> need to check for a parameter.
> 

Wouldn't that require some universal compression level, shared by all
supported compression algorithms? I don't think there is such a thing.

Defining one should not be extremely difficult, although I'm sure there
will be some cumbersome cases. For example, what if algorithm "a"
supports compression levels 0-10, and algorithm "b" only supports 0-3?

You could define 11 "universal" compression levels and map the four
levels of "b" onto them (somehow), but then everyone has to understand
how that "universal" mapping is defined.

Another issue is that there are algorithms without a compression level
(e.g. pglz does not have one, AFAICS), or with a somewhat different
notion of it (lz4 does not have levels, and instead has "acceleration",
which may be an arbitrary positive integer, so it's not really
compatible with a "universal" compression level).

So to me the

ALTER TABLE ALTER COLUMN SET (compression_level=123)

seems more like an unnecessary hurdle ...
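
Purely to illustrate the point (hypothetical syntax, not something the
patch implements), per-method options would sidestep the need for a
universal scale entirely:

-- hypothetical syntax, for illustration only
alter table t alter column payload
    set compression lz4 with (acceleration = 2);   -- lz4-specific knob
alter table t alter column payload
    set compression zlib with (level = 6);         -- a different knob space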

>>> I don't think TOAST needs to be entirely transparent for the
>>> datatypes.  We've already dipped our toe in the water by allowing some
>>> operations on "short" varlenas, and there's really nothing to prevent
>>> a given datatype from going further.  The OID problem you mentioned
>>> would presumably be solved by hard-coding the OIDs for any built-in,
>>> privileged compression methods.
>>
>> Stupid question, but what do you mean by "short" varlenas?
> 
> Those are varlenas with 1-byte header rather than the standard 4-byte
> header.
> 

OK, that's what I thought. But that is still pretty transparent to the
data types, no?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Surjective functional indexes

2017-12-13 Thread Konstantin Knizhnik

Thank you for review.

On 13.12.2017 14:29, Simon Riggs wrote:

On 4 December 2017 at 15:35, Konstantin Knizhnik
 wrote:

On 30.11.2017 05:02, Michael Paquier wrote:

On Wed, Sep 27, 2017 at 4:07 PM, Simon Riggs 
wrote:

On 15 September 2017 at 16:34, Konstantin Knizhnik
 wrote:


Attached please find yet another version of the patch.

Thanks. I'm reviewing it.

Two months later, this patch is still waiting for a review (you are
listed as well as a reviewer of this patch). The documentation of the
patch has conflicts, please provide a rebased version. I am moving
this patch to next CF with waiting on author as status.

Attached please find new patch with resolved documentation conflict.

I’ve not posted a review because it was my hope that I could fix up
the problems with this clever and useful patch and just commit it. I
spent 2 days trying to do that but some problems remain and I’ve run
out of time.

Patch 3 has got some additional features that on balance I don’t think
we need. Patch 1 didn’t actually work which was confusing when I tried
to go back to that.

Patch 3 Issues

* Auto tuning added 2 extra items to Stats for each relation.  That is
too high a price to pay, so we should remove that. If we can’t do
autotuning without that, please remove it.


Right now the size of this structure is 168 bytes. Adding two extra fields
increases it to 184 bytes, about 9%. Is that really so critical?

What is the typical/maximal number of relations in a database?
I cannot believe that there can be more than a thousand non-temporary
relations in any database.
Certainly an application can generate an arbitrary number of temporary
tables, but millions of such tables cause catalog bloat and will have an
awful impact on performance in any case.
And for any reasonable number of relations (< a million), the extra
memory used by this statistic is negligible.
Also, I think these two fields can be useful not only for projection
indexes: they can also be used by DBAs and the optimizer, because they
describe a relation's update pattern more precisely.


I really love your autotuning idea, and it would be a pity to reject it
just because of an extra 16 bytes per relation.

* Patch needs a proper test case added. We shouldn’t just have a
DEBUG1 statement in there for testing - very ugly. Please rewrite
tests using functions that show how many changes have occurred during
the current  statement/transaction.

* Parameter coding was specific to that section of code. We need a
capability to parse that parameter from other locations, just as we do
with all other reloptions. It looks like it was coded that way because
of difficulties with code placement. Please solve this with generic
code, not just code that solves only the current issue. I’d personally
be happier without any option at all, but if y’all want it, then
please add the code in the right place.

Sorry, I do not completely understand what you mean by "parameter coding".
Do you mean the IsProjectionFunctionalIndex function and the filling of
relation options in it?

It is the only reported issue which I do not understand how to fix.


* The new coding for heap_update() uses the phrase “warm”, which is
already taken by another patch. Please avoid confusing things by
re-using the same term for different things.
The use of these technical terms like projection and surjective
doesn’t seem to add anything useful to the patch

* Rename parameter to recheck_on_update

* Remove random whitespace changes in the patch

So at this stage the concept is good and works, but the patch is only
just at the prototype stage and nowhere near committable. I hope you
can correct these issues and if you do I will be happy to review and
perhaps commit.


Thank you once again for your efforts in improving this patch.
I will fix all reported issues.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: plpgsql test layout

2017-12-13 Thread Peter Eisentraut
On 12/12/17 22:59, Michael Paquier wrote:
> Attached is what I have some up with, based on Peter's v2.

This has been committed.  Thanks!

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: That mode-700 check on DATADIR again

2017-12-13 Thread David Steele

On 12/11/17 9:41 PM, Chapman Flack wrote:

I have, more or less, this classic question:

https://www.postgresql.org/message-id/4667C403.1070807%40t3go.de





However, when you stat a file with a POSIX ACL, you get shown the
ACL's 'mask' entry (essentially the ceiling of all the 'extra' ACL
entries) in place of the old-fashioned group bits. So in a
non-ACL-aware ls or stat, the above looks like:

[data]# ls -ld
drwxr-x---+ 22 postgres postgres 4096 Dec 11 18:14 .

... and postgres refuses to start because it mistakes the r-x for
'group' permissions. ACLs have added new semantics to POSIX
permissions, and postgres doesn't understand them when it makes this
hey-don't-shoot-your-foot check.


I'm working on a patch that allows $PGDATA to have group r-x so that a 
non-privileged user in the group can do a file-level backup.


However, it looks like it would work for your case as well based on your 
ACL.


I'm planning to have the patch ready sometime next week and I'll respond 
here once it goes into the CF.  Reviews would be welcome!


Thanks,
--
-David
da...@pgmasters.net



Re: WIP: a way forward on bootstrap data

2017-12-13 Thread Peter Eisentraut
On 12/13/17 04:06, John Naylor wrote:
> There doesn't seem to be any interest in bootstrap data at the moment,
> but rather than give up just yet, I've added a couple features to make
> a data migration more compelling:

I took a brief look at your patches, and there appear to be a number of
good cleanups in there at least.  But could you please send patches in
git format-patch format with commit messages, so we don't have to guess
what each patch does?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: Exclude unlogged tables from base backups

2017-12-13 Thread David Steele

On 12/13/17 10:04 AM, Stephen Frost wrote:


Just to be clear- the new base backup code doesn't actually *do* the
non-init fork removal, it simply doesn't include the non-init fork in
the backup when there is an init fork, right?


It does *not* do the unlogged non-init fork removal.  The code I 
refactored in reinit.c is about identifying the forks, not removing 
them.  That code is reused to determine what to exclude from the backup.
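
For anyone following along, the two forks in question are easy to see on
disk (an illustrative sketch; the paths and OIDs will differ):

create unlogged table u1 (a int);
select pg_relation_filepath('u1');   -- e.g. base/13212/16394

base/13212/16394        <- main fork: excluded from the backup
base/13212/16394_init   <- init fork: still backed up; recovery re-inits
                           the main fork from it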


I added the regression tests to ensure that the behavior of reinit.c is 
unchanged after the refactor.



We certainly wouldn't want a basebackup actually running around removing
the main fork for unlogged tables on a running and otherwise healthy
system. ;)


That would not be good.

--
-David
da...@pgmasters.net



Re: server crash with CallStmt

2017-12-13 Thread Peter Eisentraut
On 12/13/17 04:46, Ashutosh Bapat wrote:
> I think it shouldn't reach this code. Like a normal function
> invocation, an aggregate invocation should not be allowed in a CALL
> statement. Here's patch to fix it. I thought that window function
> invocations and casts had similar problem, but they are prohibited by
> backend grammar.

Committed your fix, thanks.  I took out the hint, because calling an
aggregate function is a bit more complicated than just replacing the
CALL by SELECT.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Something for the TODO list: deprecating abstime and friends

2017-12-13 Thread Mark Dilger

> On Dec 13, 2017, at 12:07 AM, Andres Freund  wrote:
> 
> Hi,
> 
> On 2017-07-17 12:54:31 -0700, Mark Dilger wrote:
>> These types provide a 4-byte datatype for storing real-world second
>> precision timestamps, as occur in many log files.
> 
> These seem to be getting less common IME, most products have higher
> resolution these days.  If this were nicely written, maintainable, and
> documented code my position would differ, but it really isn't.
> 
> 
>> That said, I am fully aware that these are deprecated and expect you
>> will remove them, at which time I'll have to keep them in my tree
>> and politely refuse to merge in your change which removes them.
> 
> It'd be way less work to package abstime as an extension if you want to
> do so.

After I proposed to submit a patch for the secondstamp datatype (which
I mentioned upthread), Robert objected to the idea of data on disk changing
meaning, which was part of the approach Tom seemed willing to accept.
Since I couldn't get both Tom and Robert on board with any particular
design, I silently withdrew from the development of any such patch.

This has happened on several proposals I have made on this list over
the years.  Unless there is fairly unanimous support from the committers,
I don't bother following through with a proposal, given the improbability
of it getting accepted.

I would happily finish and submit that prior proposal if there were general
agreement that it is a good design.  I have no interest in making abstime
into an extension, however.

mark




Re: PATCH: Exclude unlogged tables from base backups

2017-12-13 Thread Stephen Frost
David,

* David Steele (da...@pgmasters.net) wrote:
> On 12/12/17 8:48 PM, Stephen Frost wrote:
> > I don't think there is, because, as David points out, the unlogged
> > tables are cleaned up first and then WAL replay happens during recovery,
> > so the init fork will cause the relation to be overwritten, but then
> > later the logged 'drop table' and subsequent re-use of the relfilenode
> > to create a new table (or persistence change) will all be in the WAL and
> > will be replayed over top and will take care of this.
> 
> Files can be copied in any order, so if an OID is recycled the backup
> could copy its first, second, or nth incarnation.  It doesn't really
> matter since all of it will be clobbered by WAL replay.
> 
> The new base backup code just does the non-init fork removal in advance,
> following the same rules that would apply on recovery given the same
> file set.

Just to be clear- the new base backup code doesn't actually *do* the
non-init fork removal, it simply doesn't include the non-init fork in
the backup when there is an init fork, right?

We certainly wouldn't want a basebackup actually running around removing
the main fork for unlogged tables on a running and otherwise healthy
system. ;)

Thanks!

Stephen




Re: [HACKERS] eval_const_expresisions and ScalarArrayOpExpr

2017-12-13 Thread Dmitry Dolgov
> On 12 September 2017 at 15:29, Tom Lane  wrote:
>
> This patch no longer applies cleanly on HEAD, so here's a rebased version
> (no substantive changes).  As before, I think the most useful review task
> would be to quantify whether it makes planning noticeably slower.

I tried to experiment a bit with this patch; I hope it may be helpful.
Basically, I've made a number of pgbench runs against an instance with and
without the patch, assuming that, since the only difference would be in the
planner, this would show the performance impact of the patch. Also, I've
manually collected some statistics for "Planning time" from explain analyze
(I suppose that's also a valid test). As a test target I've used queries
like `select * from test where 1 in (1, 2)` (as a side note, the involved
table was empty) for different numbers of elements in the array, and for
situations where the condition is true or false.
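
(For reference, the benchmark script was presumably along these lines -- a
guess at const_eval.sql, not the actual file:)

-- hypothetical const_eval.sql contents
-- "short array" case:
select * from test where 1 in (1, 2);
-- "long array" case: the same query with many more constants in the IN
-- list, e.g. 1 in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);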

So, for:

pgbench -c 50 -s 100 -j 50 -n -t 1000 -f const_eval.sql

I've got this data (avg latency):

type          w/ patch    w/o patch
short array   5.203       5.564
long array    9.293       9.667

Just out of curiosity, I've also made the same test with constant arrays
instead of integers, but in this case the numbers with and without the
patch were almost identical.

For explain analyze (avg planning time):

 type          w/ patch    w/o patch
 short array   0.214       0.226
 long array    0.374       0.320

I'm not sure why, for the long array case with the patch, I see lower latency
(I thought more complexity would have a negative impact on performance) but a
bit longer planning time. In the end, though, the difference doesn't look
that big.

Speaking of the code, everything looks fine. Personally, I think some
variables could be named in a more self-explanatory way (e.g.
`ece_evaluate_expr`, where `ece` stands for `eval_const_expressions`, or
`saop`, which is `ScalarArrayOp`), but that's minor.

I wonder if it makes sense for this patch to also deal with the following
situation?

explain analyze select * from test where 1 in (select 1);
                                         QUERY PLAN
---------------------------------------------------------------------------------------------------
 Result  (cost=0.02..35.52 rows=2550 width=4) (actual time=0.036..0.036 rows=0 loops=1)
   One-Time Filter: (hashed SubPlan 1)
   ->  Seq Scan on test  (cost=0.02..35.52 rows=2550 width=4) (actual time=0.009..0.009 rows=0 loops=1)
   SubPlan 1
     ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1)
 Planning time: 0.381 ms
 Execution time: 0.360 ms

It looks quite similar from a user perspective (although since a subplan is
involved, I assume it may be quite different in terms of implementation).


Re: PATCH: Exclude unlogged tables from base backups

2017-12-13 Thread David Steele
On 12/12/17 8:48 PM, Stephen Frost wrote:
> Andres,
> 
> * Andres Freund (and...@anarazel.de) wrote:
>> On 2017-12-12 18:04:44 -0500, David Steele wrote:
>>> If the forks are written out of order (i.e. main before init), which is
>>> definitely possible, then I think worst case is some files will be backed up
>>> that don't need to be.  The main fork is unlikely to be very large at that
>>> point so it doesn't seem like a big deal.
>>>
>>> I don't see this as any different than what happens during recovery. The
>>> unlogged forks are cleaned / re-inited before replay starts which is the
>>> same thing we are doing here.
>>
>> It's quite different - in the recovery case there's no other write
>> activity going on. But on a normally running cluster the persistence of
>> existing tables can get changed, and oids can get recycled.  What
>> guarantees that between the time you checked for the init fork the table
>> hasn't been dropped, the oid reused and now a permanent relation is in
>> its place?
> 
> We *are* actually talking about the recovery case here because this is a
> backup that's happening and WAL replay will be happening after the
> pg_basebackup is done and then the backup restored somewhere and PG
> started up again.
> 
> If the persistence is changed then the table will be written into the
> WAL, no?  All of the WAL generated during a backup (which is what we're
> talking about here) has to be replayed after the restore is done and is
> before the database is considered consistent, so none of this matters,
> as far as I can see, because the drop table or alter table logged or
> anything else will be in the WAL that ends up getting replayed.

Yes - that's the way I see it.  At least when I'm not tired from a day
of coding like I was last night...

> I don't think there is, because, as David points out, the unlogged
> tables are cleaned up first and then WAL replay happens during recovery,
> so the init fork will cause the relation to be overwritten, but then
> later the logged 'drop table' and subsequent re-use of the relfilenode
> to create a new table (or persistence change) will all be in the WAL and
> will be replayed over top and will take care of this.

Files can be copied in any order, so if an OID is recycled the backup
could copy its first, second, or nth incarnation.  It doesn't really
matter since all of it will be clobbered by WAL replay.

The new base backup code just does the non-init fork removal in advance,
following the same rules that would apply on recovery given the same
file set.
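
(As an aside, for anyone who wants to see which relations this affects on
their own cluster, something like the query below lists unlogged relations
and their main-fork paths; on disk the init fork is the same path with an
"_init" suffix. This is just an illustration, not part of the patch.)

-- Illustration only: unlogged relations whose non-init forks a base backup
-- could skip; the init fork is the same path plus "_init".
SELECT c.oid::regclass AS relation,
       pg_relation_filepath(c.oid) AS main_fork_path
FROM pg_class c
WHERE c.relpersistence = 'u';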

-- 
-David
da...@pgmasters.net





Re: [HACKERS] Custom compression methods

2017-12-13 Thread Ildus Kurbangaliev
On Tue, 12 Dec 2017 15:52:01 -0500
Robert Haas  wrote:

> 
> Yes.  I wonder if \d or \d+ can show it somehow.
> 

Yes, in the current version of the patch, \d+ shows the current compression.
It can be extended to show a list of current compression methods.

Since we agreed on the ALTER syntax, I want to clarify things about CREATE.
Should it be CREATE ACCESS METHOD .. TYPE COMPRESSION or CREATE
COMPRESSION METHOD? I like the access method approach, and it
simplifies the code, but I'm just not sure whether compression is an access
method or not.

Current implementation
--

To avoid extra patches I also want to clarify things about the current
implementation. Right now there are two tables, "pg_compression" and
"pg_compression_opt". When a compression method is linked to a column, it
creates a record in pg_compression_opt. This record's Oid is stored in
the varlena. These Oids are kept in the first column so I can move them in
pg_upgrade, but in all other aspects they behave like usual Oids. They are
also easy to restore.

Compression options are linked to a specific column. When a tuple is
moved between relations it will be decompressed.

Also, in the current implementation SET COMPRESSION has a WITH clause
which is used to provide extra options to the compression method.

What could be changed
-

As Alvaro mentioned, a COMPRESSION METHOD is practically an access method,
so it could be created as CREATE ACCESS METHOD .. TYPE COMPRESSION.
This approach simplifies the patch, and the "pg_compression" table could be
removed. So a compression method would be created with something like:

CREATE ACCESS METHOD .. TYPE COMPRESSION HANDLER
awesome_compression_handler;

The syntax of SET COMPRESSION changes to SET COMPRESSION .. PRESERVE, which
is useful to control rewrites and lets pg_upgrade record dependencies
between moved compression options and compression methods from the pg_am
table.

Default compression is always pglz, and if users want to change it they run:

ALTER COLUMN  SET COMPRESSION awesome PRESERVE pglz;

Without PRESERVE it will rewrite the whole relation using the new
compression. The rewrite also removes all unlisted compression options,
so their compression methods can be safely dropped.

The "pg_compression_opt" table could be renamed to "pg_compression", and
compression options would be stored there.

I'd like to keep the extra compression options; for example, pglz can be
configured with them. The syntax would change slightly:

SET COMPRESSION pglz WITH (min_comp_rate=25) PRESERVE awesome;

Setting the same compression method with different options will create a
new compression options record for future tuples, but will not rewrite the
table.
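
To make the flow above concrete, here is a rough end-to-end sketch under the
proposed syntax; the handler function, method, table and column names are all
made up, and wrapping ALTER COLUMN in ALTER TABLE is my assumption about the
final grammar:

-- register a compression access method backed by a handler function
CREATE ACCESS METHOD awesome TYPE COMPRESSION
    HANDLER awesome_compression_handler;

-- switch a column to it, keeping already-stored pglz tuples readable
ALTER TABLE measurements
    ALTER COLUMN payload SET COMPRESSION awesome PRESERVE pglz;

-- pass method-specific options, as in the pglz example above
ALTER TABLE measurements
    ALTER COLUMN payload SET COMPRESSION pglz WITH (min_comp_rate=25) PRESERVE awesome;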

-- 

Regards,
Ildus Kurbangaliev



Re: [HACKERS] Surjective functional indexes

2017-12-13 Thread Simon Riggs
On 4 December 2017 at 15:35, Konstantin Knizhnik
 wrote:
> On 30.11.2017 05:02, Michael Paquier wrote:
>>
>> On Wed, Sep 27, 2017 at 4:07 PM, Simon Riggs 
>> wrote:
>>>
>>> On 15 September 2017 at 16:34, Konstantin Knizhnik
>>>  wrote:
>>>
 Attached please find yet another version of the patch.
>>>
>>> Thanks. I'm reviewing it.
>>
>> Two months later, this patch is still waiting for a review (you are
>> listed as well as a reviewer of this patch). The documentation of the
>> patch has conflicts, please provide a rebased version. I am moving
>> this patch to next CF with waiting on author as status.
>
> Attached please find new patch with resolved documentation conflict.

I’ve not posted a review because it was my hope that I could fix up
the problems with this clever and useful patch and just commit it. I
spent 2 days trying to do that but some problems remain and I’ve run
out of time.

Patch 3 has some additional features that on balance I don’t think
we need. Patch 1 didn’t actually work, which was confusing when I tried
to go back to it.

Patch 3 Issues

* Auto tuning added 2 extra items to Stats for each relation.  That is
too high a price to pay, so we should remove that. If we can’t do
autotuning without that, please remove it.

* Patch needs a proper test case added. We shouldn’t just have a
DEBUG1 statement in there for testing - very ugly. Please rewrite the
tests using functions that show how many changes have occurred during
the current statement/transaction (see the sketch after this list).

* Parameter coding was specific to that section of code. We need a
capability to parse that parameter from other locations, just as we do
with all other reloptions. It looks like it was coded that way because
of difficulties with code placement. Please solve this with generic
code, not just code that solves only the current issue. I’d personally
be happier without any option at all, but if y’all want it, then
please add the code in the right place.

* The new coding for heap_update() uses the phrase “warm”, which is
already taken by another patch. Please avoid confusing things by
re-using the same term for different things.
The use of technical terms like projection and surjective
doesn’t seem to add anything useful to the patch.

* Rename parameter to recheck_on_update

* Remove random whitespace changes in the patch
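
To illustrate the kind of test I mean, a rough sketch (the counting function,
table and index names are made up; a real regression test would assert on the
counter value rather than rely on DEBUG output):

-- Sketch only: count evaluations of the index expression instead of
-- relying on a DEBUG1 message.
CREATE SEQUENCE index_expr_calls;

CREATE FUNCTION counted_lower(t text) RETURNS text AS $$
BEGIN
    PERFORM nextval('index_expr_calls');
    RETURN lower(t);
END;
$$ LANGUAGE plpgsql IMMUTABLE;

CREATE TABLE docs (id int PRIMARY KEY, title text);
CREATE INDEX docs_title_idx ON docs (counted_lower(title));

INSERT INTO docs VALUES (1, 'Hello');   -- evaluates the expression once

-- update that leaves the indexed value unchanged; with the patch this
-- should trigger the recheck but not a new index entry
UPDATE docs SET title = title WHERE id = 1;

-- total number of expression evaluations so far; a test would compare
-- this against the expected count
SELECT last_value FROM index_expr_calls;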

So at this stage the concept is good and works, but the patch is only
just at the prototype stage and nowhere near committable. I hope you
can correct these issues and if you do I will be happy to review and
perhaps commit.

Thanks

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-13 Thread Alexander Korotkov
On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev  wrote:

>> 0002-pg-trgm-strict_word-similarity.patch – implementation of
>> strict_word_similarity() with comments, docs and tests.
>
> After looking into it a bit:
>
> 1)
> repeated piece of code:
> +   if (strategy == SimilarityStrategyNumber)
> +   nlimit = similarity_threshold;
> +   else if (strategy == WordSimilarityStrategyNumber)
> +   nlimit = word_similarity_threshold;
> +   else /* strategy == StrictWordSimilarityStrategyNumber */
> +   nlimit = strict_word_similarity_threshold;
> Isn't it better to move that piece to a separate function?
>

Good point.  Moved to separate function.

> 2)
> calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
>                      bool check_only, bool word_bounds)
>
> Seems the two bool args are replaceable by a bitwise-ORed flag. It will
> simplify adding new options in the future.


Yep.  I've introduced flags.

Also, I've adjusted the tests to make them stable (found an example where
the TOP-8 distances are unique).
Please find the revised patch attached.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0002-pg-trgm-strict_word-similarity-2.patch
Description: Binary data


Re: [HACKERS] path toward faster partition pruning

2017-12-13 Thread David Rowley
On 12 December 2017 at 22:13, Amit Langote
 wrote:
> Attached updated patches.

Thanks for sending the updated patches.

I don't have a complete review at the moment, but the following code
in set_append_rel_pathlist() should be removed.

/* append_rel_list contains all append rels; ignore others */
if (appinfo->parent_relid != parentRTindex)
continue;

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: server crash with CallStmt

2017-12-13 Thread Ashutosh Bapat
On Wed, Dec 13, 2017 at 12:55 PM, Rushabh Lathia
 wrote:
> HI,
>
> Consider the below test:
>
> postgres@110311=#call sum(2+2);
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> Call stack:
>
> (gdb) bt
> #0  0x7fef9b5851d7 in raise () from /lib64/libc.so.6
> #1  0x7fef9b5868c8 in abort () from /lib64/libc.so.6
> #2  0x009ec5b5 in ExceptionalCondition (conditionName=0xb7e12b
> "!(((bool) 0))", errorType=0xb7e04c "FailedAssertion",
> fileName=0xb7e040 "parse_agg.c", lineNumber=349) at assert.c:54
> #3  0x005dc3b6 in check_agglevels_and_constraints (pstate=0x2d1ba90,
> expr=0x2d1bd80) at parse_agg.c:349
> #4  0x005dc104 in transformAggregateCall (pstate=0x2d1ba90,
> agg=0x2d1bd80, args=0x2d1c3c0, aggorder=0x0, agg_distinct=0 '\000') at
> parse_agg.c:237
> #5  0x005f4df7 in ParseFuncOrColumn (pstate=0x2d1ba90,
> funcname=0x2d10578, fargs=0x2d1c3c0, last_srf=0x0, fn=0x2d10760, proc_call=1
> '\001',
> location=5) at parse_func.c:726
>
>
> Here it's hitting Assert() because pstate->p_expr_kind is set
> to EXPR_KIND_NONE for the CallStmt function resolution.
>

I think it shouldn't reach this code. Like a normal function
invocation, an aggregate invocation should not be allowed in a CALL
statement. Here's a patch to fix it. I thought that window function
invocations and casts had a similar problem, but they are prohibited by
the backend grammar.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 2f20516..5ba6065 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -336,6 +336,16 @@ ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
 		Form_pg_aggregate classForm;
 		int			catDirectArgs;
 
+		if (proc_call)
+			ereport(ERROR,
+					(errcode(ERRCODE_UNDEFINED_FUNCTION),
+					 errmsg("%s is not a procedure",
+							func_signature_string(funcname, nargs,
+												  argnames,
+												  actual_arg_types)),
+					 errhint("To call an aggregate function, use SELECT."),
+					 parser_errposition(pstate, location)));
+
 		tup = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(funcid));
 		if (!HeapTupleIsValid(tup)) /* should not happen */
 			elog(ERROR, "cache lookup failed for aggregate %u", funcid);
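
With this check in place, the original test case should fail cleanly instead
of hitting the assertion -- roughly like this (illustrative; the exact text
comes from the errmsg/errhint above):

CALL sum(2+2);
ERROR:  sum(integer) is not a procedure
HINT:  To call an aggregate function, use SELECT.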


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2017-12-13 Thread Masahiko Sawada
On Wed, Dec 13, 2017 at 4:30 PM, Andres Freund  wrote:
> On 2017-12-13 16:02:45 +0900, Masahiko Sawada wrote:
>> When we add extra blocks to a relation, do we access the disk? I
>> guess we just call lseek and write and don't access the disk. If so,
>> the performance regression might not be much.
>
> Usually changes in the file size require the filesystem to perform
> metadata operations, which in turn requires journaling on most
> FSs. Which'll often result in synchronous disk writes.
>

Thank you. Now I understand why this measurement should use two
different filesystems.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Something for the TODO list: deprecating abstime and friends

2017-12-13 Thread Andres Freund
Hi,

On 2017-07-17 12:54:31 -0700, Mark Dilger wrote:
> These types provide a 4-byte datatype for storing real-world second
> precision timestamps, as occur in many log files.

These seem to be getting less common IME; most products have higher
resolution these days.  If this were nicely written, maintainable, and
documented code my position would differ, but it really isn't.


> That said, I am fully aware that these are deprecated and expect you
> will remove them, at which time I'll have to keep them in my tree
> and politely refuse to merge in your change which removes them.

It'd be way less work to package abstime as an extension if you want to
do so.

Greetings,

Andres Freund