Re: csv format for psql

2018-02-05 Thread Pavel Stehule
2018-01-31 13:58 GMT+01:00 Daniel Verite :

> Pavel Stehule wrote:
>
> > This format is too important, so some special short or long option can be
> > practical (it will be printed in help)
> >
> > some like --csv
>
> I guess -C/--csv could be used, like there already is -H/--html.
>

I prefer just the long option. Maybe in the future we can use the short "C"
for something better. There are only a few free short options left.

Regards

Pavel



>
> > I found one issue - PostgreSQL default field separator is "|". Maybe good
> > time to use more common "," ?
> >
> > Or when field separator was not explicitly defined, then use "," for CSV,
> > and "|" for other. Although it can be little bit messy
>
> Currently there's a strong analogy between the unaligned
> mode and csv mode. In particular they use the existing pset
> variables fieldsep, fieldsep_zero, recordsep, recordsep_zero
> in the same way. If we want to make csv special with regard
> to the delimiters, that complicates the user interface.
> For instance if fieldsep was changed automatically by
> \pset format csv, it's not obvious if/when we should switch it
> back to its previous state, and how the fieldsep switch done
> automatically would mix or conflict with other \pset
> commands issued by the user.
> Or we need to duplicate these variables. Or duplicate
> only fieldsep, having a fieldsep_csv, leaving out the
> other variables and not being as close to the unaligned
> mode.
> These options don't appeal to me much compared
> to the simplicity of the current patch.
>
> Also, although the comma is the separator defined by the
> RFC4180 and the default for COPY CSV, people also use the
> semicolon extensively (because it's what Excel does I guess),
> which somehow mitigates the importance of comma as the
> default value.
>
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: JIT compiling with LLVM v9.1

2018-02-05 Thread Pierre Ducroquet
On Sunday, February 4, 2018 12:45:50 AM CET Andreas Karlsson wrote:
> On 02/02/2018 10:48 AM, Pierre Ducroquet wrote:
> > I have successfully built the JIT branch against LLVM 4.0.1 on Debian
> > testing. This is not enough for Debian stable (LLVM 3.9 is the latest
> > available there), but it's a first step.
> > I've split the patch in four files. The first three fix the build issues,
> > the last one fixes a runtime issue.
> > I think they are small enough to not be a burden for you in your
> > developments. But if you don't want to carry these ifdefs right now, I
> > maintain them in a branch on a personal git and rebase as frequently as I
> > can.
> 
> I tested these patches and while the code built for me and passed the
> test suite on Debian testing I have a weird bug where the very first
> query fails to JIT while the rest work as they should. I think I need to
> dig into LLVM's codebase to see what it is, but can you reproduce this
> bug at your machine?
> 
> Code to reproduce:
> 
> SET jit_expressions = true;
> SET jit_above_cost = 0;
> SELECT 1;
> SELECT 1;
> 
> Output:
> 
> postgres=# SELECT 1;
> ERROR:  failed to jit module
> postgres=# SELECT 1;
>   ?column?
> --
>  1
> (1 row)
> 
> Config:
> 
> Version: Your patches applied on top of
> 302b7a284d30fb0e00eb5f0163aa933d4d9bea10
> OS: Debian testing
> llvm/clang: 4.0.1-8
> 
> Andreas

Hi

Indeed, thanks for reporting this. I scripted the testing but failed to see
it because I forgot to set ON_ERROR_STOP.
I will look into this and fix it.

Thanks

 Pierre



Re: Boolean partitions syntax

2018-02-05 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 02 Feb 2018 18:04:44 -0500, Tom Lane  wrote in 
<14732.1517612...@sss.pgh.pa.us>
> Robert Haas  writes:
> > On Fri, Feb 2, 2018 at 4:40 PM, Peter Eisentraut
> >  wrote:
> >> There might be other options, but one way to solve this would be to
> >> treat partition bounds as a general expression in the grammar and then
> >> check in post-parse analysis that it's a constant.
> 
> > Yeah -- isn't the usual way of handling this to run the user's input
> > through eval_const_expressions and see if the result is constant?

At Mon, 29 Jan 2018 13:21:54 +0900, Amit Langote 
 wrote in 
<6844d7f9-8219-a9ff-88f9-82c05fc90...@lab.ntt.co.jp>
> Partition bound literals as captured by gram.y don't have any type
> information attached.  They're carried over in a A_Const to
> transformPartitionBoundValue() and coerced to the target partition key
> type there.  Note that each of the partition bound literal datums
> received from gram.y is castNode(A_Const)'d in parse_utilcmds.c.

eval_const_expressions is already run under
transformPartitionBoundValue to resolve remaining coercions.  I
suppose we can use AexprConst to restrict the syntax to the
appropriate variations.  Please find the attached patch.


It allows the following syntax as a by-product.

| create table c4 partition of t for values in (numeric(1,0) '5');

The parser accepts arbitrary defined types, but that does no harm.

| create table c2 partition of t for values from (line '{0,1,0}') to (1);
| ERROR:  specified value cannot be cast to type double precision for column "a"

It rejects unacceptable functions but the message may look
somewhat unfriendly.

| =# create table c1 partition of t for values in (random());
| ERROR:  syntax error at or near ")"
| LINE 1: create table c1 partition of t for values in (random());
|  ^
(marker is placed under the closing parenthesis of "random()")

| =# create table c1 partition of t for values in (random(0) 'x');
| ERROR:  type "random" does not exist
| LINE 1: create table c1 partition of t for values in (random(0) 'x')...
(marker is placed under the first letter of the "random".)


> Not sure we want to go quite that far: at minimum it would imply invoking
> arbitrary stuff during a utility statement, which we generally try to
> avoid.  Still, copy-from-query does that, so we can certainly make it
> work if we wish.
> 
> Perhaps more useful to discuss: would that truly be the semantics we want,
> or should we just evaluate the expression and have done?  It's certainly
> arguable that "IN (random())" ought to draw an error, not compute some
> random value and use that.  But if you are insistent on partition bounds
> being immutable in any strong sense, you've already got problems, because
> e.g. a timestamptz literal's interpretation isn't necessarily fixed.
> It's only after we've reduced the original input to Datum form that we
> can make any real promises about the value not moving.  So I'm not seeing
> where is the bright line between "IN ('today')" and "IN (random())".
> 
>   regards, tom lane

The patch leaves the ambiguity of values like 'today' but doesn't
accept arbitrary functions. However, it needs additional messages
for errors that never happen, since the patch adds a new item to
ParseExprKind...

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 5329432..c5d8526 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2790,9 +2790,7 @@ hash_partbound:
 		;
 
 partbound_datum:
-			Sconst			{ $$ = makeStringConst($1, @1); }
-			| NumericOnly	{ $$ = makeAConst($1, @1); }
-			| NULL_P		{ $$ = makeNullAConst(@1); }
+		AexprConst			{ $$ = $1; }
 		;
 
 partbound_datum_list:
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index 6a9f1b0..9bbe9b1 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -506,6 +506,12 @@ check_agglevels_and_constraints(ParseState *pstate, Node *expr)
 			else
 				err = _("grouping operations are not allowed in partition key expression");
 
+		case EXPR_KIND_PARTITION_BOUNDS:
+			if (isAgg)
+				err = _("aggregate functions are not allowed in partition bounds value");
+			else
+				err = _("grouping operations are not allowed in partition bounds value");
+
 			break;
 
 		case EXPR_KIND_CALL:
@@ -891,6 +897,9 @@ transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
 		case EXPR_KIND_PARTITION_EXPRESSION:
 			err = _("window functions are not allowed in partition key expression");
 			break;
+		case EXPR_KIND_PARTITION_BOUNDS:
+			err = _("window functions are not allowed in partition bounds value");
+			break;
 		case EXPR_KIND_CALL:
 			err = _("window functions are not allowed in CALL arguments");
 			break;
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index b2f5e46..d1f9b02 100644
--- a/src/bac

Crash in partition-wise join involving dummy partitioned relation

2018-02-05 Thread Ashutosh Bapat
Hi,
I noticed a crash in partition-wise join involving dummy partitioned
tables. Here's a simple testcase:

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM') FROM
generate_series(0, 599) i WHERE i % 2 = 0;
ANALYZE prt1;

CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (250) TO (500);
CREATE TABLE prt2_p3 PARTITION OF prt2 FOR VALUES FROM (500) TO (600);
INSERT INTO prt2 SELECT i % 25, i, to_char(i, 'FM') FROM
generate_series(0, 599) i WHERE i % 3 = 0;
ANALYZE prt2;

SET enable_partition_wise_join TO true;

EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a = 1 AND
a = 2) t1 RIGHT JOIN prt2 t2 ON t1.a = t2.b, prt1 t3 WHERE t2.b =
t3.a;

t1 is an empty partitioned relation, with partition scheme matching
that of t2. Thus build_joinrel_partition_info() deems t1 RIGHT JOIN t2
as partitioned and sets part_scheme, nparts and other partition
properties except part_rels. Later in try_partition_wise_join(), the
function bails out since t1 is dummy, because of the following code:

/*
 * set_rel_pathlist() may not create paths in children of an empty
 * partitioned table and so we can not add paths to child-joins. So, deem
 * such a join as unpartitioned. When a partitioned relation is deemed
 * empty because all its children are empty, dummy path will be set in
 * each of the children.  In such a case we could still consider the join
 * as partitioned, but it might not help much.
 */
if (IS_DUMMY_REL(rel1) || IS_DUMMY_REL(rel2))
return;

So, part_rels is never set for relation t1 LEFT JOIN t2. When
build_joinrel_partition_info() processes (t1 LEFT JOIN t2, t3), it
expects part_rels to be set for (t1 LEFT JOIN t2) since it's deemed to
be partitioned and following assertion fails

Assert(REL_HAS_ALL_PART_PROPS(outer_rel) &&
   REL_HAS_ALL_PART_PROPS(inner_rel));

When I wrote this code, I thought that some join order of an N-way
join involving a dummy relation would have both the joining relations
partitioned with part_rels set i.e. child-join created. But that was a
wrong assumption. Any two-way join involving a dummy relation can not
have child-joins and hence can not be deemed as partitioned. For a 3
way join involving dummy relation, every two-way join involving that
dummy relation won't have child-joins and hence the 3 way join can not
have child-join. Similarly we can use induction to prove that any
N-way join involving a dummy relation will not have child-joins and
hence won't be partitioned. We can detect this case during
build_joinrel_partition_info(). One of the joining relations presented
to that function will involve the dummy relation and would have been
deemed as unpartitioned when it was processed. We don't need any dummy
relation handling in try_partition_wise_join().
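
To make the idea concrete, the detection in build_joinrel_partition_info()
could look roughly like the sketch below (only an illustration of the
approach -- the attached patch is authoritative; REL_HAS_ALL_PART_PROPS is
the same macro used by the failing assertion):

	/*
	 * Sketch only: if either input is missing some of the partition
	 * properties -- which is what happens when one of its component joins
	 * involved a dummy relation and was deemed unpartitioned -- leave this
	 * join unpartitioned too instead of asserting.
	 */
	if (!REL_HAS_ALL_PART_PROPS(outer_rel) ||
		!REL_HAS_ALL_PART_PROPS(inner_rel))
		return;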

Here's patch taking that approach.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5bff90e..6e842f9 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3425,20 +3425,8 @@ generate_partition_wise_join_paths(PlannerInfo *root, RelOptInfo *rel)
 	if (!IS_JOIN_REL(rel))
 		return;
 
-	/*
-	 * If we've already proven this join is empty, we needn't consider any
-	 * more paths for it.
-	 */
-	if (IS_DUMMY_REL(rel))
-		return;
-
-	/*
-	 * We've nothing to do if the relation is not partitioned. An outer join
-	 * relation which had an empty inner relation in every pair will have the
-	 * rest of the partitioning properties set except the child-join
-	 * RelOptInfos. See try_partition_wise_join() for more details.
-	 */
-	if (rel->nparts <= 0 || rel->part_rels == NULL)
+	/* We've nothing to do if the relation is not partitioned. */
+	if (!IS_PARTITIONED_REL(rel))
 		return;
 
 	/* Guard against stack overflow due to overly deep partition hierarchy. */
@@ -3452,6 +3440,8 @@ generate_partition_wise_join_paths(PlannerInfo *root, RelOptInfo *rel)
 	{
 		RelOptInfo *child_rel = part_rels[cnt_parts];
 
+		Assert(child_rel != NULL);
+
 		/* Add partition-wise join paths for partitioned child-joins. */
 		generate_partition_wise_join_paths(root, child_rel);
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index a35d068..f74afdb 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1319,17 +1319,6 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		return;
 
 

Re: WIP Patch: Precalculate stable functions, infrastructure v1

2018-02-05 Thread Marina Polyakova

Hello!

Thank you for reporting! I'll try to get it on our buildfarm..

On 05-02-2018 0:10, Thomas Munro wrote:

On Thu, Feb 1, 2018 at 6:01 PM, Marina Polyakova wrote:
This is the 8th version of the patch for the precalculation of stable or
immutable functions, stable or immutable operators and other nonvolatile
expressions. This is an attempt to fix most of the problems (I'm sorry, it
took some time..) that Tom Lane and Andres Freund mentioned in [1], [2] and
[3]. It is based on top of master, and on my computer make check-world
passes. And I'll continue work on it.


Hi Marina,

FYI I saw a repeatable crash in the contrib regression tests when
running make check-world with this patch applied.

test hstore_plperl... FAILED (test process exited with exit code 2)
test hstore_plperlu   ... FAILED (test process exited with exit code 2)
test create_transform ... FAILED

I'm not sure why it passes for you but fails here, but we can see from
the backtrace[1] that ExecInitExprRec is receiving a null node pointer
on this Ubuntu Trusty GCC 4.8 amd64 system.

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/337255374


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Query running for very long time (server hanged) with parallel append

2018-02-05 Thread Amit Khandekar
On 2 February 2018 at 20:46, Robert Haas  wrote:
> On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar  wrote:
>> The query is actually hanging because one of the workers is in a small
>> loop where it iterates over the subplans searching for unfinished
>> plans, and it never comes out of the loop (it's a bug which I am yet
>> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
>> each iteration; it's a small loop that does not pass control to any
>> other functions .
>
> Uh, sounds like we'd better fix that bug.


The scenario is this: one of the workers, w1, hasn't yet chosen the
first plan, and all the plans are already finished. So w1 has its
node->as_whichplan equal to -1, and the below condition in
choose_next_subplan_for_worker() never becomes true for this worker:

if (pstate->pa_next_plan == node->as_whichplan)
{
   /* We've tried everything! */
   pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
   LWLockRelease(&pstate->pa_lock);
   return false;
}

What I think is: we should remember which plan we started the search
with. So, before the loop starts, we save the starting plan value
like this:
   initial_plan = pstate->pa_next_plan;

And then break out of the loop when we come back to this initial plan:
if (initial_plan == pstate->pa_next_plan)
   break;

Now, suppose it so happens that initial_plan is a non-partial plan.
And then when we wrap around to the first partial plan, we check
(initial_plan == pstate->pa_next_plan) which will never become true
because initial_plan is less than first_partial_plan.

So what I have done in the patch is : have a flag 'should_wrap_around'
indicating that we should not wrap around. This flag is true when
initial_plan is a non-partial plan, in which case we know that we will
have covered all plans by the time we reach the last plan. So break
from the loop if this flag is false, or if we have reached the initial
plan.

Attached is a patch that fixes this issue on the above lines.

>
>> But I am not sure about this : while the workers are at it, why the
>> backend that is waiting for the workers does not come out of the wait
>> state with a SIGINT. I guess the same issue has been discussed in the
>> mail thread that you pointed.
>
> Is it getting stuck here?
>
> /*
>  * We can't finish transaction commit or abort until all of the workers
>  * have exited.  This means, in particular, that we can't respond to
>  * interrupts at this stage.
>  */
> HOLD_INTERRUPTS();
> WaitForParallelWorkersToExit(pcxt);
> RESUME_INTERRUPTS();

Actually the backend is getting stuck in
choose_next_subplan_for_leader(), in LWLockAcquire(), waiting for the
hanging worker to release the pstate->pa_lock. I think there is
nothing wrong in this, because it is assumed that an LWLock wait is
going to be very short; it is only because of this bug that the LWLock
wait lasts forever.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


fix_hang_issue.patch
Description: Binary data


Re: JIT compiling with LLVM v9.1

2018-02-05 Thread Pierre Ducroquet
On Sunday, February 4, 2018 12:45:50 AM CET Andreas Karlsson wrote:
> On 02/02/2018 10:48 AM, Pierre Ducroquet wrote:
> > I have successfully built the JIT branch against LLVM 4.0.1 on Debian
> > testing. This is not enough for Debian stable (LLVM 3.9 is the latest
> > available there), but it's a first step.
> > I've split the patch in four files. The first three fix the build issues,
> > the last one fixes a runtime issue.
> > I think they are small enough to not be a burden for you in your
> > developments. But if you don't want to carry these ifdefs right now, I
> > maintain them in a branch on a personal git and rebase as frequently as I
> > can.
> 
> I tested these patches and while the code built for me and passed the
> test suite on Debian testing I have a weird bug where the very first
> query fails to JIT while the rest work as they should. I think I need to
> dig into LLVM's codebase to see what it is, but can you reproduce this
> bug at your machine?
> 
> Code to reproduce:
> 
> SET jit_expressions = true;
> SET jit_above_cost = 0;
> SELECT 1;
> SELECT 1;
> 
> Output:
> 
> postgres=# SELECT 1;
> ERROR:  failed to jit module
> postgres=# SELECT 1;
>   ?column?
> --
>  1
> (1 row)
> 
> Config:
> 
> Version: Your patches applied on top of
> 302b7a284d30fb0e00eb5f0163aa933d4d9bea10
> OS: Debian testing
> llvm/clang: 4.0.1-8
> 
> Andreas


I have fixed the patches, I was wrong on 'guessing' the migration of the API 
for one function.
I have rebuilt the whole patch set. It is still based on 302b7a284d and has 
been tested with both LLVM 3.9 and 4.0 on Debian testing.

Thanks for your feedback !
>From ebf577fc6b1e74ddb2f6b9aef1c2632ab807dd70 Mon Sep 17 00:00:00 2001
From: Pierre 
Date: Fri, 2 Feb 2018 09:11:55 +0100
Subject: [PATCH 1/6] Add support for LLVM4 in llvmjit.c

---
 src/backend/lib/llvmjit.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/backend/lib/llvmjit.c b/src/backend/lib/llvmjit.c
index 8e5ba94c98..046cd53ef3 100644
--- a/src/backend/lib/llvmjit.c
+++ b/src/backend/lib/llvmjit.c
@@ -230,12 +230,19 @@ llvm_get_function(LLVMJitContext *context, const char *funcname)
 
 		addr = 0;
 		if (LLVMOrcGetSymbolAddressIn(handle->stack, &addr, handle->orc_handle, mangled))
-			elog(ERROR, "failed to lookup symbol");
+			elog(ERROR, "failed to lookup symbol %s", mangled);
 		if (addr)
 			return (void *) addr;
 	}
 #endif
 
+#if LLVM_VERSION_MAJOR < 5
+	if ((addr = LLVMOrcGetSymbolAddress(llvm_opt0_orc, mangled)))
+		return (void *) addr;
+	if ((addr = LLVMOrcGetSymbolAddress(llvm_opt3_orc, mangled)))
+		return (void *) addr;
+	elog(ERROR, "failed to lookup symbol %s for %s", mangled, funcname);
+#else
 	if (LLVMOrcGetSymbolAddress(llvm_opt0_orc, &addr, mangled))
 		elog(ERROR, "failed to lookup symbol");
 	if (addr)
@@ -244,7 +251,7 @@ llvm_get_function(LLVMJitContext *context, const char *funcname)
 		elog(ERROR, "failed to lookup symbol");
 	if (addr)
 		return (void *) addr;
-
+#endif
 	elog(ERROR, "failed to JIT: %s", funcname);
 
 	return NULL;
@@ -380,11 +387,18 @@ llvm_compile_module(LLVMJitContext *context)
 	 * faster instruction selection mechanism is used.
 	 */
 	{
-		LLVMSharedModuleRef smod;
 		instr_time tb, ta;
 
 		/* emit the code */
 		INSTR_TIME_SET_CURRENT(ta);
+#if LLVM_VERSION_MAJOR < 5
+		orc_handle = LLVMOrcAddEagerlyCompiledIR(compile_orc, context->module,
+			 llvm_resolve_symbol, NULL);
+		// It seems there is no error return from that function in LLVM < 5.
+#else
+		LLVMSharedModuleRef smod;
+
+		LLVMSharedModuleRef smod;
 		smod = LLVMOrcMakeSharedModule(context->module);
 		if (LLVMOrcAddEagerlyCompiledIR(compile_orc, &orc_handle, smod,
 		llvm_resolve_symbol, NULL))
@@ -392,6 +406,7 @@ llvm_compile_module(LLVMJitContext *context)
 			elog(ERROR, "failed to jit module");
 		}
 		LLVMOrcDisposeSharedModuleRef(smod);
+#endif
 		INSTR_TIME_SET_CURRENT(tb);
 		INSTR_TIME_SUBTRACT(tb, ta);
 		ereport(DEBUG1, (errmsg("time to emit: %.3fs",
-- 
2.15.1

>From f90c156c9b26eda38d3a1312c3de87ffba50340a Mon Sep 17 00:00:00 2001
From: Pierre 
Date: Fri, 2 Feb 2018 09:13:40 +0100
Subject: [PATCH 2/6] Add LLVM4 support in llvmjit_error.cpp

---
 src/backend/lib/llvmjit_error.cpp | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/lib/llvmjit_error.cpp b/src/backend/lib/llvmjit_error.cpp
index 70cecd114b..04e51b2a31 100644
--- a/src/backend/lib/llvmjit_error.cpp
+++ b/src/backend/lib/llvmjit_error.cpp
@@ -56,7 +56,9 @@ llvm_enter_fatal_on_oom(void)
 	if (fatal_new_handler_depth == 0)
 	{
 		old_new_handler = std::set_new_handler(fatal_system_new_handler);
+#if LLVM_VERSION_MAJOR > 4
 		llvm::install_bad_alloc_error_handler(fatal_llvm_new_handler);
+#endif
 		llvm::install_fatal_error_handler(fatal_llvm_error_handler);
 	}
 	fatal_new_handler_depth++;
@@ -72,7 +74,9 @@ llvm_leave_fatal_on_oom(void)
 	if (fatal_new_handler_depth == 0)
 	{
 		std::set_new_handler(old_new_handler);
+#if LLVM_V

Re: non-bulk inserts and tuple routing

2018-02-05 Thread Etsuro Fujita

(2018/02/05 14:34), Amit Langote wrote:

On 2018/02/02 19:56, Etsuro Fujita wrote:

* ExecInitPartitionResultRelInfo is called from ExecFindPartition, but we
could call that another way; in ExecInsert/CopyFrom we call that after
ExecFindPartition if the partition chosen by ExecFindPartition has not
been initialized yet.  Maybe either would be OK, but I like that because I
think that would not only better divide that labor but better fit into the
existing code in ExecInsert/CopyFrom IMO.


I see no problem with that, so done that way.


Thanks.


* In ExecInitPartitionResultRelInfo:
+   /*
+* Note that the entries in this list appear in no predetermined
+* order as result of initializing partition result rels as and when
+* they're needed.
+*/
+   estate->es_leaf_result_relations =
+ lappend(estate->es_leaf_result_relations,
+   leaf_part_rri);

Is it OK to put this within the "if (leaf_part_rri == NULL)" block?


Good catch.  I moved it outside the block.  I was under the impression
that leaf result relations that were reused from the
mtstate->resultRelInfo array would have already been added to the list,
but it seems they are not.


I commented this because the update-tuple-routing patch has added to the
list ResultRelInfos reused from the mtstate->resultRelInfo array, but
on reflection I noticed this would cause an oddity in reporting execution
stats for partitions' triggers in EXPLAIN ANALYZE.  Here is an example
using HEAD:


postgres=# create table trigger_test (a int, b text) partition by list (a);
CREATE TABLE
postgres=# create table trigger_test1 partition of trigger_test for values in (1);
CREATE TABLE
postgres=# create table trigger_test2 partition of trigger_test for values in (2);
CREATE TABLE
postgres=# create trigger before_upd_row_trig before update on trigger_test1 for each row execute procedure trigger_data (23, 'skidoo');
CREATE TRIGGER
postgres=# create trigger before_del_row_trig before delete on trigger_test1 for each row execute procedure trigger_data (23, 'skidoo');
CREATE TRIGGER
postgres=# insert into trigger_test values (1, 'foo');
INSERT 0 1
postgres=# explain analyze update trigger_test set a = 2 where a = 1;
NOTICE:  before_upd_row_trig(23, skidoo) BEFORE ROW UPDATE ON trigger_test1
NOTICE:  OLD: (1,foo),NEW: (2,foo)
NOTICE:  before_del_row_trig(23, skidoo) BEFORE ROW DELETE ON trigger_test1
NOTICE:  OLD: (1,foo)
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Update on trigger_test  (cost=0.00..25.88 rows=6 width=42) (actual time=2.337..2.337 rows=0 loops=1)
   Update on trigger_test1
   ->  Seq Scan on trigger_test1  (cost=0.00..25.88 rows=6 width=42) (actual time=0.009..0.011 rows=1 loops=1)
         Filter: (a = 1)
 Planning time: 0.186 ms
 Trigger before_del_row_trig on trigger_test1: time=0.495 calls=1
 Trigger before_upd_row_trig on trigger_test1: time=0.870 calls=1
 Trigger before_del_row_trig on trigger_test1: time=0.495 calls=1
 Trigger before_upd_row_trig on trigger_test1: time=0.870 calls=1
 Execution time: 2.396 ms
(10 rows)

Both trigger stats for the on-update and on-delete triggers are doubly 
shown in the above output.  The reason would be that 
ExplainPrintTriggers called report_triggers twice for trigger_test1's 
ResultRelInfo: once for it from queryDesc->estate->es_result_relations 
and once for it from queryDesc->estate->es_leaf_result_relations.  I 
don't think this is intended behavior, so I think we should fix this.  I 
think we could probably address this by modifying ExecInitPartitionInfo 
in your patch so that it doesn't add to the es_leaf_result_relations 
list ResultRelInfos reused from the mtstate->resultRelInfo array, as in
your previous version of the patch.  (ExecGetTriggerResultRel looks at 
the list too, but it would probably work well for this change.)  It 
might be better to address this in another patch, though.



* In the same function:
+   /*
+* Verify result relation is a valid target for INSERT.
+*/
+   CheckValidResultRel(leaf_part_rri, CMD_INSERT);

I think it would be better to leave the previous comments as-is here:

 /*
  * Verify result relation is a valid target for an INSERT.  An UPDATE
  * of a partition-key becomes a DELETE+INSERT operation, so this
check
  * is still required when the operation is CMD_UPDATE.
  */


Oops, my bad.  Didn't notice that I had ended up removing the part about
UPDATE.


OK.


* ExecInitPartitionResultRelInfo does the work other than the
initialization of ResultRelInfo for the chosen partition (eg, create a
tuple conversion map to convert a tuple routed to the partition from the
parent's type to the partition's).  So I'd propose to rename that function
to eg, ExecInitPartition.


I went with ExecInitPartitionInfo.


Fine wit

Add PGDLLIMPORT to enable_hashagg

2018-02-05 Thread Metin Doslu
Hey all,

There was already a discussion and commit for adding PGDLLIMPORT to some
variables which enables extensions to use them on Windows builds. For
reference, the previous thread:"Add PGDLLIMPORT lines to some variables".

I would like to add PGDLLIMPORT to enable_hashagg for the same reason. I'm
adding a very simple patch. Please let me know if I'm missing anything
given that this is my first patch submission.

Best,
Metin


add_pgdllimport-v1.patch
Description: Binary data


Re: [HACKERS] Pluggable storage

2018-02-05 Thread Haribabu Kommi
On Tue, Jan 9, 2018 at 11:42 PM, Haribabu Kommi 
wrote:

>
> Updated patches are attached.
>

To integrate the columnar store with the pluggable storage API, I found that
there are a couple of other things that also need to be supported.

1. Choosing the right table access method for a particular table?

I am thinking of adding a new table option to let the user select the
table access method that they want for the table. HEAP is the default
access method. This approach is simple and doesn't need any syntax changes.

Or is it fine to add a "USING method" syntax to CREATE TABLE, similar to
CREATE INDEX?

comments?

2. The columnar storage needs many other relations that need to be
created along with the main relation.

These extra relations are internal to the storage and shouldn't be visible
directly from pg_class; they will be stored in storage-specific
catalog tables. A dependency is created on the original table, as these
storage-specific tables must be created/dropped/altered whenever there is
a change to the original table.

Is it fine to add a new API to get control while creating/altering/dropping
the table, or to use only the existing ProcessUtility hook?


Regards,
Hari Babu
Fujitsu Australia


Re: PQHost() undefined behavior if connecting string contains both host and hostaddr types

2018-02-05 Thread Haribabu Kommi
On Sun, Jan 14, 2018 at 9:44 PM, Michael Paquier 
wrote:

> On Sun, Jan 14, 2018 at 02:19:26PM +1100, Haribabu Kommi wrote:
> > While working on [1], we found an inconsistency in PQhost() behavior
> > if the connecting string that is passed to connect to the server contains
> > multiple hosts with both host and hostaddr types. For example,
> >
> > host=host1,host2 hostaddr=127.0.0.1,127.0.0.1 port=5434,5432
> >
> > As the hostaddr is given preference when both host and hostaddr is
> > specified, so the connection type for both addresses of the above
> > conninfo is CHT_HOST_ADDRESS. So the PQhost() returns the
> > conn->pghost value i.e "host1,host2" instead of the actual host that
> > is connected.
>
> During the discussion of adding the client-side connection parameters to
> pg_stat_wal_receiver, which is useful when the client specifies multiple
> hosts and ports, it has been discussed that introducing a new API in
> libpq to get effective host, hostaddr and port values would be necessary
> in order to get all the useful information wanted, however this has a
> larger impact than initially thought as any user showing the host
> information in psql's PROMPT would be equally confused. Any caller of
> PQhost have the same problem.
>
> > Instead of checking the connection type while returning the host
> > details, it should check whether the host is NULL or not. With this
> > change it returns the expected value for all the connection types.
>
> I mentioned that on the other thread, but this seems like an improvement
> to me, which leads to less confusion. See here for more details
> regarding what we get today on HEAD:
> https://www.postgresql.org/message-id/20180109011547.GE76418%40paquier.xyz
>


In the other thread about enhancing pg_stat_wal_receiver to display the remote
host, mail [1] provides the details of why the hostaddr handling was added and
reverted, and of how the PQhost() function can now be changed to return the
host and not the hostaddr. It would be useful to reach some decision on the
PQhost() function.

Added to the next open commitfest.

[1] -
https://www.postgresql.org/message-id/CAJrrPGctG2WYRXE9XcKf%2B75xSWH-vBgvizzgcLC5M0KuD3nSYQ%40mail.gmail.com


Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Robert Haas
On Fri, Feb 2, 2018 at 10:26 PM, Peter Geoghegan  wrote:
> My proposed commit message
> has a full explanation of the Valgrind issue, which I won't repeat
> here. Go read it before reading the rest of this e-mail.

I'm going to paste the first two sentences of your proposed commit
message in here for the convenience of other readers, since I want to
reply to them.

# LogicalTapeFreeze() may write out its first block when it is dirty but
# not full, and then immediately read the first block back in from its
# BufFile as a BLCKSZ-width block.  This can only occur in rare cases
# where next to no tuples were written out, which is only possible with
# parallel external tuplesorts.

So, if I understand correctly what you're saying here, valgrind is
totally cool with us writing out an only-partially-initialized block
to a disk file, but it's got a real problem with us reading that data
back into the same memory space it already occupies.  That's a little
odd.  I presume that it's common for the tail of the final block
written to be uninitialized, but normally when we then go read block
0, that's some other, fully initialized block.

It seems like it would be pretty easy to just suppress the useless
read when we've already got the correct data, and I'd lean toward
going that direction since it's a valid optimization anyway.  But I'd
like to hear some opinions from people who use and think about
valgrind more than I do (Tom, Andres, Noah, ...?).

> It might seem like my suppression is overly broad, or not broad
> enough, since it essentially targets LogicalTapeFreeze(). I don't
> think it is, though, because this can occur in two places within
> LogicalTapeFreeze() -- it can occur in the place we actually saw the
> issue on lousyjack, from the ltsReadBlock() call within
> LogicalTapeFreeze(), as well as a second place -- when
> BufFileExportShared() is called. I found that you have to tweak code
> to prevent it happening in the first place before you'll see it happen
> in the second place.

I don't quite see how that would happen, because BufFileExportShared,
at least AFAICS, doesn't touch the buffer?

Unfortunately valgrind does not work at all on my laptop -- the server
appears to start, but as soon as you try to connect, the whole thing
dies with an error claiming that the startup process has failed.  So I
can't easily test this at the moment.  I'll try to get it working,
here or elsewhere, but thought I'd send the above reply first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-05 Thread Claudio Freire
On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada  wrote:
> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire  
> wrote:
>> After autovacuum gets cancelled, the next time it wakes up it will
>> retry vacuuming the cancelled relation. That's because a cancelled
>> autovacuum doesn't update the last-vacuumed stats.
>>
>> So the timing between an autovacuum work item and the next retry for
>> that relation is more or less an autovacuum nap time, except perhaps
>> in the case where many vacuums get cancelled, and they have to be
>> queued.
>
> I think that's not true if there are multiple databases.

I'd have to re-check.

The loop is basically, IIRC:

while(1) { vacuum db ; work items ; nap }

Now, if that's vacuum one db, not all, and if the decision on the
second run doesn't pick the same db because that big table failed to
be vacuumed, then you're right.

In that case we could add the FSM vacuum as a work item *in addition*
to what this patch does. If the work item queue is full and the FSM
vacuum doesn't happen, it'd be no worse than with the patch as-is.

Is that what you suggest?
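
For concreteness, queuing such a work item might look roughly like the
sketch below. AVW_VacuumFSM is a hypothetical work-item type (the only one
that exists today is AVW_BRINSummarizeRange), so both it and a matching
branch in do_autovacuum() would have to be added:

/*
 * Hypothetical: ask autovacuum to vacuum this relation's FSM on a later
 * cycle, instead of (or in addition to) doing it inside this vacuum.
 */
AutoVacuumRequestWork(AVW_VacuumFSM, RelationGetRelid(rel), InvalidBlockNumber);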

With that in mind, I'm noticing WorkItems have an avw_database field that
isn't checked by do_autovacuum. Is that right? Shouldn't only work
items that belong to the database being autovacuumed be processed?

>> That's why an initial FSM vacuum makes sense. It has a similar timing
>> to the autovacuum work item, it has the benefit that it can be
>> triggered manually (manual vacuum), and it's cheap and efficient.
>>
>>> Also the patch always vacuums fsm at beginning of the vacuum with a
>>> threshold but it is useless if the table has been properly vacuumed. I
>>> don't think it's good idea to add an additional step that "might be"
>>> efficient, because current vacuum is already heavy.
>>
>> FSMs are several orders of magnitude smaller than the tables
>> themselves. A TB-sized table I have here has a 95M FSM. If you add
>> threshold skipping, that initial FSM vacuum *will* be efficient.
>>
>> By definition, the FSM will be less than 1/4000th of the table, so
>> that initial FSM pass takes less than 1/4000th of the whole vacuum.
>> Considerably less considering the simplicity of the task.
>
> I agree the fsm is much smaller than the heap and vacuum on fsm will not
> be comparatively heavy but I'm afraid that the vacuum will get more
> heavy in the future if we pile up such improvement that are small but
> might not be efficient. For example, a feature for reporting the last
> vacuum status has been proposed[1]. I wonder if we can use it to
> determine whether we do the fsm vacuum at beginning of vacuum.

Yes, such a feature would allow skipping that initial FSM vacuum. That
can be improved in a future patch if that proposed feature gets
merged. This patch can be treated independently from that in the
meanwhile, don't you think?



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-05 Thread Claudio Freire
On Mon, Feb 5, 2018 at 2:55 PM, Claudio Freire  wrote:
> With that in mind, I'm noticing WorkItems have an avw_database field that
> isn't checked by do_autovacuum. Is that right? Shouldn't only work
> items that belong to the database being autovacuumed be processed?

NVM. I had to git pull, it's fixed in head.



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 9:43 AM, Robert Haas  wrote:
> # LogicalTapeFreeze() may write out its first block when it is dirty but
> # not full, and then immediately read the first block back in from its
> # BufFile as a BLCKSZ-width block.  This can only occur in rare cases
> # where next to no tuples were written out, which is only possible with
> # parallel external tuplesorts.
>
> So, if I understand correctly what you're saying here, valgrind is
> totally cool with us writing out an only-partially-initialized block
> to a disk file, but it's got a real problem with us reading that data
> back into the same memory space it already occupies.

That's not quite it. Valgrind is cool with a BufFileWrite(), which
doesn't result in an actual write() because the buffile.c stdio-style
buffer (which isn't where the uninitialized bytes originate from)
isn't yet filled. The actual write() comes later, and that's the point
that Valgrind complains. IOW, Valgrind is cool with copying around
uninitialized memory before we do anything with the underlying values
(e.g., write(), something that affects control flow).
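
A tiny standalone program illustrates the behavior being described (run it
under valgrind; nothing here is PostgreSQL-specific, it just mimics "copy
uninitialized bytes around, then hand them to write()"):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char	   *src = malloc(8192);		/* contents are undefined */
	char	   *buf = malloc(8192);

	/* No complaint here: copying merely propagates the undefined state. */
	memcpy(buf, src, 8192);

	/* Memcheck complains here: write() actually uses the undefined bytes. */
	(void) write(STDOUT_FILENO, buf, 8192);

	free(src);
	free(buf);
	return 0;
}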

> I presume that it's common for the tail of the final block
> written to be uninitialized, but normally when we then go read block
> 0, that's some other, fully initialized block.

It certainly is common. In the case of logtape.c, we almost always
write out some garbage bytes, even with serial sorts. The only
difference here is the *sense* in which they're garbage: they're
uninitialized bytes, which Valgrind cares about, rather than byte from
previous writes that are left behind in the buffer, which Valgrind
does not care about.

>> It might seem like my suppression is overly broad, or not broad
>> enough, since it essentially targets LogicalTapeFreeze(). I don't
>> think it is, though, because this can occur in two places within
>> LogicalTapeFreeze() -- it can occur in the place we actually saw the
>> issue on lousyjack, from the ltsReadBlock() call within
>> LogicalTapeFreeze(), as well as a second place -- when
>> BufFileExportShared() is called. I found that you have to tweak code
>> to prevent it happening in the first place before you'll see it happen
>> in the second place.
>
> I don't quite see how that would happen, because BufFileExportShared,
> at least AFAICS, doesn't touch the buffer?

It doesn't have to -- at least not directly. Valgrind remembers that
the uninitialized memory from logtape.c buffers is poisoned -- it
"spreads". The knowledge that the bytes are poisoned is tracked as
they're copied around. You get the error on the write() from the
BufFile buffer, despite the fact that you can make the error go away
by using palloc0() instead of palloc() within logtape.c, and nowhere
else.

-- 
Peter Geoghegan



Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2018-02-05 Thread Andreas Karlsson
The patch looks good to me now other than some smaller issues, mostly in 
the documentation. If you fix these I will set it as ready for 
committer, and let a committer chime in on the casting logic which we 
both find a bit ugly.


== Comments

The only somewhat bigger issue I see is the error messages. Can they be
improved?


For example:

CREATE TABLE t1 (a int, b int, PrIMARY KEY (a, b));
CREATE TABLE t2 (a int, bs int8[], FOREIGN KEY (a, EACH ELEMENT OF bs) 
REFERENCES t1 (a, b));

INSERT INTO t2 VALUES (0, ARRAY[1, 2]);

Results in:

ERROR:  insert or update on table "t2" violates foreign key constraint 
"t2_a_fkey"

DETAIL:  Key (a, bs)=(0, {1,2}) is not present in table "t1".

Would it be possible to make the DETAIL something like the below 
instead? Do you think my suggested error message is clear?


I am imaging something like the below as a patch. Does it look sane? The 
test cases need to be updated at least.


diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 402bde19d4..9dc7c9812c 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -3041,6 +3041,10 @@ ri_ReportViolation(const RI_ConstraintInfo *riinfo,
 			appendStringInfoString(&key_names, ", ");
 			appendStringInfoString(&key_values, ", ");
 		}
+
+		if (riinfo->fk_reftypes[idx] == FKCONSTR_REF_EACH_ELEMENT)
+			appendStringInfoString(&key_names, "EACH ELEMENT OF ");
+
 		appendStringInfoString(&key_names, name);
 		appendStringInfoString(&key_values, val);
 	}


DETAIL:  Key (a, EACH ELEMENT OF bs)=(0, {1,2}) is not present in table 
"t1".


-  REFERENCES <replaceable class="parameter">reftable</replaceable> [ ( <replaceable class="parameter">refcolumn</replaceable> ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
+  [EACH ELEMENT OF] REFERENCES <replaceable class="parameter">reftable</replaceable> [ ( <replaceable class="parameter">refcolumn</replaceable> ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]


I think this documentation in doc/src/sgml/ref/create_table.sgml should 
be removed since it is no longer correct.


+ 
+  Foreign Key Arrays are an extension
+  of PostgreSQL and are not included in the SQL standard.
+ 

This paragraph and some of the others you added to 
doc/src/sgml/ref/create_table.sgml are strangely line wrapped.


+   
+ON DELETE CASCADE
+
+ 
+  Same as standard foreign key constraints. Deletes the entire 
array.

+ 
+
+   
+  

I thought this was no longer supported.

+   however, this syntax will cast ftest1 to int4 upon RI checks, 
thus defeating the

+  purpose of the index.

There is a minor indentation error on these lines.

+
+   
+ Array ELEMENT foreign keys and the ELEMENT
+ REFERENCES clause are a PostgreSQL

+ extension.
+   

We do not have any ELEMENT REFERENCES clause.

-   ereport(ERROR,
-   (errcode(ERRCODE_DATATYPE_MISMATCH),
-errmsg("foreign key constraint \"%s\" "
-   "cannot be implemented",
-   fkconstraint->conname),
-errdetail("Key columns \"%s\" and \"%s\" "
-  "are of incompatible types: %s and %s.",
-  strVal(list_nth(fkconstraint->fk_attrs, i)),
-  strVal(list_nth(fkconstraint->pk_attrs, i)),
-  format_type_be(fktype),
-  format_type_be(pktype))));
+   ereport(ERROR,
+   (errcode(ERRCODE_DATATYPE_MISMATCH),
+errmsg("foreign key constraint \"%s\" "
+   "cannot be implemented",
+   fkconstraint->conname),
+errdetail("Key columns \"%s\" and \"%s\" "
+  "are of incompatible types: %s and %s.",
+  strVal(list_nth(fkconstraint->fk_attrs, i)),
+  strVal(list_nth(fkconstraint->pk_attrs, i)),
+  format_type_be(fktype),
+  format_type_be(pktype))));

It seems like you accidentally changed the indentation here.

Andreas



Warning when building man pages

2018-02-05 Thread Andreas Karlsson

Hi,

I get a warning when building the man pages on Debian testing. The 
warning is in the documentation for CREATE POLICY and is caused by the 
three footnotes in the sql-createpolicy-summary table.


I do not understand our documentation pipeline well enough to know what 
the proper fix would be.


Error message:

Element sup in namespace '' encountered in a, but no template matches.
Element sup in namespace '' encountered in a, but no template matches.
Element sup in namespace '' encountered in a, but no template matches.
Note: Writing man7/CREATE_POLICY.7

Package versions:

docbook  4.5-6
docbook-dsssl1.79-9.1
docbook-xml  4.5-8
docbook-xsl  1.79.1+dfsg-2
openjade 1.4devel1-21.3+b1
opensp   1.5.2-13+b1
xsltproc 1.1.29-5

Andreas



Re: WIP: BRIN multi-range indexes

2018-02-05 Thread Mark Dilger

> BTW while working on the regression tests, I've noticed that brin.sql
> fails to test a couple of minmax opclasses (e.g. abstime/reltime). Is
> that intentional or is that something we should fix eventually?

I believe abstime/reltime are deprecated.  Perhaps nobody wanted to
bother adding test coverage for deprecated classes?  There was another
thread that discussed removing these types.  The consensus seemed to
be in favor of removing them, though I have not seen a patch for that yet.

mark




Re: Warning when building man pages

2018-02-05 Thread Peter Eisentraut
On 2/5/18 14:50, Andreas Karlsson wrote:
> Error message:
> 
> Element sup in namespace '' encountered in a, but no template matches.
> Element sup in namespace '' encountered in a, but no template matches.
> Element sup in namespace '' encountered in a, but no template matches.
> Note: Writing man7/CREATE_POLICY.7

Yes, everyone is getting that.  It has to do with the table that was
added to the CREATE POLICY man page.  I'll look into fixing it.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-02-05 Thread Andrey Borodin
Hi, Michail!

Thanks for the patch!

> 1 февр. 2018 г., в 1:17, Michail Nikolaev  
> написал(а):
> 
> Hello.
> 
> WIP-Patch for optimisation of OFFSET + IndexScan using visibility map.

While the patch seems to me a useful improvement, I see a few problems with the code:
1. Both branches of "if (node->iss_tuples_skipped_remaning != 0)" seem too
similar. There are a lot of duplicate comments etc. I think that this branch
should be refactored to avoid code duplication.
2. Most of the comments are not formatted per project style.

Besides this, the patch looks good. Please add it to the following commitfest so
that work on the patch can be tracked.

Best regards, Andrey Borodin.


Re: [HACKERS] A design for amcheck heapam verification

2018-02-05 Thread Peter Geoghegan
On Mon, Jan 22, 2018 at 6:07 PM, Michael Paquier
 wrote:
> Yep. I have provided the feedback I wanted for 0001 (no API change in
> the bloom facility by the way :( ), but I still wanted to look at 0002
> in depths.

I don't see a point in adding complexity that no caller will actually
use. It *might* become useful in the future, in which case it's no
trouble at all to come up with an alternative initialization routine.

Anyway, parallel CREATE INDEX added a new "scan" argument to
IndexBuildHeapScan(), which caused this patch to bitrot. At a minimum,
an additional NULL argument should be passed by amcheck. However, I
have a better idea.

ISTM that verify_nbtree.c should manage the heap scan itself, in the
style of parallel CREATE INDEX. It can acquire its own MVCC snapshot
for bt_index_check() (which pretends to be a CREATE INDEX
CONCURRENTLY). There can be an MVCC snapshot acquired per index
verified, a snapshot that is under the direct control of amcheck. The
snapshot would be acquired at the start of verification on an index
(when "heapallindexed = true"), before the verification of the index
structure even begins, and released at the very end of verification
(a rough sketch follows the list of advantages below).
Advantages of this include:

1. It simplifies the code, and in particular lets us remove the use of
TransactionXmin. Comments already say that this TransactionXmin
business is a way of approximating our own MVCC snapshot acquisition
-- why not *actually do* the MVCC snapshot acquisition, now that
that's possible?

2. It makes the code for bt_index_check() and bt_index_parent_check()
essentially identical, except for varying
IndexBuildHeapScan()/indexInfo parameters to match what CREATE INDEX
itself does. The more we can outsource to IndexBuildHeapScan(), the
better.

3. It ensures that transactions that heapallindexed verify many
indexes in one go are at no real disadvantage to transactions that
heapallindexed verify a single index, since TransactionXmin going
stale won't impact verification (we won't have to skip Bloom filter
probes due to the uncertainty about what should be in the Bloom
filter).

4. It will make parallel verification easier in the future, which is
something that we ought to make happen. Parallel verification would
target a table with multiple indexes, and use a parallel heap scan. It
actually looks like making this work would be fairly easy. We'd only
need to copy code from nbtsort.c, and arrange for parallel workers to
verify an index each ahead of the heap scan. (There would be multiple
Bloom filters in shared memory, all of which parallel workers end up
probing.)
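
As promised above, here is a rough sketch of the per-index shape -- only a
sketch: heaprel/indexrel/indexinfo/state and the callback name are
placeholders, the exact IndexBuildHeapScan() arguments would follow whatever
CREATE INDEX passes, and IndexBuildHeapScan() is assumed to end the scan it
is handed (as it does for a scan it opens itself), so only the snapshot
needs explicit cleanup:

	/* Per-index verification, under amcheck's own MVCC snapshot */
	Snapshot	snapshot;
	HeapScanDesc scan;

	snapshot = RegisterSnapshot(GetTransactionSnapshot());
	scan = heap_beginscan_strat(heaprel, snapshot,
								0, NULL,	/* no scan keys */
								true,		/* buffer access strategy OK */
								false);		/* no syncscan: stable start point */

	/* bt_check_callback/state stand in for the Bloom filter probing */
	IndexBuildHeapScan(heaprel, indexrel, indexinfo, true,
					   bt_check_callback, (void *) state, scan);

	UnregisterSnapshot(snapshot);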

Thoughts?

-- 
Peter Geoghegan



Re: WIP: BRIN multi-range indexes

2018-02-05 Thread Tomas Vondra
On 02/05/2018 09:27 PM, Mark Dilger wrote:
> 
>> BTW while working on the regression tests, I've noticed that brin.sql
>> fails to test a couple of minmax opclasses (e.g. abstime/reltime). Is
>> that intentional or is that something we should fix eventually?
> 
> I believe abstime/reltime are deprecated.  Perhaps nobody wanted to
> bother adding test coverage for deprecated classes?  There was another
> thread that discussed removing these types.  The consensus seemed to
> be in favor of removing them, though I have not seen a patch for that yet.
> 

Yeah, that's what I've been wondering about too. There's also this
comment in nabstime.h:

/*
 * Although time_t generally is a long int on 64 bit systems, these two
 * types must be 4 bytes, because that's what pg_type.h assumes. They
 * should be yanked (long) before 2038 and be replaced by timestamp and
 * interval.
 */

But then why add BRIN opclasses at all? And if we add them, why not
test them? We all know how long deprecation takes, particularly for
data types.

For me the question is whether to bother with adding the multi-minmax
opclasses, of course.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: JIT compiling with LLVM v9.1

2018-02-05 Thread Andreas Karlsson

OK that fixed the issue, but you have a typo in your patch set.

diff --git a/src/backend/lib/llvmjit_inline.cpp b/src/backend/lib/llvmjit_inline.cpp
index a785261bea..51f38e10d2 100644
--- a/src/backend/lib/llvmjit_inline.cpp
+++ b/src/backend/lib/llvmjit_inline.cpp
@@ -37,7 +37,7 @@ extern "C"
 #include 
 #include 
 #include 
-#if LLVM_MAJOR_VERSION > 3
+#if LLVM_VERSION_MAJOR > 3
 #include 
 #else
 #include "llvm/Bitcode/ReaderWriter.h"

Also I get some warning. Not sure if they are from your patches or from 
Andres's.


llvmjit_error.cpp:118:1: warning: unused function 'fatal_llvm_new_handler' [-Wunused-function]
fatal_llvm_new_handler(void *user_data,
^
1 warning generated.
llvmjit_inline.cpp:114:6: warning: no previous prototype for function 'operator!' [-Wmissing-prototypes]
bool operator!(const llvm::ValueInfo &vi) {
     ^
1 warning generated.
psqlscanslash.l: In function ‘psql_scan_slash_option’:
psqlscanslash.l:550:8: warning: variable ‘lexresult’ set but not used [-Wunused-but-set-variable]
  int   final_state;
        ^

Andreas

On 02/05/2018 11:39 AM, Pierre Ducroquet wrote:

On Sunday, February 4, 2018 12:45:50 AM CET Andreas Karlsson wrote:

On 02/02/2018 10:48 AM, Pierre Ducroquet wrote:

I have successfully built the JIT branch against LLVM 4.0.1 on Debian
testing. This is not enough for Debian stable (LLVM 3.9 is the latest
available there), but it's a first step.
I've split the patch in four files. The first three fix the build issues,
the last one fixes a runtime issue.
I think they are small enough to not be a burden for you in your
developments. But if you don't want to carry these ifdefs right now, I
maintain them in a branch on a personal git and rebase as frequently as I
can.


I tested these patches and while the code built for me and passed the
test suite on Debian testing I have a weird bug where the very first
query fails to JIT while the rest work as they should. I think I need to
dig into LLVM's codebase to see what it is, but can you reproduce this
bug at your machine?

Code to reproduce:

SET jit_expressions = true;
SET jit_above_cost = 0;
SELECT 1;
SELECT 1;

Output:

postgres=# SELECT 1;
ERROR:  failed to jit module
postgres=# SELECT 1;
   ?column?
--
  1
(1 row)

Config:

Version: Your patches applied on top of
302b7a284d30fb0e00eb5f0163aa933d4d9bea10
OS: Debian testing
llvm/clang: 4.0.1-8

Andreas



I have fixed the patches, I was wrong on 'guessing' the migration of the API
for one function.
I have rebuilt the whole patch set. It is still based on 302b7a284d and has
been tested with both LLVM 3.9 and 4.0 on Debian testing.

Thanks for your feedback !





Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Robert Haas
On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan  wrote:
> It certainly is common. In the case of logtape.c, we almost always
> write out some garbage bytes, even with serial sorts. The only
> difference here is the *sense* in which they're garbage: they're
> uninitialized bytes, which Valgrind cares about, rather than byte from
> previous writes that are left behind in the buffer, which Valgrind
> does not care about.

/me face-palms.

So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
on the buffer.  "We know what we're doing, trust us!"

In some ways, that seems better than inserting a suppression, because
it only affects the memory in the buffer.
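
Concretely, something along these lines -- a sketch only, assuming the
tape's buffer pointer and allocated size are what logtape.c calls
lt->buffer and lt->buffer_size; the macro comes from utils/memdebug.h and
is a no-op in non-Valgrind builds:

	/*
	 * Hypothetical placement in LogicalTapeFreeze(), before the dirty
	 * first block is written out and immediately re-read: declare the
	 * whole buffer defined so Valgrind does not complain about the
	 * uninitialized padding bytes.
	 */
	VALGRIND_MAKE_MEM_DEFINED(lt->buffer, lt->buffer_size);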

Anybody else want to express an opinion here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Tels

On Mon, February 5, 2018 4:27 pm, Robert Haas wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan  wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than byte from
>> previous writes that are left behind in the buffer, which Valgrind
>> does not care about.
>
> /me face-palms.
>
> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer.  "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.
>
> Anybody else want to express an opinion here?

Are the uninitialized bytes that are written out "whatever was in the
memory previously" or just some "0x00 bytes from the allocation but not
yet overwritten from the PG code"?

Because the first sounds like it could be a security problem - if random
junk bytes go out to the disk, and stay there, information could
inadvertently leak to permanent storage.

Best regards,

Tels



Re: JIT compiling with LLVM v9.1

2018-02-05 Thread Pierre Ducroquet
On Monday, February 5, 2018 10:20:27 PM CET Andreas Karlsson wrote:
> OK that fixed the issue, but you have a typo in your patch set.
> 
> diff --git a/src/backend/lib/llvmjit_inline.cpp
> b/src/backend/lib/llvmjit_inline.cpp
> index a785261bea..51f38e10d2 100644
> --- a/src/backend/lib/llvmjit_inline.cpp
> +++ b/src/backend/lib/llvmjit_inline.cpp
> @@ -37,7 +37,7 @@ extern "C"
>   #include 
>   #include 
>   #include 
> -#if LLVM_MAJOR_VERSION > 3
> +#if LLVM_VERSION_MAJOR > 3
>   #include 
>   #else
>   #include "llvm/Bitcode/ReaderWriter.h"

Thanks, it's weird I had no issue with it. I will fix in the next patch set.

> Also I get some warning. Not sure if they are from your patches or from
> Andres's.
> 
> llvmjit_error.cpp:118:1: warning: unused function
> 'fatal_llvm_new_handler' [-Wunused-function]
> fatal_llvm_new_handler(void *user_data,
> ^
> 1 warning generated.
> llvmjit_inline.cpp:114:6: warning: no previous prototype for function
> 'operator!' [-Wmissing-prototypes]
> bool operator!(const llvm::ValueInfo &vi) {
>   ^
> 1 warning generated.

Both are mine, I knew about the first one, but I did not see the second one. I 
will fix them too, thanks for the review!

> psqlscanslash.l: In function ‘psql_scan_slash_option’:
> psqlscanslash.l:550:8: warning: variable ‘lexresult’ set but not used
> [-Wunused-but-set-variable]
>int   final_state;
>  ^

I'm not sure Andres's patches have anything to do with psql, it's surprising.




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 1:27 PM, Robert Haas  wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan  wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes from
>> previous writes that are left behind in the buffer, which Valgrind
>> does not care about.

I should clarify what I meant here -- it is very common when we have
to freeze a tape, like when we do a serial external randomAccess
tuplesort, or a parallel worker's tuplesort. It shouldn't happen
otherwise. Note that there is a general pattern of dumping out the
current buffer just as the next one is needed, in order to make sure
that the linked list pointer correctly points to the
next/soon-to-be-current block. Note also that the majority of routines
declared within logtape.c can only be used on frozen tapes.

I am pretty confident that I've scoped this correctly by targeting
LogicalTapeFreeze().

> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer.  "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.

I think that that would also work, and would be simpler, but also
slightly inferior to using the proposed suppression. If there is
garbage in logtape.c buffers, we still generally don't want to do
anything important on the basis of those values. We make one exception
with the suppression, which is a pretty typical kind of exception to
make -- don't worry if we write() poisoned bytes, since those are
bound to be alignment related.

OTOH, as I've said we are generally bound to write some kind of
logtape.c garbage, which will almost certainly not be of the
uninitialized memory variety. So, while I feel that the suppression is
better, the advantage is likely microscopic.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 1:39 PM, Tels  wrote:
> Are the uninitialized bytes that are written out "whatever was in the
> memory previously" or just some "0x00 bytes from the allocation but not
> yet overwritten from the PG code"?
>
> Because the first sounds like it could be a security problem - if random
> junk bytes go out to the disk, and stay there, information could
> inadvertedly leak to permanent storage.

But you can say the same thing about *any* of the
write()-of-uninitialized-bytes Valgrind suppressions that already
exist. There are quite a few of those.

That just isn't part of our security model.

-- 
Peter Geoghegan



Re: Crash in partition-wise join involving dummy partitioned relation

2018-02-05 Thread Robert Haas
On Mon, Feb 5, 2018 at 4:46 AM, Ashutosh Bapat
 wrote:
> Here's patch taking that approach.

I rewrote the comment in relation.h like this, which I think is more clear:

 /*
  * Is given relation partitioned?
  *
- * A join between two partitioned relations with same partitioning scheme
- * without any matching partitions will not have any partition in it but will
- * have partition scheme set. So a relation is deemed to be partitioned if it
- * has a partitioning scheme, bounds and positive number of partitions.
+ * It's not enough to test whether rel->part_scheme is set, because it might
+ * be that the basic partitioning properties of the input relations matched
+ * but the partition bounds did not.
+ *
+ * We treat dummy relations as unpartitioned.  We could alternatively
+ * treat them as partitioned, but it's not clear whether that's a useful thing
+ * to do.
  */

With that change, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: BRIN multi-range indexes

2018-02-05 Thread Tom Lane
Tomas Vondra  writes:
> Yeah, that's what I've been wondering about too. There's also this
> comment in nabstime.h:

> /*
>  * Although time_t generally is a long int on 64 bit systems, these two
>  * types must be 4 bytes, because that's what pg_type.h assumes. They
>  * should be yanked (long) before 2038 and be replaced by timestamp and
>  * interval.
>  */

> But then why adding BRIN opclasses at all? And if adding them, why not
> to test them? We all know how long deprecation takes, particularly for
> data types.

There was some pretty recent chatter about removing these types; IIRC
Andres was annoyed about their lack of overflow checks.

I would definitely vote against adding any BRIN support for these types,
or indeed doing any work on them at all other than removal.

regards, tom lane



psql tab completion vs transactions

2018-02-05 Thread Edmund Horner
Hi folks,

While working on tab completion for SELECT I found a few existing
problems with how psql's tab completion queries interact with
transactions.

  - If a tab completion query fails, it will abort the user's
transaction.  For example, typing, "ALTER PUBLICATION " when
connected to an older server version that doesn't have a
pg_publication table.

  - If the user's transaction is already failed, psql tab completions
that depend on queries won't work.

I made a patch to wrap tab completion queries in SAVEPOINT/ROLLBACK TO
SAVEPOINT.  This addresses the first problem, but not the second
(which is less critical IMHO).  Unfortunately I published it in the
same thread as SELECT tab completion, which may have conflated the two
pieces of work:

https://www.postgresql.org/message-id/CAMyN-kCr6aUFaE%3Dov%2B_r%3Db4nS_ihtmJ8-MLMoomZGWdNFgMaVw%40mail.gmail.com
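
The wrapping amounts to something like this (a simplified sketch rather
than the actual patch; the function name and the savepoint name are
invented here for illustration):

#include <libpq-fe.h>

static PGresult *
exec_completion_query(PGconn *conn, const char *query)
{
    PGresult   *result;

    /* Only bother with a savepoint inside an explicit transaction block. */
    if (PQtransactionStatus(conn) == PQTRANS_INTRANS)
    {
        PQclear(PQexec(conn, "SAVEPOINT _psql_tab_completion"));
        result = PQexec(conn, query);

        /*
         * Completion queries are read-only, so rolling back to the
         * savepoint is harmless on success, and on failure it keeps the
         * user's transaction from being aborted.
         */
        PQclear(PQexec(conn, "ROLLBACK TO SAVEPOINT _psql_tab_completion"));
        PQclear(PQexec(conn, "RELEASE SAVEPOINT _psql_tab_completion"));

        return result;
    }

    return PQexec(conn, query);
}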

In the meantime, another problem has appeared:

BEGIN;
SET TRANS<TAB>                  (User is hoping to avoid typing the
                                 following long command)
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query

This happens because tab completion for SET does a query on
pg_settings to get a list of variables to suggest.

I don't believe a savepoint will help here, as one cannot issue SET
TRANSACTION ISOLATION LEVEL after a SAVEPOINT.  (The savepoint
mechanism might still be useful for the other cases, though.)

Perhaps it could be fixed by some special case code to suppress the
pg_settings query if and only if:

 1. We're in a transaction.
 2. No queries have been run yet (psql would need to track this
itself, I think, as libpq doesn't provide it -- am I right?).
 3. The user is trying to tab complete on a SET.

If these are true, the code could jump straight to SET TRANSACTION
ISOLATION LEVEL.
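
In rough C (a sketch only; the helper and the sent_queries flag are
hypothetical, and psql would have to maintain such a flag itself as noted
in point 2):

#include <stdbool.h>
#include <libpq-fe.h>

static bool
skip_settings_query_for_set(PGconn *conn, bool sent_queries)
{
    /* 1. We're inside an explicit transaction block ... */
    if (PQtransactionStatus(conn) != PQTRANS_INTRANS)
        return false;

    /* 2. ... in which no query has been sent yet ... */
    if (sent_queries)
        return false;

    /*
     * 3. The caller only asks when completing after SET, so SET
     * TRANSACTION is still legal here; querying pg_settings would use up
     * that window, so suggest "TRANSACTION ISOLATION LEVEL" etc. directly
     * instead.
     */
    return true;
}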

Of course, this is a pain for users who don't want to do that but *do*
want to set a variable when beginning a transaction.

Ideas?

Edmund



Better Upgrades

2018-02-05 Thread David Fetter
Folks,

While chatting with Bruce about how to make something better than
pg_upgrade, we (and by "we," I mean mostly Bruce) came up with the
following.

What needs improvement:

- pg_upgrade forces a down time event, no matter how cleverly it's done.
- pg_upgrade is very much a blocker for on-disk format changes.

The proposal:

- Add a new script--possibly Perl or Bash, which would:
- Initdb a new cluster with the new version of PostgreSQL and a
  different port.
- Start logical replication from the old version to the new
  version.
- Poll until a pre-determined default amount of replication lag was 
observed, then:
  * Issue an ALTER SYSTEM on the new server to change its port to the old 
server's
  * Issue a pg_ctl stop -w to the old server
  * Issue a pg_ctl restart on the new server
  * Happiness!

Assumptions underlying it:

- Disk and similar resources are cheap enough for most users that
  doubling up during the upgrade is feasible.
- The default upgrade path should require exactly one step.
- Errors do not, by and large, have the capacity to violate an SLA.

The proposal has blockers: 

- We don't actually have logical decoding for DDL, although I'm given
  to understand that Álvaro Herrera has done some yeoman follow-up
  work on Dimitri Fontaine's PoC patches.
- We don't have logical decoding for DCL (GRANT/REVOKE)

We also came up with and, we believe, addressed an important issue,
namely how to ensure continuity.  When we issue a `pg_ctl stop -w`,
that's short for "Cancel current commands and stop cleanly."  At this
point, the new server will not have WAL to replay, so a pg_ctl restart
will load the new configuration and come up pretty much immediately,
and the next try will find a brand new server without a down time
event.

Does this seem worth coding up in its current form?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Better Upgrades

2018-02-05 Thread David G. Johnston
On Mon, Feb 5, 2018 at 5:09 PM, David Fetter  wrote:

>
> The proposal has blockers:
>
> - We don't actually have logical decoding for DDL, although I'm given
>   to understand that Álvaro Herrera has done some yeoman follow-up
>   work on Dimitri Fontaine's PoC patches.
> - We don't have logical decoding for DCL (GRANT/REVOKE)
>
> We also came up with and, we believe, addressed an important issue,
> namely how to ensure continuity.  When we issue a `pg_ctl stop -w`,
> that's short for "Cancel current commands and stop cleanly."  At this
> point, the new server will not have WAL to replay, so a pg_ctl restart
> will load the new configuration and come up pretty much immediately,
> and the next try will find a brand new server without a down time
> event.
>

To what extent do people think that differing cache contents between the
primary and secondary are likely to play a role?  While connection downtime
is zero as long as people have their fail-and-reconnect logic in place,
queries taking longer than expected due to unexpected disk I/O could cause
SLA problems.  pg_prewarm currently fulfills that role, and maybe it isn't
that large a problem in practice (the scale I operate at doesn't have this
concern), but having some kind of channel to keep cached blocks (at least
within PostgreSQL) synchronized would probably be something to at least
factor into the design even if the first pass didn't solve that particular
problem.

David J.




Re: WIP: BRIN multi-range indexes

2018-02-05 Thread Tomas Vondra
On 02/06/2018 12:40 AM, Tom Lane wrote:
> Tomas Vondra  writes:
>> Yeah, that's what I've been wondering about too. There's also this
>> comment in nabstime.h:
> 
>> /*
>>  * Although time_t generally is a long int on 64 bit systems, these two
>>  * types must be 4 bytes, because that's what pg_type.h assumes. They
>>  * should be yanked (long) before 2038 and be replaced by timestamp and
>>  * interval.
>>  */
> 
>> But then why adding BRIN opclasses at all? And if adding them, why not
>> to test them? We all know how long deprecation takes, particularly for
>> data types.
> 
> There was some pretty recent chatter about removing these types;
> IIRC Andres was annoyed about their lack of overflow checks.
> 
> I would definitely vote against adding any BRIN support for these
> types, or indeed doing any work on them at all other than removal.
> 

Works for me. Ripping out the two opclasses from the patch is trivial.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Better Upgrades

2018-02-05 Thread Craig Ringer
On 6 February 2018 at 08:09, David Fetter  wrote:

> Folks,
>
> While chatting with Bruce about how to make something better than
> pg_upgrade, we (and by "we," I mean mostly Bruce) came up with the
> following.
>
> What needs improvement:
>
> - pg_upgrade forces a down time event, no matter how cleverly it's done.
> - pg_upgrade is very much a blocker for on-disk format changes.
>
> The proposal:
>
> - Add a new script--possibly Perl or Bash, which would:
> - Initdb a new cluster with the new version of PostgreSQL and a
>   different port.
> - Start logical replication from the old version to the new
>   version.
> - Poll until a pre-determined default amount of replication lag was
> observed, then:
>   * Issue an ALTER SYSTEM on the new server to change its port to the
> old server's
>   * Issue a pg_ctl stop -w to the old server
>   * Issue a pg_ctl restart on the new server
>   * Happiness!
>
> Assumptions underlying it:
>
> - Disk and similar resources are cheap enough for most users that
>   doubling up during the upgrade is feasible.
> - The default upgrade path should require exactly one step.
> - Errors do not, by and large, have the capacity to violate an SLA.
>
> The proposal has blockers:
>
> - We don't actually have logical decoding for DDL, although I'm given
>   to understand that Álvaro Herrera has done some yeoman follow-up
>   work on Dimitri Fontaine's PoC patches.
>

Yep, some DDL support would be key. Lots of apps expect to do DDL online
(automatic migrations, etc), and after all it's been one of Pg's selling
points for a long time. Not having it means people have to do much more
prep - and *will* ignore that prep and break things if you expose it as an
easy to use tool.

You will find dealing with DDL that does full table rewrites to be a
challenge. Especially ALTER TABLEs that add a DEFAULT using a non-IMMUTABLE
expression and ALTER TYPE ... USING with a non-IMMUTABLE expression. But
even for the immutable cases some work is needed to let us cleanly
replicate the DDL.

Some DDL may initially need to be disallowed because it plays poorly with
logical decoding, e.g. CREATE INDEX CONCURRENTLY. It's fixable, it just
requires special case handling where the apply side understands that
specific statement and knows how to recover if we have an error partway
through applying it.

We (2ndQuadrant) routinely do online migrations like this using pglogical,
which is a much more capable tool than in-core logical replication
currently is. But it still requires careful preparation and application
qualification, and preferably a trial run. Improvements in schema change /
DDL handling would be needed to address that.

Logical decoding doesn't support a number of postgres features, which still
rules its use out for some customers. No sequence support (though pglogical
works around that with periodic sequence sync). No pg_largeobject support.
Probably more I'm forgetting.

Long txns and txns that change huge numbers of rows, especially if they
also dirty the catalogs (even temp tables!) are a challenge. Performance
isn't great for them, but more importantly we don't start decoding them
until they commit so the replication latency for them can be very large.
Care is required to time a cutover so you don't land up stopping writes to
the old db then waiting ages before the new db is caught up.



> - We don't have logical decoding for DCL (GRANT/REVOKE)
>

In general there are issues with any command affecting the cluster as a
whole: capturing them, ensuring replay order between them if run in
different dbs, etc.

The simplest option would be to disallow them during migration (make them
ERROR). Or possibly restrict them to a single DB where we permit them and
add them to a DDL replication queue.

If you let them run in any db and capture them wherever they ran you run
into order-of-apply issues because the logical change streams aren't
synchronised between the DBs.


>
> We also came up with and, we believe, addressed an important issue,
> namely how to ensure continuity.  When we issue a `pg_ctl stop -w`,
> that's short for "Cancel current commands and stop cleanly."  At this
> point, the new server will not have WAL to replay, so a pg_ctl restart
> will load the new configuration and come up pretty much immediately,
> and the next try will find a brand new server without a down time
> event.


You'll need to be able to make sure the old server stays up after the last
user commit  for long enough to flush all pending WAL to the new server.
There may be committed changes that the new server hasn't applied yet.

If the old server crashes during the grace period you'll have to restart it
and retry, since you don't know for sure if the new server got all the
changes.

You'll want some kind of read-only mode where you can ensure that exactly
one server is writeable by user queries at a time. And to prevent conflicts
there'll be some period where both are read-only during the cutover.

Re: [HACKERS] [PATCH] Lockable views

2018-02-05 Thread Tatsuo Ishii
Robert,

> I just reread those discussions but I don't see that they really make
> any argument for the behavior the patch implements.  I see no
> explanation on the thread for why locking a table inside of a subquery
> is more or less likely to cause deadlock than locking one outside of a
> subquery.

If we allow to lock a table in a subquery, following could happen:

We have: CREATE VIEW v1 AS SELECT * FROM t1 WHERE i = (SELECT i FROM t2);

1. Session A tries to lock v1 (I suppose it tries to acquire lock in
the order of t1, then t2). A acquires the lock on t1 but not yet on t2.

2. Another session B acquires lock on t2.

3. A continues to try to acquire lock on t2 (blocked).

4. B tries to acquire lock on t1. Deadlock occurs.

So if a user mixes locking a view and an underlying base table, there's
a possibility of deadlocks. If we regard it as the user's
responsibility not to mix them, or to be careful about the consequences
of mixing locks so that deadlocks do not happen, then I would agree
that we should lock a table inside a subquery.

What do you think?

>>> I think that if we
>>> change the rules for which subqueries get flattened in a future
>>> release, then the behavior will also change.  That seems bad.
>>
>> I doubt it could happen in the future but if that happend we should
>> disallow locking on such views.
> 
> That doesn't make any sense to me.  When someone migrates from
> PostgreSQL 11 to, say, PostgreSQL 14, the view definition is going to
> be recreated from an SQL query.  Neither the user nor the database
> will know whether the query was optimized the same way on both
> databases, so how could we disallow locking only those views where
> there was a difference on the two releases?  Even if we could, how
> does that help anything?  Throwing an error is just as much a
> backward-incompatibility in the command as silently changing what gets
> locked.
> 
> But my complaint may have been a little off base all the same -- I
> guess we're doing this based on the rewriter output, rather than the
> optimizer output, which makes it a lot less likely that we would
> decide to change anything here.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Better Upgrades

2018-02-05 Thread Peter Eisentraut
On 2/5/18 19:09, David Fetter wrote:
> - Add a new script--possibly Perl or Bash, which would:
> - Initdb a new cluster with the new version of PostgreSQL and a
>   different port.

This will need integration with the packaging system.  You'll want to
carry over settings from the old instance.  You might want to put the
new instance on a different host.

> - Start logical replication from the old version to the new
>   version.

There is a step missing that does the DDL sync.  And various features
are not handled by logical replication.  So you'll need a pre-check mode
like pg_upgrade.

Also, you need to do this for each database, so you'll need to decide
whether you'll do it all in parallel or sequentially, where you will
continue when it fails part way through, etc.

> - Poll until a pre-determined default amount of replication lag was 
> observed, then:

Probably the replication lag should be zero, or perhaps you'll even want
to switch to synchronous replication.  Or perhaps you'll switch the
writing node while replication is still catching up.  A lot of that
depends on the application.

>   * Issue an ALTER SYSTEM on the new server to change its port to the old 
> server's

Or you use a connection proxy and have that handle redirecting the
traffic, so you don't need to restart anything.

>   * Issue a pg_ctl stop -w to the old server
>   * Issue a pg_ctl restart on the new server

You can't use pg_ctl when using systemd.

> Does this seem worth coding up in its current form?

Logically, this should be a ten line script.  But I fear making it
actually work in a variety of practical scenarios will probably require
dozens of options and end up very complicated.

At this point, it might be worth more to actually try this procedure by
hand first and work out the details, e.g., how do you do the DDL sync,
how to you convert the configuration files, etc.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



update tuple routing and triggers

2018-02-05 Thread Amit Langote
Hi.

Fujita-san pointed out in a nearby thread [1] that EXPLAIN ANALYZE shows
duplicate stats for partitions' triggers.

Example:

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
create table p3 partition of p for values in (3);

create trigger show_data before update on p1 for each row execute
procedure trig_notice_func();
create trigger show_data before update on p2 for each row execute
procedure trig_notice_func();

insert into p values (1), (2);

explain (analyze, costs off, timing off) update p set a = a + 1;
NOTICE:  OLD: (1); NEW: (2)
NOTICE:  OLD: (2); NEW: (3)
  QUERY PLAN
--
 Update on p (actual rows=0 loops=1)
   Update on p1
   Update on p2
   Update on p3
   ->  Seq Scan on p1 (actual rows=1 loops=1)
   ->  Seq Scan on p2 (actual rows=1 loops=1)
   ->  Seq Scan on p3 (actual rows=0 loops=1)
 Planning time: 2.000 ms
 Trigger show_data on p1: calls=1
 Trigger show_data on p2: calls=1
 Trigger show_data on p1: calls=1
 Trigger show_data on p2: calls=1
 Execution time: 4.228 ms
(13 rows)

See that the information about the trigger show_data is shown twice for
partitions p1 and p2.  That happens because ExplainPrintTriggers() goes
through both es_result_relations and es_leaf_result_relations to show the
trigger information.  As Fujita-san pointed out in the linked email,
ExecSetupPartitionTupleRouting() adds a partition ResultRelInfo to
es_leaf_result_relations even if it may already have been present in
es_result_relations, which happens if a ResultRelInfo is reused in the
case of update tuple routing.

Attached is a patch to fix that.  After the patch:

explain (analyze, costs off, timing off) update p set a = a + 1;
NOTICE:  OLD: (1); NEW: (2)
NOTICE:  OLD: (2); NEW: (3)
  QUERY PLAN
--
 Update on p (actual rows=0 loops=1)
   Update on p1
   Update on p2
   Update on p3
   ->  Seq Scan on p1 (actual rows=1 loops=1)
   ->  Seq Scan on p2 (actual rows=1 loops=1)
   ->  Seq Scan on p3 (actual rows=0 loops=1)
 Planning time: 0.627 ms
 Trigger show_data on p1: calls=1
 Trigger show_data on p2: calls=1
 Execution time: 1.443 ms
(11 rows)

When working on this, I wondered if the es_leaf_result_relations should
actually be named something like es_tuple_routing_result_rels, to denote
the fact that they're created by tuple routing code.  The current name
might lead to someone thinking that it contains *all* leaf result rels,
but that won't remain true after this patch.  Thoughts?

Thanks,
Amit

[1] https://www.postgresql.org/message-id/5A783549.4020409%40lab.ntt.co.jp
From 31e119c1adc34fabbf87d4516c8e37e1cd41e90b Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 6 Feb 2018 10:40:45 +0900
Subject: [PATCH v1] Fix trigger behavior with update tuple routing

Trigger stats shown by EXPLAIN ANALYZE may contain duplicate
entries, because leaf partitions' ResultRelInfo is redundantly
present in multiple EState lists.

Reported by: Etsuro Fujita
---
 src/backend/executor/execPartition.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 106a96d910..619a8d7a99 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -178,6 +178,13 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
                                                       resultRTindex,
                                                       rel,
                                                       estate->es_instrument);
+
+           /*
+            * Since we're newly creating this ResultRelInfo, add it to
+            * someplace where explain.c could find them.
+            */
+           estate->es_leaf_result_relations =
+               lappend(estate->es_leaf_result_relations, leaf_part_rri);
        }
 
        part_tupdesc = RelationGetDescr(partrel);
@@ -210,9 +217,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
                                                  mtstate != NULL &&
                                                  mtstate->mt_onconflict != ONCONFLICT_NONE);
 
-       estate->es_leaf_result_relations =
-           lappend(estate->es_leaf_result_relations, leaf_part_rri);
-
        proute->partitions[i] = leaf_part_rri;
        i++;
    }
-- 
2.11.0



Re: Better Upgrades

2018-02-05 Thread Joshua D. Drake

On 02/05/2018 04:09 PM, David Fetter wrote:

Does this seem worth coding up in its current form?


No. The pg_upgrade utility is awesome and I have commended Bruce on
multiple occasions for his work on it. That being said, the
"solution" is to support in-place upgrades and our work should be toward
that. The idea is that we support an upgrade of the catalog plus
backward and forward support for changes to the page files (upgrading the
page files as they are accessed/written to); that is a much longer, more
mature and reasonable approach to this problem [1].


JD

1. Props to Theo for bringing this up about a decade ago.



Best,
David.



--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *




Re: non-bulk inserts and tuple routing

2018-02-05 Thread Amit Langote
On 2018/02/05 19:43, Etsuro Fujita wrote:
> (2018/02/05 14:34), Amit Langote wrote:
>> On 2018/02/02 19:56, Etsuro Fujita wrote:
>>> * In ExecInitPartitionResultRelInfo:
>>> +   /*
>>> +    * Note that the entries in this list appear in no predetermined
>>> +    * order as result of initializing partition result rels as and
>>> when
>>> +    * they're needed.
>>> +    */
>>> +   estate->es_leaf_result_relations =
>>> + lappend(estate->es_leaf_result_relations,
>>> +   leaf_part_rri);
>>>
>>> Is it OK to put this within the "if (leaf_part_rri == NULL)" block?
>>
>> Good catch.  I moved it outside the block.  I was under the impression
>> that leaf result relations that were reused from the
>> mtstate->resultRelInfo arrary would have already been added to the list,
>> but it seems they are not.
> 
> I commented this because the update-tuple-routing patch has added to the
> list ResultRelInfos reused from the mtstate->resultRelInfo array, but on
> reflection I noticed this would cause oddity in reporting execution stats
> for partitions' triggers in EXPLAIN ANALYZE.  Here is an example using the
> head:
> 
> postgres=# create table trigger_test (a int, b text) partition by list (a);
> CREATE TABLE
> postgres=# create table trigger_test1 partition of trigger_test for values
> in (1);
> CREATE TABLE
> postgres=# create table trigger_test2 partition of trigger_test for values
> in (2);
> CREATE TABLE
> postgres=# create trigger before_upd_row_trig before update on
> trigger_test1 for
>  each row execute procedure trigger_data (23, 'skidoo');
> CREATE TRIGGER
> postgres=# create trigger before_del_row_trig before delete on
> trigger_test1 for
>  each row execute procedure trigger_data (23, 'skidoo');
> CREATE TRIGGER
> postgres=# insert into trigger_test values (1, 'foo');
> INSERT 0 1
> postgres=# explain analyze update trigger_test set a = 2 where a = 1;
> NOTICE:  before_upd_row_trig(23, skidoo) BEFORE ROW UPDATE ON trigger_test1
> NOTICE:  OLD: (1,foo),NEW: (2,foo)
> NOTICE:  before_del_row_trig(23, skidoo) BEFORE ROW DELETE ON trigger_test1
> NOTICE:  OLD: (1,foo)
>   QUERY PLAN
> 
> 
> 
> ---
>  Update on trigger_test  (cost=0.00..25.88 rows=6 width=42) (actual
> time=2.337..
> 2.337 rows=0 loops=1)
>    Update on trigger_test1
>    ->  Seq Scan on trigger_test1  (cost=0.00..25.88 rows=6 width=42)
> (actual tim
> e=0.009..0.011 rows=1 loops=1)
>  Filter: (a = 1)
>  Planning time: 0.186 ms
>  Trigger before_del_row_trig on trigger_test1: time=0.495 calls=1
>  Trigger before_upd_row_trig on trigger_test1: time=0.870 calls=1
>  Trigger before_del_row_trig on trigger_test1: time=0.495 calls=1
>  Trigger before_upd_row_trig on trigger_test1: time=0.870 calls=1
>  Execution time: 2.396 ms
> (10 rows)
> 
> Both trigger stats for the on-update and on-delete triggers are doubly
> shown in the above output.  The reason would be that ExplainPrintTriggers
> called report_triggers twice for trigger_test1's ResultRelInfo: once for
> it from queryDesc->estate->es_result_relations and once for it from
> queryDesc->estate->es_leaf_result_relations.  I don't think this is
> intended behavior, so I think we should fix this.  I think we could
> probably address this by modifying ExecInitPartitionInfo in your patch so
> that it doesn't add to the es_leaf_result_relations list ResultRelInfos
> reused from the mtstate->resultRelInfo array, as your previous version of
> the patch.  (ExecGetTriggerResultRel looks at the list too, but it would
> probably work well for this change.)  It might be better to address this
> in another patch, though.

I see.  Thanks for the analysis and the explanation.

Seeing as this bug exists in HEAD, as you also seem to be saying, we'd
need to fix it independently of the patches on this thread.  I've posted a
patch in another thread titled "update tuple routing and triggers".

Thanks,
Amit




Add more information_schema columns

2018-02-05 Thread Peter Eisentraut
Here is a patch that fills in a few more information schema columns, in
particular those related to the trigger transition tables feature.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c8e1585951859b1248f02c070929e9f83534092a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 5 Feb 2018 20:22:16 -0500
Subject: [PATCH] Add more information_schema columns

- table_constraints.enforced
- triggers.action_order
- triggers.action_reference_old_table
- triggers.action_reference_new_table
---
 doc/src/sgml/information_schema.sgml   | 20 +--
 src/backend/catalog/information_schema.sql | 12 ---
 src/test/regress/expected/triggers.out | 54 ++
 src/test/regress/sql/triggers.sql  |  5 +++
 4 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/information_schema.sgml 
b/doc/src/sgml/information_schema.sgml
index 0faa72f1d3..09ef2827f2 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -5317,6 +5317,13 @@ table_constraints 
Columns
   yes_or_no
   YES if the constraint is deferrable and 
initially deferred, NO if not
  
+ 
+  enforced
+  yes_or_no
+  Applies to a feature not available in
+  PostgreSQL (currently always
+  YES)
+ 
 

   
@@ -5761,7 +5768,14 @@ triggers Columns
  
   action_order
   cardinal_number
-  Not yet implemented
+  
+   Firing order among triggers on the same table having the same
+   event_manipulation,
+   action_timing, and
+   action_orientation.  In
+   PostgreSQL, triggers are fired in name
+   order, so this column reflects that.
+  
  
 
  
@@ -5806,13 +5820,13 @@ triggers Columns
  
   action_reference_old_table
   sql_identifier
-  Applies to a feature not available in 
PostgreSQL
+  Name of the old transition table, or null if 
none
  
 
  
   action_reference_new_table
   sql_identifier
-  Applies to a feature not available in 
PostgreSQL
+  Name of the new transition table, or null if 
none
  
 
  
diff --git a/src/backend/catalog/information_schema.sql 
b/src/backend/catalog/information_schema.sql
index 6fb1a1bc1c..6066597648 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -1783,7 +1783,8 @@ CREATE VIEW table_constraints AS
CAST(CASE WHEN c.condeferrable THEN 'YES' ELSE 'NO' END AS 
yes_or_no)
  AS is_deferrable,
CAST(CASE WHEN c.condeferred THEN 'YES' ELSE 'NO' END AS yes_or_no)
- AS initially_deferred
+ AS initially_deferred,
+   CAST('YES' AS yes_or_no) AS enforced
 
 FROM pg_namespace nc,
  pg_namespace nr,
@@ -1812,7 +1813,8 @@ CREATE VIEW table_constraints AS
CAST(r.relname AS sql_identifier) AS table_name,
CAST('CHECK' AS character_data) AS constraint_type,
CAST('NO' AS yes_or_no) AS is_deferrable,
-   CAST('NO' AS yes_or_no) AS initially_deferred
+   CAST('NO' AS yes_or_no) AS initially_deferred,
+   CAST('YES' AS yes_or_no) AS enforced
 
 FROM pg_namespace nr,
  pg_class r,
@@ -2084,8 +2086,8 @@ CREATE VIEW triggers AS
CAST(current_database() AS sql_identifier) AS event_object_catalog,
CAST(n.nspname AS sql_identifier) AS event_object_schema,
CAST(c.relname AS sql_identifier) AS event_object_table,
-   CAST(null AS cardinal_number) AS action_order,
-- XXX strange hacks follow
+   CAST(rank() OVER (PARTITION BY n.oid, c.oid, em.num, (t.tgtype & 1 
& 66) ORDER BY t.tgname) AS cardinal_number) AS action_order,
CAST(
  CASE WHEN pg_has_role(c.relowner, 'USAGE')
THEN (regexp_match(pg_get_triggerdef(t.oid), E'.{35,} WHEN 
\\((.+)\\) EXECUTE PROCEDURE'))[1]
@@ -2103,8 +2105,8 @@ CREATE VIEW triggers AS
  -- hard-wired refs to TRIGGER_TYPE_BEFORE, TRIGGER_TYPE_INSTEAD
  CASE t.tgtype & 66 WHEN 2 THEN 'BEFORE' WHEN 64 THEN 'INSTEAD OF' 
ELSE 'AFTER' END
  AS character_data) AS action_timing,
-   CAST(null AS sql_identifier) AS action_reference_old_table,
-   CAST(null AS sql_identifier) AS action_reference_new_table,
+   CAST(tgoldtable AS sql_identifier) AS action_reference_old_table,
+   CAST(tgnewtable AS sql_identifier) AS action_reference_new_table,
CAST(null AS sql_identifier) AS action_reference_old_row,
CAST(null AS sql_identifier) AS action_reference_new_row,
CAST(null AS time_stamp) AS created
diff --git a/src/test/regress/expected/triggers.out 
b/src/test/regress/expected/triggers.out
index 9a7aafcc96..7d60b4164f 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/

Re: update tuple routing and triggers

2018-02-05 Thread Amit Langote
On 2018/02/06 10:48, Amit Langote wrote:
> When working on this, I wondered if the es_leaf_result_relations should
> actually be named something like es_tuple_routing_result_rels, to denote
> the fact that they're created by tuple routing code.  The current name
> might lead to someone thinking that it contains *all* leaf result rels,
> but that won't remain true after this patch.  Thoughts?

While I'm waiting to hear some opinion on the renaming, I updated the
patch to clarify this in the comment about es_leaf_result_relations.

Thanks,
Amit
From 2dca4abcf584f8239b8c47a9a3b96be5585bce06 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 6 Feb 2018 10:40:45 +0900
Subject: [PATCH v2] Fix trigger behavior with update tuple routing

Trigger stats shown by EXPLAIN ANALYZE may contain duplicate
entries, because leaf partitions' ResultRelInfo is redundantly
present in multiple EState lists.

Reported by: Etsuro Fujita
---
 src/backend/executor/execPartition.c | 10 +++---
 src/include/nodes/execnodes.h|  5 -
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 106a96d910..619a8d7a99 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -178,6 +178,13 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
                                                       resultRTindex,
                                                       rel,
                                                       estate->es_instrument);
+
+           /*
+            * Since we're newly creating this ResultRelInfo, add it to
+            * someplace where explain.c could find them.
+            */
+           estate->es_leaf_result_relations =
+               lappend(estate->es_leaf_result_relations, leaf_part_rri);
        }
 
        part_tupdesc = RelationGetDescr(partrel);
@@ -210,9 +217,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
                                                  mtstate != NULL &&
                                                  mtstate->mt_onconflict != ONCONFLICT_NONE);
 
-       estate->es_leaf_result_relations =
-           lappend(estate->es_leaf_result_relations, leaf_part_rri);
-
        proute->partitions[i] = leaf_part_rri;
        i++;
    }
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a2a2a9f3d4..1915b53b2f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -466,7 +466,10 @@ typedef struct EState
    ResultRelInfo *es_root_result_relations;    /* array of ResultRelInfos */
    int         es_num_root_result_relations;   /* length of the array */
 
-   /* Info about leaf partitions of partitioned table(s) for insert queries: */
+   /*
+    * The following list contains ResultRelInfo's created by the tuple
+    * routing code for partitions that don't already have one.
+    */
    List       *es_leaf_result_relations;   /* List of ResultRelInfos */
 
    /* Stuff used for firing triggers: */
-- 
2.11.0



Re: [HACKERS] [PATCH] Lockable views

2018-02-05 Thread Robert Haas
On Mon, Feb 5, 2018 at 8:18 PM, Tatsuo Ishii  wrote:
> We have: CREATE VIEW v1 AS SELECT * FROM t1 WHERE i = (SELECT i FROM t2);
>
> 1. Session A tries to lock v1 (I suppose it tries to acquire lock in
> the order of t1, then t2). A acquires lock on t1 but yet on t2.
>
> 2. Another session B acquires lock on t2.
>
> 3. A continues to try to acquire lock on t2 (blocked).
>
> 4. B tries to acquire lock on t1. Deadlock occurs.

True.  But the same exact analysis also applies to this definition,
which contains no subquery:

CREATE VIEW v1 AS SELECT t1.* FROM t1, t2 WHERE t1.i = t2.i;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Better Upgrades

2018-02-05 Thread Craig Ringer
On 6 February 2018 at 09:51, Joshua D. Drake  wrote:

> On 02/05/2018 04:09 PM, David Fetter wrote:
>
>> Does this seem worth coding up in its current form?
>>
>
> No. The pg_upgrade utility is awesome and I have commended Bruce on
> multiple occasions about his work with it. That being said, the
> "solution" is to support in-place upgrades and our work should be toward
> that.


Yeah. Streaming upgrade is useful, but IMO it's a workaround for upgrade
issues more than a solution in itself. Useful for people who want a
conservative upgrade, but not that big a win. Sure you'd like to be able to
downgrade again if it doesn't go right, but that requires a two-way sync,
which introduces its own problems and failure modes.

Support for reading prior version catalogs in-place is what I see as the
user-friendly end goal. Just start Pg12 on top of a pg11 datadir with the
--allow-upgrade flag and you're done.

But I don't think I'm any keener to do the drudgery required to implement
it than anyone else is... and I share others' concerns about the
maintenance burden imposed, impact on future catalog change freedom, etc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [PATCH] Lockable views

2018-02-05 Thread Tatsuo Ishii
> True.  But the same exact analysis also applies to this definition,
> which contains no subquery:
> 
> CREATE VIEW v1 AS SELECT t1.* FROM t1, t2 WHERE t1.i = t2.i;

That's not an updatable view, thus cannot be locked according to the
proposed implementation.

Anyway, do you want to allow locking all base tables in a view
definition if the view is updatable?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] [PATCH] Lockable views

2018-02-05 Thread Robert Haas
On Mon, Feb 5, 2018 at 10:26 PM, Tatsuo Ishii  wrote:
>> True.  But the same exact analysis also applies to this definition,
>> which contains no subquery:
>>
>> CREATE VIEW v1 AS SELECT t1.* FROM t1, t2 WHERE t1.i = t2.i;
>
> That's not an updatable view, thus cannot be locked according to the
> proposed implementation.

Hmm, true.  Why exactly are we imposing the restriction to updateable
views, anyway?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-05 Thread Robert Haas
On Sun, Feb 4, 2018 at 3:41 AM, Simon Riggs  wrote:
>> It is not clear to me what is exactly your concern if we try to follow
>> #2?  To me, #2 seems like a natural choice.
>
> At first, but it gives an anomaly so is not a good choice. The patch
> does behavior #5, it rechecks the conditions with the latest row.
>
> Otherwise
> WHEN MATCHED AND a=0 THEN UPDATE SET b=0
> WHEN MATCHED AND a=1 THEN UPDATE SET b=1
> would result in (a=1, b=0) in case of concurrent updates, which the
> user clearly doesn't want.

I am unable to understand this.  What are you presuming the tuple was
originally?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Lockable views

2018-02-05 Thread Tatsuo Ishii
> Hmm, true.  Why exactly are we imposing the restriction to updateable
> views, anyway?

In my understanding, because of the ambiguity of determining which rows
in which base tables need to be modified just by looking at the DML
against a view. There could be multiple ways to modify the base
tables.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-05 Thread Robert Haas
On Sun, Feb 4, 2018 at 5:15 AM, Simon Riggs  wrote:
> Changes to support sub-selects don't invalidate what is there now in
> the current patch with regard to query representation or optimization.
> So support of those extra features can be added later if we choose.

I don't think you get to make a unilateral decision to exclude
features that work everywhere else from the scope of this patch.  If
there is agreement that those features can be left out of scope, then
that is one thing, but so far all the commentary about the things that
you've chosen to exclude has been negative.  Nor have you really given
any reason why they should be exempt.  You've pointed out that
parallel query doesn't handle everything (which is certainly true, but
does not mean that any feature from now and the end of time is allowed
to exclude from scope whatever seems inconvenient regardless of
contrary community consensus) and you've pointed out here and
elsewhere that somebody could go add the features you omitted later
(which is also true, but misses the general point that we want
committed patches to be reasonably complete already, not have big gaps
that someone will have to fix later).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Lockable views

2018-02-05 Thread Robert Haas
On Mon, Feb 5, 2018 at 10:49 PM, Tatsuo Ishii  wrote:
>> Hmm, true.  Why exactly are we imposing the restriction to updateable
>> views, anyway?
>
> In my understanding, because of ambiguity to determine which rows in
> which base tables needs to be modified by just looking at the DML
> against a view. There could be multiple ways to modify the base
> tables.

But what does that have to do with locking?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Crash in partition-wise join involving dummy partitioned relation

2018-02-05 Thread Ashutosh Bapat
On Tue, Feb 6, 2018 at 4:04 AM, Robert Haas  wrote:
> On Mon, Feb 5, 2018 at 4:46 AM, Ashutosh Bapat
>  wrote:
>> Here's patch taking that approach.
>
> I rewrote the comment in relation.h like this, which I think is more clear:
>
>  /*
>   * Is given relation partitioned?
>   *
> - * A join between two partitioned relations with same partitioning scheme
> - * without any matching partitions will not have any partition in it but will
> - * have partition scheme set. So a relation is deemed to be partitioned if it
> - * has a partitioning scheme, bounds and positive number of partitions.
> + * It's not enough to test whether rel->part_scheme is set, because it might
> + * be that the basic partitioning properties of the input relations matched
> + * but the partition bounds did not.
> + *
> + * We treat dummy relations as unpartitioned.  We could alternatively
> + * treat them as partitioned, but it's not clear whether that's a useful 
> thing
> + * to do.
>   */

The comment says why it checks both bounds and part_scheme, but it
doesn't explain why we check nparts, part_rels etc. My patch had that
explanation. Or maybe with these changes those checks are not needed.
Should we remove those?

Thanks for the commit.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-05 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 7:56 PM, Robert Haas  wrote:
> I don't think you get to make a unilateral decision to exclude
> features that work everywhere else from the scope of this patch.  If
> there is agreement that those features can be left out of scope, then
> that is one thing, but so far all the commentary about the things that
> you've chosen to exclude has been negative.  Nor have you really given
> any reason why they should be exempt.  You've pointed out that
> parallel query doesn't handle everything (which is certainly true, but
> does not mean that any feature from now and the end of time is allowed
> to exclude from scope whatever seems inconvenient regardless of
> contrary community consensus) and you've pointed out here and
> elsewhere that somebody could go add the features you omitted later
> (which is also true, but misses the general point that we want
> committed patches to be reasonably complete already, not have big gaps
> that someone will have to fix later).

For me, the concern is not really the omission of support for certain
features as such. The concern is that those omissions hint that there
is a problem with the design itself, particularly in the optimizer.
Allowing subselects in the UPDATE part of a MERGE does not seem like
something that could be written as a neat adjunct to what Simon already came up
with. If that was possible, Simon probably already would have done it.

-- 
Peter Geoghegan



Re: [COMMITTERS] pgsql: Rearm statement_timeout after each executed query.

2018-02-05 Thread Peter Eisentraut
On 9/18/17 22:41, Andres Freund wrote:
> Rearm statement_timeout after each executed query.

This appears to have broken statement_timeout behavior in master such
that only every second query is affected by it.  For example:

create table t1 as select * from generate_series(0, 1) as _(a);
set statement_timeout = '1s';
explain analyze select from t1 where a = 55;  -- gets canceled
explain analyze select from t1 where a = 55;  -- completes (>1s)
explain analyze select from t1 where a = 55;  -- gets canceled
explain analyze select from t1 where a = 55;  -- completes (>1s)
etc.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Query running for very long time (server hanged) with parallel append

2018-02-05 Thread Kyotaro HORIGUCHI
At Mon, 5 Feb 2018 15:29:27 +0530, Amit Khandekar  
wrote in 
> On 2 February 2018 at 20:46, Robert Haas  wrote:
> > On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar  
> > wrote:
> >> The query is actually hanging because one of the workers is in a small
> >> loop where it iterates over the subplans searching for unfinished
> >> plans, and it never comes out of the loop (it's a bug which I am yet
> >> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
> >> each iteration; it's a small loop that does not pass control to any
> >> other functions .
> >
> > Uh, sounds like we'd better fix that bug.
> 
> 
> The scenario is this : One of the workers w1 hasn't yet chosen the
> first plan, and all the plans are already finished. So w1 has it's
> node->as_whichplan equal to -1. So the below condition in
> choose_next_subplan_for_worker() never becomes true for this worker :
> 
> if (pstate->pa_next_plan == node->as_whichplan)
> {
>/* We've tried everything! */
>pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
>LWLockRelease(&pstate->pa_lock);
>return false;
> }
> 
> What I think is : we should save the information about which plan we
> started the search with, before the loop starts. So, we save the
> starting plan value like this, before the loop starts:
>initial_plan = pstate->pa_next_plan;
> 
> And then break out of the loop when we come back to this initial plan :
> if (initial_plan == pstate->pa_next_plan)
>break;
> 
> Now, suppose it so happens that initial_plan is a non-partial plan.
> And then when we wrap around to the first partial plan, we check
> (initial_plan == pstate->pa_next_plan) which will never become true
> because initial_plan is less than first_partial_plan.
> 
> So what I have done in the patch is : have a flag 'should_wrap_around'
> indicating that we should not wrap around. This flag is true when
> initial_plan is a non-partial plan, in which case we know that we will
> have covered all plans by the time we reach the last plan. So break
> from the loop if this flag is false, or if we have reached the initial
> plan.
> 
> Attached is a patch that fixes this issue on the above lines.

The patch adds two new variables and always sets them, but wrap
around can happen at most once per call, so I think it is enough
to arrange to stop at wrap-around time. In addition to that, the
patch is forgetting the case of no partial plan. The loop can
overrun on pa_finished in that case.

I think something like the following would work.

@@ -485,6 +485,13 @@ choose_next_subplan_for_worker(AppendState *node)
 {
   /* Loop back to first partial plan. */
   pstate->pa_next_plan = append->first_partial_plan;
+
+  /*
+   * Arrange to bail out if we are trying to fetch the first partial
+   * plan
+   */
+  if (node->as_whichplan < append->first_partial_plan)
+node->as_whichplan = append->first_partial_plan;
 }
 else

> >> But I am not sure about this : while the workers are at it, why the
> >> backend that is waiting for the workers does not come out of the wait
> >> state with a SIGINT. I guess the same issue has been discussed in the
> >> mail thread that you pointed.
> >
> > Is it getting stuck here?
> >
> > /*
> >  * We can't finish transaction commit or abort until all of the workers
> >  * have exited.  This means, in particular, that we can't respond to
> >  * interrupts at this stage.
> >  */
> > HOLD_INTERRUPTS();
> > WaitForParallelWorkersToExit(pcxt);
> > RESUME_INTERRUPTS();
> 
> Actually the backend is getting stuck in
> choose_next_subplan_for_leader(), in LWLockAcquire(), waiting for the
> hanging worker to release the pstate->pa_lock. I think there is
> nothing wrong in this, because it is assumed that LWLock wait is going
> to be for very short tiime, and because of this bug, the lwlock waits
> forever.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Query running for very long time (server hanged) with parallel append

2018-02-05 Thread Kyotaro HORIGUCHI
At Tue, 06 Feb 2018 13:34:19 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180206.133419.02213593.horiguchi.kyot...@lab.ntt.co.jp>
> At Mon, 5 Feb 2018 15:29:27 +0530, Amit Khandekar  
> wrote in 
> > On 2 February 2018 at 20:46, Robert Haas  wrote:
> > > On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar  
> > > wrote:
> > >> The query is actually hanging because one of the workers is in a small
> > >> loop where it iterates over the subplans searching for unfinished
> > >> plans, and it never comes out of the loop (it's a bug which I am yet
> > >> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
> > >> each iteration; it's a small loop that does not pass control to any
> > >> other functions .
> > >
> > > Uh, sounds like we'd better fix that bug.
> > 
> > 
> > The scenario is this : One of the workers w1 hasn't yet chosen the
> > first plan, and all the plans are already finished. So w1 has it's
> > node->as_whichplan equal to -1. So the below condition in
> > choose_next_subplan_for_worker() never becomes true for this worker :
> > 
> > if (pstate->pa_next_plan == node->as_whichplan)
> > {
> >/* We've tried everything! */
> >pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
> >LWLockRelease(&pstate->pa_lock);
> >return false;
> > }
> > 
> > What I think is : we should save the information about which plan we
> > started the search with, before the loop starts. So, we save the
> > starting plan value like this, before the loop starts:
> >initial_plan = pstate->pa_next_plan;
> > 
> > And then break out of the loop when we come back to this initial plan :
> > if (initial_plan == pstate->pa_next_plan)
> >break;
> > 
> > Now, suppose it so happens that initial_plan is a non-partial plan.
> > And then when we wrap around to the first partial plan, we check
> > (initial_plan == pstate->pa_next_plan) which will never become true
> > because initial_plan is less than first_partial_plan.
> > 
> > So what I have done in the patch is : have a flag 'should_wrap_around'
> > indicating that we should not wrap around. This flag is true when
> > initial_plan is a non-partial plan, in which case we know that we will
> > have covered all plans by the time we reach the last plan. So break
> > from the loop if this flag is false, or if we have reached the initial
> > plan.
> > 
> > Attached is a patch that fixes this issue on the above lines.
> 
> The patch adds two new variables and always sets them, but wrap
> around can happen at most once per call, so I think it is enough
> to arrange to stop at wrap-around time. In addition to that, the
> patch is forgetting the case of no partial plan. The loop can
> overrun on pa_finished in that case.

Sorry, the patch works fine. But still the new variables don't
seem needed.

> I think something like the following would work.
> 
> @@ -485,6 +485,13 @@ choose_next_subplan_for_worker(AppendState *node)
>  {
>/* Loop back to first partial plan. */
>pstate->pa_next_plan = append->first_partial_plan;
> +
> +  /*
> +   * Arrange to bail out if we are trying to fetch the first partial
> +   * plan
> +   */
> +  if (node->as_whichplan < append->first_partial_plan)
> +node->as_whichplan = append->first_partial_plan;
>  }
>  else
> 
> > >> But I am not sure about this : while the workers are at it, why the
> > >> backend that is waiting for the workers does not come out of the wait
> > >> state with a SIGINT. I guess the same issue has been discussed in the
> > >> mail thread that you pointed.
> > >
> > > Is it getting stuck here?
> > >
> > > /*
> > >  * We can't finish transaction commit or abort until all of the 
> > > workers
> > >  * have exited.  This means, in particular, that we can't respond to
> > >  * interrupts at this stage.
> > >  */
> > > HOLD_INTERRUPTS();
> > > WaitForParallelWorkersToExit(pcxt);
> > > RESUME_INTERRUPTS();
> > 
> > Actually the backend is getting stuck in
> > choose_next_subplan_for_leader(), in LWLockAcquire(), waiting for the
> > hanging worker to release the pstate->pa_lock. I think there is
> > nothing wrong in this, because it is assumed that LWLock wait is going
> > to be for very short tiime, and because of this bug, the lwlock waits
> > forever.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: update tuple routing and triggers

2018-02-05 Thread Amit Khandekar
Thanks for the patch Amit.

I was wondering whether the same duplicate result rels issue can arise
for es_root_result_relations and es_result_relations. But I think for
inserts, es_root_result_relations is NULL, and for updates, these two
lists always have distinct sets of relations. So this issue might not
arise for these two lists.

So what the patch does looks good to me. One minor thing :

+ * Since we're newly creating this ResultRelInfo, add it to
+ * someplace where explain.c could find them.

I think that above, instead of wording the comment specifically for
explain.c, we should make it more general. The list is also used by
ExecGetTriggerResultRel(). (There, this duplicate-relation issue does
not come up because it only uses the first matching entry.)

I think the name es_tuple_routing_result_rels that you chose sounds good.

Thanks
-Amit


On 6 February 2018 at 08:08, Amit Langote  wrote:
> On 2018/02/06 10:48, Amit Langote wrote:
>> When working on this, I wondered if the es_leaf_result_relations should
>> actually be named something like es_tuple_routing_result_rels, to denote
>> the fact that they're created by tuple routing code.  The current name
>> might lead to someone thinking that it contains *all* leaf result rels,
>> but that won't remain true after this patch.  Thoughts?
>
> While I'm waiting to hear some opinion on the renaming, I updated the
> patch to clarify this in the comment about es_leaf_result_relations.
>
> Thanks,
> Amit



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: update tuple routing and triggers

2018-02-05 Thread Etsuro Fujita

(2018/02/06 11:38), Amit Langote wrote:

On 2018/02/06 10:48, Amit Langote wrote:

When working on this, I wondered if the es_leaf_result_relations should
actually be named something like es_tuple_routing_result_rels, to denote
the fact that they're created by tuple routing code.  The current name
might lead to someone thinking that it contains *all* leaf result rels,
but that won't remain true after this patch.  Thoughts?


While I'm waiting to hear some opinion on the renaming, I updated the
patch to clarify this in the comment about es_leaf_result_relations.


I'm not sure we really need to rename that.  Wouldn't the updated 
comment on that list in execnodes.h be enough?


Other comment:

@@ -178,6 +178,13 @@ ExecSetupPartitionTupleRouting(ModifyTableState 
*mtstate,

  resultRTindex,
  rel,
  
estate->es_instrument);
+
+   /*
+* Since we're newly creating this ResultRelInfo, add 
it to
+* someplace where explain.c could find them.
+*/
+   estate->es_leaf_result_relations =
+   lappend(estate->es_leaf_result_relations, 
leaf_part_rri);
}

I think the above comment would need some more work because that list 
will be searched by ExecGetTriggerResultRel also.


Other than that, the patch looks good to me.

Thanks for the patch!

Best regards,
Etsuro Fujita



Re: [HACKERS] More stats about skipped vacuums

2018-02-05 Thread Masahiko Sawada
On Mon, Dec 11, 2017 at 8:15 PM, Kyotaro HORIGUCHI
 wrote:
> At Mon, 27 Nov 2017 13:51:22 -0500, Robert Haas  wrote 
> in 
>> On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
>>  wrote:
>> > Hmmm. Okay, we must make the stats collector more efficient if we
>> > want to have additional counters with smaller significance in the
>> > table stats. Currently sizeof(PgStat_StatTabEntry) is 168
>> > bytes. The whole of the patchset increases it to 232 bytes. Thus
>> > the size of a stats file for a database with 10000 tables
>> > increases from about 1.7MB to 2.4MB.  DSM and shared dynahash are
>> > not dynamically expandable, so placing stats in a shared hash
>> > doesn't seem effective. Stats as a regular table could work, but
>> > it seems too much.
>>
>> dshash, which is already committed, is both DSM-based and dynamically
>> expandable.
>
> Yes, I forgot about that. We can just copy memory blocks to take
> a snapshot of stats.
>
>> > Would it be acceptable to add a new section containing these new
>> > counters, which is just loaded as a byte sequence, with parsing
>> > (and filling the corresponding hash) postponed until a counter
>> > in the section is really requested?  The new counters need to be
>> > shown in a separate stats view (maybe named pg_stat_vacuum).
>>
>> Still makes the stats file bigger.
>
> I considered dshash for pgstat.c and the attached is a *PoC*
> patch, which is not full-fledged and only works in a not very
> concurrent situation.
>
> - Made the stats collector an auxiliary process. A crash of the stats
>   collector leads to a whole-server restart.
>
> - dshash lacks the capability of sequential scan, so I added it.
>
> - Also added a snapshot function to dshash. It just copies the
>   underlying DSA segments into local memory, but currently it
>   doesn't acquire dshash-level locks at all. I tried the same
>   thing with resize but it led to very quick exhaustion of
>   LWLocks. An LWLock for the whole dshash would be required (and
>   it would also be useful for resize() and sequential scan).
>
> - The current dshash doesn't shrink at all. Such a feature will
>   also be required. (A server restart causes a shrink of the hashes
>   in the same way as before, but a bloated dshash requires copying
>   more memory than necessary when taking a snapshot.)
>
> The size of a DSA is about 1MB at minimum. Copying entry by entry
> into a (non-ds) hash might be better than copying the underlying DSA
> as a whole, and a DSA/DSHASH snapshot feels kind of dirty anyway.
>
>
> Does anyone give me opinions or suggestions?
>

The implementation of autovacuum work-items was changed (by commit
31ae1638) because dynamic shared memory is not portable enough. IIUC
this patch is going to do a similar thing. Since the stats collector is
also a critical part of the server, should we consider whether we can
change it? Or is the portability problem not relevant to this patch?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: update tuple routing and triggers

2018-02-05 Thread Amit Langote
Thank you both for the review.

I updated the comment in ExecSetupPartitionTupleRouting considering the
point both of you raised.

About renaming es_leaf_result_relations to
es_tuple_routing_result_relations, I will defer that to the committer.  But
on second thought, maybe we don't need to make this patch larger than it has
to be.

Thanks,
Amit
From d001de96ab452e48b011db7f34840a3e6b8999c9 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 6 Feb 2018 10:40:45 +0900
Subject: [PATCH v3] Fix trigger behavior with update tuple routing

Trigger stats shown by EXPLAIN ANALYZE may contain duplicate
entries, because leaf partitions' ResultRelInfo is redundantly
present in multiple EState lists.

Reported by: Etsuro Fujita
---
 src/backend/executor/execPartition.c | 10 +++++++---
 src/include/nodes/execnodes.h        |  5 ++++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 106a96d910..8094dbc614 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -178,6 +178,13 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
  resultRTindex,
  rel,
  
estate->es_instrument);
+
+   /*
+* Since we're newly creating this ResultRelInfo, add 
it to
+* someplace where others could find it.
+*/
+   estate->es_leaf_result_relations =
+   lappend(estate->es_leaf_result_relations, 
leaf_part_rri);
}
 
part_tupdesc = RelationGetDescr(partrel);
@@ -210,9 +217,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
mtstate != NULL &&
mtstate->mt_onconflict 
!= ONCONFLICT_NONE);
 
-   estate->es_leaf_result_relations =
-   lappend(estate->es_leaf_result_relations, 
leaf_part_rri);
-
proute->partitions[i] = leaf_part_rri;
i++;
}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a2a2a9f3d4..1915b53b2f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -466,7 +466,10 @@ typedef struct EState
ResultRelInfo *es_root_result_relations;/* array of 
ResultRelInfos */
int es_num_root_result_relations;   /* length of 
the array */
 
-   /* Info about leaf partitions of partitioned table(s) for insert 
queries: */
+   /*
+* The following list contains ResultRelInfo's created by the tuple
+* routing code for partitions that don't already have one.
+*/
List   *es_leaf_result_relations;   /* List of ResultRelInfos */
 
/* Stuff used for firing triggers: */
-- 
2.11.0



Re: update tuple routing and triggers

2018-02-05 Thread Amit Langote
On 2018/02/06 13:56, Amit Khandekar wrote:
> I was wondering whether the same duplicate result rels issue can arise
> for es_root_result_relations and es_result_relations. But I think for
> inserts, es_root_result_relations is NULL, and for updates, these two
> lists always have distinct set of relations. So this issue might not
> be there for these two lists.

Yeah, we don't need to worry about es_root_result_relations.

Thanks,
Amit




Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-05 Thread Pavan Deolasee
On Tue, Feb 6, 2018 at 9:50 AM, Peter Geoghegan  wrote:

> On Mon, Feb 5, 2018 at 7:56 PM, Robert Haas  wrote:
> > I don't think you get to make a unilateral decision to exclude
> > features that work everywhere else from the scope of this patch.  If
> > there is agreement that those features can be left out of scope, then
> > that is one thing, but so far all the commentary about the things that
> > you've chosen to exclude has been negative.  Nor have you really given
> > any reason why they should be exempt.  You've pointed out that
> > parallel query doesn't handle everything (which is certainly true, but
> > does not mean that any feature from now and the end of time is allowed
> > to exclude from scope whatever seems inconvenient regardless of
> > contrary community consensus) and you've pointed out here and
> > elsewhere that somebody could go add the features you omitted later
> > (which is also true, but misses the general point that we want
> > committed patches to be reasonably complete already, not have big gaps
> > that someone will have to fix later).
>
> For me, the concern is not really the omission of support for certain
> features as such. The concern is that those omissions hint that there
> is a problem with the design itself, particularly in the optimizer.
> Allowing subselects in the UPDATE part of a MERGE does not seem like
> something that could be written as a neat adjunct to what Simon already
> came up with. If that were possible, Simon probably would have done it
> already.
>
>
As someone who's helping Simon with that part of the code, I must say that
the omission of sub-selects in the UPDATE targetlist and WHEN quals is not
because of any known design problem.  So while it may be true that we have
a design problem, it's also quite likely that we are just missing some
planner/optimiser trick, and once we add those missing pieces it will start
working. The same is the case with RLS.

Partitioned tables are something I am actively working on. I must say that
the very fact that INSERT and UPDATE/DELETE take completely different paths
for partitioned/inherited tables makes MERGE quite difficult, because it has
to carry out both operations and hence requires all of the associated
machinery. If I understand correctly, INSERT ON CONFLICT must have faced
similar problems, and hence DO UPDATE does not work with partitioned tables.
I am not sure if that choice was made when INSERT ON CONFLICT was
implemented or when partitioned table support was added. But the challenges
look similar.
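
For illustration only, with hypothetical table names, the kind of statement
we have to support is roughly:

MERGE INTO measurement m           -- partitioned target table
USING incoming_batch b
  ON m.city_id = b.city_id AND m.logdate = b.logdate
WHEN MATCHED THEN
  UPDATE SET peaktemp = b.peaktemp
WHEN NOT MATCHED THEN
  INSERT (city_id, logdate, peaktemp)
  VALUES (b.city_id, b.logdate, b.peaktemp);

The WHEN NOT MATCHED arm needs the insert tuple-routing machinery, while the
WHEN MATCHED arm needs the UPDATE/DELETE inheritance expansion, which is why
both code paths have to be carried along.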

I first tried to treat MERGE similarly to the UPDATE/DELETE case and ensure
that the INSERTs go through the root partition. That mostly works, but the
RIGHT OUTER join between the child tables and the source relation ends up
emitting duplicate rows when the partitioned table is the resultRelation and
gets expanded in inheritance_planner(). That's a blocker. So what I am
trying now is to push the join between the Append relation and the source
relation below the ModifyTable node, so that we get the final join result.
We can then look up the tableoid in the row returned from the join, find
the corresponding result relation and then carry out the MERGE actions.
Note that unlike regular ExecModifyTable(), here we must execute just one
subplan, as that will return all the required tuples.

Does anyone see a potential blocker with this approach, other than that it
may not be the most elegant way? I think EvalPlanQual might need some
treatment, because when the plan is re-executed it will expect to find the
updated tuple in the slot of the underlying query's RTE and not in the
resultRelation's RTE, which does not participate in the join at all. Is
there anything else I could be missing completely?
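
As a reminder of what that lookup relies on: tableoid is already available
at the SQL level and identifies the child relation a row comes from, e.g.
with a hypothetical partitioned table:

SELECT tableoid::regclass AS partition, count(*)
FROM measurement
GROUP BY tableoid;

The idea is to use the same piece of information, carried in the row
returned from the join, to pick the result relation.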

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Query running for very long time (server hanged) with parallel append

2018-02-05 Thread Amit Khandekar
On 6 February 2018 at 10:11, Kyotaro HORIGUCHI
 wrote:
>> At Mon, 5 Feb 2018 15:29:27 +0530, Amit Khandekar  
>> wrote in
>> > Attached is a patch that fixes this issue on the above lines.
>>
>> The patch adds two new variables and always sets them, but wrap
>> around can happen at most once per call, so I think it is enough
>> to arrange to stop at the wrap-around point. In addition to that,
>> the patch forgets the case of no partial plan. The loop can
>> overrun pa_finished[] in that case.
>
> Sorry, the patch works fine. But still the new variables don't
> seem needed.

True, I could have initially set as_whichplan to pa_next_plan, and
used as_whichplan instead of initial_plan. And I could have used the
condition initial_plan >= append->first_partial_plan instead of
should_wrap_around. The reason I used these two variables is that I
thought the logic would be more reader-friendly. (Also,
should_wrap_around is computed from values that don't change during the
loop, so using this variable avoids re-evaluating the condition
(initial_plan >= append->first_partial_plan) at each iteration.)

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [PATCH] Lockable views

2018-02-05 Thread Tatsuo Ishii
> But what does that have to do with locking?

Well, if the view is not updatable, I think there is less point in
allowing the base tables of the view to be locked, because locking is
typically used in cases where updates are required.

Of course we could add special triggers to allow updating views that
are not automatically updatable, but that kind of view tends to be
complex, and IMO there is less need for the automatic view locking
feature there.
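
For example (hypothetical names; the LOCK behaviour is what the patch under
discussion would provide), the kind of automatically updatable view I have
in mind is:

CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
CREATE VIEW positive_accounts AS
    SELECT id, balance FROM accounts WHERE balance > 0;

BEGIN;
-- with the proposed patch this would also lock the base table "accounts"
LOCK TABLE positive_accounts IN SHARE ROW EXCLUSIVE MODE;
UPDATE positive_accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;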

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Query running for very long time (server hanged) with parallel append

2018-02-05 Thread Rajkumar Raghuwanshi
On Mon, Feb 5, 2018 at 3:29 PM, Amit Khandekar 
wrote:

>
> Attached is a patch that fixes this issue on the above lines.
>

The patch applied cleanly and works fine for me. The mentioned issue is
not reproducible now.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: Add more information_schema columns

2018-02-05 Thread Michael Paquier
On Mon, Feb 05, 2018 at 08:59:31PM -0500, Peter Eisentraut wrote:
> Here is a patch that fills in a few more information schema columns, in
> particular those related to the trigger transition tables feature.

It is unfortunate that this cannot be backpatched.  Here are a few
comments; the logic and the definitions look correct to me.

> -   CAST(null AS cardinal_number) AS action_order,
> -- XXX strange hacks follow
> +   CAST(rank() OVER (PARTITION BY n.oid, c.oid, em.num, (t.tgtype & 
> 1 & 66) ORDER BY t.tgname) AS cardinal_number) AS action_order,

Better to use parentheses for (t.tgtype & 1 & 66) perhaps? You may want
to add a comment that this groups by row vs. statement level first, and
then by after/before/instead of, which is what the 1 and the 66 are for.
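
For reference, per the trigger type bits in pg_trigger.h, 1 is
TRIGGER_TYPE_ROW and 66 is TRIGGER_TYPE_BEFORE | TRIGGER_TYPE_INSTEAD
(2 | 64), so the two masks split the triggers like this (just an
illustration, not part of the patch):

SELECT tgname,
       tgtype & 1  AS row_level,    -- 1 = row-level, 0 = statement-level
       tgtype & 66 AS timing_bits   -- 2 = BEFORE, 64 = INSTEAD OF, 0 = AFTER
FROM pg_trigger
WHERE NOT tgisinternal;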

> -   CAST(null AS sql_identifier) AS action_reference_old_table,
> -   CAST(null AS sql_identifier) AS action_reference_new_table,
> +   CAST(tgoldtable AS sql_identifier) AS action_reference_old_table,
> +   CAST(tgnewtable AS sql_identifier) AS action_reference_new_table,

> +SELECT trigger_name, event_manipulation, event_object_schema,
> event_object_table, action_order, action_condition,
> action_orientation, action_timing, action_reference_old_table,
> action_reference_new_table FROM information_schema.triggers ORDER BY
> 1, 2;

Writing those SQL queries across multiple lines would make them easier
to read...
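
For instance, the one quoted above could read:

SELECT trigger_name, event_manipulation, event_object_schema,
       event_object_table, action_order, action_condition,
       action_orientation, action_timing,
       action_reference_old_table, action_reference_new_table
FROM information_schema.triggers
ORDER BY 1, 2;
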
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-05 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 2 Feb 2018 19:52:02 -0300, Claudio Freire  
wrote in 
> On Thu, Jan 25, 2018 at 6:21 PM, Thomas Munro
>  wrote:
> > On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire  
> > wrote:
> >> I had the tests running in a loop all day long, and I cannot reproduce
> >> that variance.
> >>
> >> Can you share your steps to reproduce it, including configure flags?
> >
> > Here are two build logs where it failed:
> >
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511
> >
> > Here's one where it succeeded:
> >
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855
> >
> > The full build script used is:
> >
> > ./configure --enable-debug --enable-cassert --enable-coverage
> > --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap
> > --with-icu && make -j4 all contrib docs && make -Otarget -j3
> > check-world
> >
> > This is a virtualised 4 core system.  I wonder if "make -Otarget -j3
> > check-world" creates enough load on it to produce some weird timing
> > effect that you don't see on your development system.
> 
> I can't reproduce it, not even with the same build script.

I got the same error with "make -j3 check-world", but only twice
in many trials.

> It's starting to look like a timing effect indeed.

It seems to be a skipped truncation, maybe caused by a concurrent
autovacuum. See lazy_truncate_heap() for details. Updates of
pg_stat_*_tables can be delayed, so looking at them can also fail. Even
though I haven't looked at the patch closely, the "SELECT
pg_relation_size()" doesn't seem to give anything meaningful
anyway.

> I get a similar effect if there's an active snapshot in another
> session while vacuum runs. I don't know how the test suite ends up in
> that situation, but it seems to be the case.
> 
> How do you suggest we go about fixing this? The test in question is
> important, I've caught actual bugs in the implementation with it,
> because it checks that vacuum effectively frees up space.
> 
> I'm thinking this vacuum test could be put on its own parallel group
> perhaps? Since I can't reproduce it, I can't know whether that will
> fix it, but it seems sensible.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




best way to check identical constraint between databases

2018-02-05 Thread Marc Cousin

Hi all,


I'm trying to extend pg_tap/pg_tapgen in order to be able to check that 
two databases have the exact same constraints (for all available types 
of constraints).



I am wondering if there is a better way to check that two constraints
are equal than using pg_get_constraintdef, which returns plain text whose
formatting may change between releases.


Using the pg_constraint columns directly could have the same problem
anyway. For check constraints, for instance, consrc could change
formatting (even if that's unlikely), and parsing consrc would add much
complexity to pg_tap anyway… type OIDs could also differ, so they would
have to be resolved in both databases.



So the question is: is there a better way of checking that a constraint
in one database is the same as in another than using pg_get_constraintdef?
(and printing a warning in the doc that it is only guaranteed to work
with the same PG release).
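
For reference, a rough sketch of such a pg_get_constraintdef-based check
(the same query run on both databases, with the outputs diffed) would be:

SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       con.conname,
       pg_get_constraintdef(con.oid) AS definition
FROM pg_constraint con
JOIN pg_class c ON c.oid = con.conrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY 1, 2, 3;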



Regards

Marc




Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-05 Thread Masahiko Sawada
On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire  wrote:
> On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada  wrote:
>> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire  
>> wrote:
>>> After autovacuum gets cancelled, the next time it wakes up it will
>>> retry vacuuming the cancelled relation. That's because a cancelled
>>> autovacuum doesn't update the last-vacuumed stats.
>>>
>>> So the timing between an autovacuum work item and the next retry for
>>> that relation is more or less an autovacuum nap time, except perhaps
>>> in the case where many vacuums get cancelled, and they have to be
>>> queued.
>>
>> I think that's not true if there are multiple databases.
>
> I'd have to re-check.
>
> The loop is basically, IIRC:
>
> while(1) { vacuum db ; work items ; nap }
>
> Now, if that's vacuum one db, not all, and if the decision on the
> second run doesn't pick the same db because that big table failed to
> be vacuumed, then you're right.
>
> In that case we could add the FSM vacuum as a work item *in addition*
> to what this patch does. If the work item queue is full and the FSM
> vacuum doesn't happen, it'd be no worse than with the patch as-is.
>
> Is that what you suggest?

That's one of my suggestions. I might have to consider this issue for
each case. To be clear, let me summarize each case.

For a table with indexes, the vacuum of the heap's FSM is done after
lazy_vacuum_heap(). Therefore I think the case where a table got
vacuumed but its FSM couldn't get vacuumed doesn't happen unless the
autovacuum gets cancelled before or while vacuuming the FSM. So it can
solve the problem in most cases. Also, we can use the vacuum progress
information to check whether we should vacuum the table's FSM after
getting cancelled. For the FSM of an index, we might have to consider
vacuuming it after lazy_vacuum_index().
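
For instance, pg_stat_progress_vacuum shows how far a running vacuum has
got, which is presumably the kind of progress information meant above
(it only has rows while the vacuum is actually running):

SELECT pid, relid::regclass, phase,
       heap_blks_total, heap_blks_scanned, heap_blks_vacuumed
FROM pg_stat_progress_vacuum;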

For a table with no indexes, it would be more complicated; as for a
table with indexes, we can vacuum the table's FSM more frequently, but
table bloat can still happen. Considering ways to reliably vacuum the
table's FSM, there are some approaches:
(1) using an autovacuum work-item, (2) vacuuming the table's FSM in
PG_CATCH, (3) remembering the tables that got cancelled and vacuuming
them after finishing the loop over table_oids.

For (1), we have the issue that the work-item queue will fill up when
many tables get cancelled, and it's not a good idea to queue many
redundant work-items. For (2), it would not be good to vacuum the
table's FSM immediately after the cancellation, because right after
being cancelled the table is still likely to be locked by others. For
(3), that might work fine, but it can happen that another autovacuum
worker vacuums the table's FSM before we process the list. Still, it
would be better than always vacuuming them at the beginning of vacuum,
so I'm in favor of (3). Even when processing tables in the list, we
should take the lock on the table conditionally so that autovacuum
doesn't block any foreground work. However, it's quite possible that
I'm not seeing the whole picture here.
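
Incidentally, when experimenting with these cases, contrib's
pg_freespacemap makes it easy to check by hand whether a table's FSM
already records the freed space. A rough sketch, with a hypothetical
table name:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

SELECT blkno, avail
FROM pg_freespace('some_table')
WHERE avail > 0
ORDER BY avail DESC
LIMIT 10;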

>
> With that in mind, I'm noticing WorkItems have an avw_database that
> isn't checked by do_autovacuum. Is that right? Shouldn't only work
> items that belong to the database being autovacuumed be processed?
>>> That's why an initial FSM vacuum makes sense. It has a similar timing
>>> to the autovacuum work item, it has the benefit that it can be
>>> triggered manually (manual vacuum), and it's cheap and efficient.
>>>
 Also the patch always vacuums fsm at beginning of the vacuum with a
 threshold but it is useless if the table has been properly vacuumed. I
 don't think it's good idea to add an additional step that "might be"
 efficient, because current vacuum is already heavy.
>>>
>>> FSMs are several orders of magnitude smaller than the tables
>>> themselves. A TB-sized table I have here has a 95M FSM. If you add
>>> threshold skipping, that initial FSM vacuum *will* be efficient.
>>>
>>> By definition, the FSM will be less than 1/4000th of the table, so
>>> that initial FSM pass takes less than 1/4000th of the whole vacuum.
>>> Considerably less considering the simplicity of the task.
>>
>> I agree the FSM is much smaller than the heap and vacuuming the FSM will
>> not be comparatively heavy, but I'm afraid that vacuum will get
>> heavier in the future if we pile up improvements that are small but
>> might not be efficient. For example, a feature for reporting the last
>> vacuum status has been proposed[1]. I wonder if we can use it to
>> determine whether we do the FSM vacuum at the beginning of vacuum.
>
> Yes, such a feature would allow skipping that initial FSM vacuum. That
> can be improved in a future patch if that proposed feature gets
> merged. This patch can be treated independently from that in the
> meanwhile, don't you think?

IMO, I'd like to have it after we have such an optimization.
Otherwise it will result in adding extra steps for every vacuum until
we get the optimization. But, as you said, we also ca