Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Thomas Munro
On Fri, Feb 17, 2017 at 12:45 AM, Simon Riggs  wrote:
> Feeling happier about this for now at least.

Thanks!

> I think we need to document how this works more in README or header
> comments. That way I can review it against what it aims to do rather
> than what I think it might do.

I have added a bunch of new comments to explain this in the -v2 patch
(see my reply to Abhijit).  Please let me know if you think I need to add
still more.  I'm especially interested in your feedback on the block
of comments above the line:

+   LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());

Specifically, your feedback on the sufficiency of this (LSN, time)
pair + filtering out repeat LSNs as an approximation of the time this
LSN was flushed.
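
To summarise the scheme without making you read the whole patch: a sample
is recorded only when the LSN advances, and a full buffer rewinds by one
slot so the last sample is overwritten.  In pseudo-standalone form (this
is not the code from the patch; names, types and sizes are invented, and
the circular-buffer read side that consumes samples when standby replies
arrive is omitted):

#include <stdint.h>

typedef uint64_t LsnSketch;     /* stands in for XLogRecPtr */
typedef int64_t TimeSketch;     /* stands in for TimestampTz (microseconds) */

#define LAG_BUFFER_SIZE 8192

typedef struct
{
    LsnSketch  lsn;
    TimeSketch time;
} LagSample;

static LagSample lag_buffer[LAG_BUFFER_SIZE];
static int       lag_write_head = 0;
static LsnSketch lag_last_lsn = 0;

static void
lag_tracker_write_sketch(LsnSketch lsn, TimeSketch now)
{
    /* Only record a new sample when new WAL has been written. */
    if (lsn <= lag_last_lsn)
        return;
    lag_last_lsn = lsn;

    /*
     * If the buffer is full, rewind one slot and overwrite the last
     * sample, as a crude way to lower the sampling rate.
     */
    if (lag_write_head == LAG_BUFFER_SIZE)
        lag_write_head--;

    lag_buffer[lag_write_head].lsn = lsn;
    lag_buffer[lag_write_head].time = now;
    lag_write_head++;
}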

> e.g. We need to document what replay_lag represents. Does it include
> write_lag and flush_lag, or is it the time since the flush_lag. i.e.
> do I add all 3 together to get the full lag, or would that cause me to
> double count?

I have included full descriptions of exactly what the 3 times
represent in the user documentation in the -v2 patch.

> How sensitive is this? Does the lag spike quickly and then disappear
> again quickly? If we're sampling this every N seconds, will we get a
> realistic viewpoint or just a random sample?

In my testing it seems to move fairly smoothly so I think sampling
every N seconds would be quite effective and would not be 'noisy'.
The main time it jumps quickly is at the end of a large data load,
when a slow standby finally reaches the end of its backlog; you see it
climb slowly up and up while the faster primary is busy generating WAL
too fast for it to apply, but then if the primary goes idle the
standby eventually catches up.  The high lag number sometimes lingers
for a bit and then pops down to a low number when new WAL arrives that
can be applied quickly.  It seems like a very accurate depiction of
what is really happening so I like that.  I would love to hear other
opinions and feedback/testing experiences!

> Should we smooth the
> value, or present peak info?

Hmm.  Well, it might be interesting to do online exponential moving
averages, similar to the three numbers Unix systems present for load.
On the other hand, I'm amazed no one has complained that I'm making
pg_stat_replication ridiculously wide already, users/monitoring systems
could easily do that kind of thing themselves, and the number doesn't
seem to be jumping/noisy/in need of smoothing.  The same goes for
logging over time; that seems like an external monitoring tool's
bailiwick.
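
For example (just a sketch, with invented names), a monitoring tool
sampling pg_stat_replication every interval_secs seconds could fold each
lag sample into load-average-style exponential moving averages along
these lines:

#include <math.h>

typedef struct
{
    double ema_1min;
    double ema_5min;
    double ema_15min;
} LagAverages;

/* Fold one new lag sample (in seconds) into the moving averages. */
static void
lag_averages_update(LagAverages *avg, double sample, double interval_secs)
{
    /* alpha = 1 - exp(-interval/period), as for kernel load averages */
    double a1 = 1.0 - exp(-interval_secs / 60.0);
    double a5 = 1.0 - exp(-interval_secs / 300.0);
    double a15 = 1.0 - exp(-interval_secs / 900.0);

    avg->ema_1min += a1 * (sample - avg->ema_1min);
    avg->ema_5min += a5 * (sample - avg->ema_5min);
    avg->ema_15min += a15 * (sample - avg->ema_15min);
}

The 60/300/900 second periods here just mirror the familiar 1/5/15 minute
load averages; any periods would do.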

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioned tables and relfilenode

2017-02-16 Thread Amit Langote
On 2017/02/16 23:40, Robert Haas wrote:
> On Thu, Feb 16, 2017 at 6:32 AM, Simon Riggs  wrote:
>> Why not vacuum all partitions?
>> Why not analyze all partitions?
>> Truncate all partitions
> 
> I agree.  But, we need to be careful that a database-wide VACUUM or
> ANALYZE doesn't hit the partitions multiple times, once for the parent
> and again for each child.  Actually, a database-wide VACUUM should hit
> each partition individually and do nothing for the parents, but a

This is what would happen even without the patch.  The patch only
modifies what happens when a partitioned table is specified in the vacuum
command.  It emits a warning:

WARNING: skipping "%s" --- cannot vacuum partitioned tables

It seems both you and Simon agree that instead of this warning, we should
recurse to process the leaf partitions (ignoring any partitioned tables in
the hierarchy for which there is nothing to do).  If that's right, I will
modify the patch to do that.

> database-wide ANALYZE should process the parents and do nothing for
> the children, so that the inheritance statistics get updated.

Currently vacuum() processes the following relations:

    /*
     * Process all plain relations and materialized views listed in
     * pg_class
     */

    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);

        if (classForm->relkind != RELKIND_RELATION &&
            classForm->relkind != RELKIND_MATVIEW)
            continue;

Do you mean that if database-wide analyze is to be run, we should also
exclude those RELKIND_RELATION relations that are partitions?

In that case, the only way to update a partition's statistics would be to
specify it directly in the command or to have autovacuum process it.

Truncate already recurses to partitions by way of inheritance recursion
that's already in place.  The patch simply teaches ExecuteTruncate() to
ignore partitioned tables when we get to the part where relfilenodes are
manipulated.

>>> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>>>   tables, because there is nothing to be done.
>>>
>>> - Since we cannot create indexes on partitioned tables anyway, there is
>>>   no need to handle cluster and reindex (they throw a meaningful error
>>>   already due to the lack of indexes.)
>>
>> Create index on all partitions
> 
> That one's more complicated, per what I wrote in
> https://www.postgresql.org/message-id/CA+TgmoZUwj=qynak+f7xef4w_e2g3xxdmnsnzmzjuinhrco...@mail.gmail.com

I agree that it would be nice to fix this usability issue.  But, as you
say in the quoted email, it's also important that we not underestimate
the amount of work that might be required to implement it properly.

>> (It also seems like wasted effort to try to remove the overhead caused
>> by a parent table for partitioning. Why introduce a special case to
>> save a few bytes? Premature optimization, surely?)
> 
> I don't think it's wasted effort, actually.  My concern isn't so much
> the empty file on disk (which is stupid, but maybe harmless) as
> eliminating the dummy scan from the query plan.  I believe the
> do-nothing scan can actually be a noticeable drag on performance in
> some cases - e.g. if the scan of the partitioned table is on the
> inside of a nested loop, so that instead of repeatedly doing an index
> scan on each of 4 partitions, you repeatedly do an index scan on each
> of 4 partitions and a sequential scan of an empty table.  A zero-page
> sequential scan is pretty fast, but not free.  An even bigger problem
> is that the planner may think that always-empty parent can contain
> some rows, throwing planner estimates off and messing up the whole
> plan.  We've been living with that problem for a long time, but now
> that we have an opportunity to fix it, it would be good to do so.

Agreed.  I would have reconsidered if it were a more invasive patch.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Thomas Munro
On Thu, Feb 16, 2017 at 11:18 PM, Abhijit Menon-Sen  
wrote:
> Hi Thomas.
>
> At 2017-02-15 00:48:41 +1300, thomas.mu...@enterprisedb.com wrote:
>>
>> Here is a new version with the buffer on the sender side as requested.
>
> This looks good.

Thanks for the review!

>> + write_lag
>> + interval
>> + Estimated time taken for recent WAL records to be written on 
>> this
>> +  standby server
>
> I think I would find a slightly more detailed explanation helpful here.

Fixed.

> A few tiny nits:
>
>> +  * If the lsn hasn't advanced since last time, then do nothing.  This 
>> way
>> +  * we only record a new sample when new WAL has been written, which is
>> +  * simple proxy for the time at which the log was written.
>
> "which is simple" → "which is a simple"

Fixed.

>> +  * If the buffer if full, for now we just rewind by one slot and 
>> overwrite
>> +  * the last sample, as a simple if somewhat uneven way to lower the
>> +  * sampling rate.  There may be better adaptive compaction algorithms.
>
> "buffer if" → "buffer is"

Fixed.

>> + * Find out how much time has elapsed since WAL position 'lsn' or earlier 
>> was
>> + * written to the lag tracking buffer and 'now'.  Return -1 if no time is
>> + * available, and otherwise the elapsed time in microseconds.
>
> Find out how much time has elapsed "between X and 'now'", or "since X".
> (I prefer the former, i.e., s/since/between/.)

Fixed.

I also added some more comments in response to Simon's request for
more explanation of how it works (but will reply to his email
separately).  Please find version 2 attached.

-- 
Thomas Munro
http://www.enterprisedb.com


replication-lag-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-16 Thread Amit Kapila
On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila  wrote:
> On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas  wrote:
>
> Attached are refactoring patches.  WAL patch needs some changes based
> on above comments, so will post it later.
>

Attached is a rebased patch to enable WAL logging for hash indexes.
Note that this needs to be applied on top of the refactoring patches [1]
sent in the previous e-mail.

[1]  - 
https://www.postgresql.org/message-id/CAA4eK1JCk-abtsVMMP8xqGZktUH73ZmLaZ6b_%2B-oCRtRkdqPrQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0006-Enable-WAL-for-Hash-Indexes.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-16 Thread Kouhei Kaigai
Hello,

The attached patch is the revised one.

The invocation of Exec(Foreign|Custom)ParallelFinish was moved to
ExecParallelRetrieveInstrumentation() so that we do not walk the
plan-state tree twice.
One (hypothetical) downside is that FDW/CSP can retrieve its own
run-time statistics only when the query is executed under EXPLAIN
ANALYZE.

This enhancement allows an FDW/CSP to collect its own run-time
statistics beyond what Instrumentation provides, and then show them
in the EXPLAIN output.  The examples I have in mind are GPU kernel
execution time, DMA transfer ratio and so on.  These statistics will
never appear in the Instrumentation structure, yet they can be the
hot spot of a performance bottleneck if a CustomScan runs on
background workers.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Monday, February 06, 2017 3:37 PM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Amit Kapila ; Robert Haas
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Feb 6, 2017 at 1:42 AM, Kouhei Kaigai  wrote:
> > I also had thought an idea to have extra space to Instrumentation
> > structure, however, it needs to make Instrumentation flexible-length
> > structure according to the custom format by CSP/FDW. Likely, it is not
> a good design.
> > As long as extension can retrieve its custom statistics on DSM area
> > required by ExecParallelEstimate(), I have no preference on the hook
> location.
> 
> That's what I had in mind: the hook happens there, but the extension
> retrieves the information from some extension-specific DSM area, just as
> it would on the ParallelFinish hook.
> 
> > One thing we may pay attention is, some extension (not mine) may want
> > to collect worker's statistics regardless of Instrumentation (in other
> > words, even if plan is not under EXPLAIN ANALYZE).
> > It is the reason why I didn't put a hook under the
> ExecParallelRetrieveInstrumentation().
> 
> I don't think you should worry about that as long as it's a hypothetical
> case.
> 
> If/when some extension actually needs to do that, the design can be discussed
> with a real use case at hand, and not a hypothetical one.
> 
> 
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
> changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


parallel-finish-fdw_csp.v2.patch
Description: parallel-finish-fdw_csp.v2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Amit Langote
On 2017/02/17 14:50, Peter Geoghegan wrote:
> On Thu, Feb 16, 2017 at 9:27 PM, Amit Langote
>  wrote:
>> Attached patch fixes that.  Thom, your example query should not error out
>> with the patch.  As discussed here, DO UPDATE cannot be supported at the
>> moment.
> 
> Maybe you should just let infer_arbiter_indexes() fail, rather than
> enforcing this directly. IIRC, that's what happens with
> inheritance-based partitioning.

That would be another way.  The error message emitted by
infer_arbiter_indexes() would be:

ERROR:  there is no unique or exclusion constraint matching the ON
CONFLICT specification

It does read better than what the proposed patch makes
transformOnConflictClause() emit:

ERROR:  ON CONFLICT ON UPDATE clause is not supported with partitioned tables

I updated the patch.  Now it's reduced to simply removing the check in
transformInsertStmt() that prevented using *any* ON CONFLICT on
partitioned tables at all.


I don't, however, see why the error would *necessarily* occur in the case
of inheritance partitioning.  I mean, if one inserts into the root table
of an inheritance hierarchy, it's still possible to do ON CONFLICT DO
UPDATE using the unique index on just that table for inference, although
that may not be what a user would intend to do.

create table foo (a int, b int, unique (a));
create table foo_part (like foo including indexes) inherits (foo);
insert into foo values (1, 2);

-- the following still works

insert into foo values (1, 2)
   on conflict (a) do update set b = excluded.b where excluded.a = 1;
insert into foo values (1, 2)
   on conflict (a) do update set b = excluded.b where excluded.a = 1;

As the documentation about inheritance partitioning notes, that may not be
the behavior expected for partitioned tables:


 INSERT statements with ON CONFLICT
 clauses are unlikely to work as expected, as the ON CONFLICT
 action is only taken in case of unique violations on the specified
 target relation, not its child relations.


With partitioned tables, since it's not possible to create index
constraints on them, ON CONFLICT DO UPDATE simply won't work.  So the
patch also updates the note in the document about partitioned tables and
ON CONFLICT.

Thanks,
Amit
>From 188f9e64402ce70f36e48274927fc6d5784319fa Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 17 Feb 2017 14:18:01 +0900
Subject: [PATCH] ON CONFLICT DO NOTHING should work with partitioned tables

Currently, a check in transformInsertStmt() prevents *any*
ON CONFLICT clause from being specified on a partitioned table,
even those specifying DO NOTHING as the alternative action.  It
is harmless to allow those, so remove that check.  It would still
not be possible to use DO UPDATE with partitioned table though,
because infer_arbiter_indexes() will eventually error out upon
failing to find a unique/exclusion constraint.  Remember it is
not at the moment possible to create these constraints on
partitioned tables.

Adds a test and updates the note in document about using ON CONFLICT
with partitioned tables.
---
 doc/src/sgml/ddl.sgml |  9 ++---
 src/backend/parser/analyze.c  |  8 
 src/test/regress/expected/insert_conflict.out | 10 ++
 src/test/regress/sql/insert_conflict.sql  | 10 ++
 4 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index f909242e4c..c99951a660 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3862,9 +3862,12 @@ ANALYZE measurement;
  
 
  
-  INSERT statements with ON CONFLICT
-  clause are currently not allowed on partitioned tables, that is,
-  cause error when specified.
+  Using the ON CONFLICT clause with partitioned tables
+  will cause an error if DO UPDATE is specified as the
+  alternative action, because it requires specifying a unique or exclusion
+  constraint to determine if there is a conflict.  Currently, it is not
+  possible to create indexes on partitioned tables required to implement
+  such constraints.
  
 
 
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 0f7659bb6b..a25a7c503a 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -843,16 +843,8 @@ transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
 
 	/* Process ON CONFLICT, if any. */
 	if (stmt->onConflictClause)
-	{
-		/* Bail out if target relation is partitioned table */
-		if (pstate->p_target_rangetblentry->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("ON CONFLICT clause is not supported with partitioned tables")));
-
 		qry->onConflict = transformOnConflictClause(pstate,
 	stmt->onConflictClause);
-	}
 
 	/*
 	 * If we have a RETURNING clause, we need to add the target relation to
diff --git a/src/test/regress/expected/insert_conflict.out 

[HACKERS] pg_recvlogical.c doesn't build with --disable-integer-datetimes

2017-02-16 Thread Thomas Munro
Hi,

Since commit 7c030783a5bd07cadffc2a1018bc33119a4c7505 it seems that $SUBJECT.

pg_recvlogical.c:501:37: error: incompatible pointer types passing
'int64 *' (aka 'long *') to parameter of type 'TimestampTz *' (aka
'double *') [-Werror,-Wincompatible-pointer-types]
if (!flushAndSendFeedback(conn, ))

pg_recvlogical.c:547:36: error: incompatible pointer types passing
'int64 *' (aka 'long *') to parameter of type 'TimestampTz *' (aka
'double *') [-Werror,-Wincompatible-pointer-types]
if (!flushAndSendFeedback(conn, ))

pg_recvlogical.c:596:36: error: incompatible pointer types passing
'int64 *' (aka 'long *') to parameter of type 'TimestampTz *' (aka
'double *') [-Werror,-Wincompatible-pointer-types]
if (!flushAndSendFeedback(conn, ))

Maybe the attached is the right fix for that?  (Sorry, I didn't test
because I haven't got around to learning how to drive this new stuff
yet.)

-- 
Thomas Munro
http://www.enterprisedb.com


fix-pg-recvlogical-with-float-timestamps.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-02-16 Thread Ashutosh Bapat
On Thu, Feb 16, 2017 at 8:15 PM, Robert Haas  wrote:
> On Wed, Feb 15, 2017 at 11:15 PM, Ashutosh Bapat
>  wrote:
>> If the user is ready throw 200 workers and if the subplans can use
>> them to speed up the query 200 times (obviously I am exaggerating),
>> why not to use those? When the user set
>> max_parallel_workers_per_gather to that high a number, he meant it to
>> be used by a gather, and that's what we should be doing.
>
> The reason is because of what Amit Khandekar wrote in his email -- you
> get a result with a partitioned table that is wildly inconsistent with
> the result you get for an unpartitioned table.  You could equally well
> argue that if the user sets max_parallel_workers_per_gather to 200,
> and there's a parallel sequential scan of an 8MB table to be
> performed, we ought to use all 200 workers for that.  But the planner
> in fact estimates a much lesser number of workers, because using 200
> workers for that task wastes a lot of resources for no real
> performance benefit.  If you partition that 8MB table into 100 tables
> that are each 80kB, that shouldn't radically increase the number of
> workers that get used.

That's true for a partitioned table, but not necessarily for every
append relation; Amit's patch is generic for all append relations.  If
the child plans are joins or subquery segments of set operations, I
doubt the same logic works.  It may be better to throw as many workers
as specified by those subplans (or some function "summing" those up).
I guess we have to use different logic for append relations which are
base relations and append relations which are not.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Documentation improvements for partitioning

2017-02-16 Thread Ashutosh Bapat
I forgot to attach the patch.  Thanks to Rajkumar for notifying me.

On Fri, Feb 17, 2017 at 11:18 AM, Ashutosh Bapat
 wrote:
> In the documentation of ALTER TABLE ... ATTACH PARTITION its implicit
> that partition name specified should be the name of the existing table
> being attached. Same is the case with DETACH PARTITION where its
> implicit that the detached partition becomes a stand-alone table with
> same name as the partition being detached. I think this needs to be
> more explicit. PFA patch on those lines.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


alter_table_part_doc.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Instability in select_parallel regression test

2017-02-16 Thread Tom Lane
Buildfarm members gaur, pademelon, and gharial have all recently shown
failures like this:

*** 
/home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/select_parallel.out
  Thu Feb 16 20:35:14 2017
--- 
/home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/select_parallel.out
   Thu Feb 16 21:17:58 2017
***
*** 163,167 
  -- provoke error in worker
  select stringu1::int2 from tenk1 where unique1 = 1;
  ERROR:  invalid input syntax for integer: "BA"
- CONTEXT:  parallel worker
  rollback;
--- 163,166 

==

The failure appears intermittent on gharial but is quite reproducible
on gaur/pademelon.  I inserted some debug elog calls and got this trace
of events:

2017-02-17 00:28:32.641 EST [18934] LOG:  RegisterDynamicBackgroundWorker: 
grabbing slot 1
2017-02-17 00:28:32.641 EST [18934] STATEMENT:  select count(*) from a_star;
2017-02-17 00:28:32.643 EST [18934] LOG:  LaunchParallelWorkers: registered 1 
of 1 workers
2017-02-17 00:28:32.643 EST [18934] STATEMENT:  select count(*) from a_star;
2017-02-17 00:28:32.731 EST [18934] LOG:  RegisterDynamicBackgroundWorker: 
grabbing slot 2
2017-02-17 00:28:32.731 EST [18934] STATEMENT:  select length(stringu1) from 
tenk1 group by length(stringu1);
2017-02-17 00:28:32.824 EST [18934] LOG:  LaunchParallelWorkers: registered 1 
of 4 workers
2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from 
tenk1 group by length(stringu1);
2017-02-17 00:28:32.824 EST [18934] LOG:  RegisterDynamicBackgroundWorker: 
grabbing slot 3
2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from 
tenk1 group by length(stringu1);
2017-02-17 00:28:32.923 EST [18934] LOG:  LaunchParallelWorkers: registered 2 
of 4 workers
2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from 
tenk1 group by length(stringu1);
2017-02-17 00:28:32.923 EST [18934] LOG:  RegisterDynamicBackgroundWorker: 
grabbing slot 4
2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from 
tenk1 group by length(stringu1);
2017-02-17 00:28:32.989 EST [18934] LOG:  LaunchParallelWorkers: registered 3 
of 4 workers
2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from 
tenk1 group by length(stringu1);
2017-02-17 00:28:32.989 EST [18934] LOG:  RegisterDynamicBackgroundWorker: 
grabbing slot 5
2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from 
tenk1 group by length(stringu1);
2017-02-17 00:28:33.040 EST [18934] LOG:  LaunchParallelWorkers: registered 4 
of 4 workers
2017-02-17 00:28:33.040 EST [18934] STATEMENT:  select length(stringu1) from 
tenk1 group by length(stringu1);
2017-02-17 00:28:33.093 EST [18934] LOG:  RegisterDynamicBackgroundWorker: 
grabbing slot 6
2017-02-17 00:28:33.093 EST [18934] STATEMENT:  select count(*) from tenk1 
where (two, four) not in
(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.173 EST [18934] LOG:  LaunchParallelWorkers: registered 1 
of 4 workers
2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 
where (two, four) not in
(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.173 EST [18934] LOG:  RegisterDynamicBackgroundWorker: 
grabbing slot 7
2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 
where (two, four) not in
(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.253 EST [18934] LOG:  LaunchParallelWorkers: registered 2 
of 4 workers
2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 
where (two, four) not in
(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.253 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no 
free slots of 8
2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 
where (two, four) not in
(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.254 EST [18934] LOG:  LaunchParallelWorkers: failed to 
register 3 of 4 workers
2017-02-17 00:28:33.254 EST [18934] STATEMENT:  select count(*) from tenk1 
where (two, four) not in
(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.557 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no 
free slots of 8
2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from 
tenk1 where hundred > 1;
2017-02-17 00:28:33.557 EST [18934] LOG:  LaunchParallelWorkers: failed to 
register 1 of 4 workers
2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from 
tenk1 where hundred > 1;
2017-02-17 00:28:33.703 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no 
free slots of 8
2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from 
tenk1 where unique1 = 1;
2017-02-17 00:28:33.703 EST [18934] 

Re: [HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Peter Geoghegan
On Thu, Feb 16, 2017 at 9:27 PM, Amit Langote
 wrote:
> Attached patch fixes that.  Thom, your example query should not error out
> with the patch.  As discussed here, DO UPDATE cannot be supported at the
> moment.

Maybe you should just let infer_arbiter_indexes() fail, rather than
enforcing this directly. IIRC, that's what happens with
inheritance-based partitioning.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Documentation improvements for partitioning

2017-02-16 Thread Ashutosh Bapat
In the documentation of ALTER TABLE ... ATTACH PARTITION it's implicit
that the partition name specified should be the name of the existing
table being attached.  The same is the case with DETACH PARTITION, where
it's implicit that the detached partition becomes a stand-alone table
with the same name as the partition being detached.  I think this needs
to be more explicit.  PFA a patch along those lines.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Amit Langote
On 2017/02/17 13:25, Peter Geoghegan wrote:
> On Thu, Feb 16, 2017 at 8:21 PM, Amit Langote
>  wrote:
>> would be working on a leaf partition chosen by tuple-routing after an
>> insert on a partitioned table.  The leaf partitions can very well have a
>> unique index, which can be used for inference.  The problem however is
>> that infer_arbiter_indexes() in the optimizer would be looking at the root
>> partitioned, which cannot yet have any indexes defined on them, let alone
>> unique indexes.  When we develop a feature where defining an index on the
>> root partitioned table would create the same index on all the leaf
>> partitions and then extend it to support unique indexes, then we can
>> perhaps talk about supporting ON CONFLICT handing.  Does that make sense?
> 
> Yes, that makes sense, but I wasn't arguing that that should be
> possible today. I was arguing that when you don't spell out an
> arbiter, which ON CONFLICT DO NOTHING permits, then it should be
> possible for it to just work today -- infer_arbiter_indexes() will
> return immediately.

I see.  It now seems that I should have realized the DO NOTHING action is
indeed supportable when I initially wrote the code that causes the current
error.

> This should be just like the old approach involving inheritance, in
> that that should be possible. No?

So we should error out only when the DO UPDATE conflict action is
requested, because it requires specifying conflict_target, which is not
possible in the case of partitioned tables.

Attached patch fixes that.  Thom, your example query should not error out
with the patch.  As discussed here, DO UPDATE cannot be supported at the
moment.

Thanks,
Amit
>From 63f2c2d6c47300cb7f1a422a0a5d2697223f55e3 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 17 Feb 2017 14:18:01 +0900
Subject: [PATCH] ON CONFLICT DO NOTHING should work with partitioned tables

The DO NOTHING conflict action does not require one to specify
conflict_target, which would require arbiter indexes to be defined
on the table.  So, only error out if DO UPDATE is requested as
the conflict action.

Adds the test as well.
---
 src/backend/parser/analyze.c  | 20 
 src/test/regress/expected/insert_conflict.out | 10 ++
 src/test/regress/sql/insert_conflict.sql  | 10 ++
 3 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 0f7659bb6b..8e91f2f7c2 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -843,16 +843,8 @@ transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
 
 	/* Process ON CONFLICT, if any. */
 	if (stmt->onConflictClause)
-	{
-		/* Bail out if target relation is partitioned table */
-		if (pstate->p_target_rangetblentry->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("ON CONFLICT clause is not supported with partitioned tables")));
-
 		qry->onConflict = transformOnConflictClause(pstate,
 	stmt->onConflictClause);
-	}
 
 	/*
 	 * If we have a RETURNING clause, we need to add the target relation to
@@ -1010,6 +1002,18 @@ transformOnConflictClause(ParseState *pstate,
 	List	   *exclRelTlist = NIL;
 	OnConflictExpr *result;
 
+	/*
+	 * Bail out if target relation is partitioned table and on conflict
+	 * action is UPDATE; there won't be any arbiter indexes to infer the
+	 * conflict from.  That's because we do not yet support creating
+	 * indexes or index constraints on partitioned tables.
+	 */
+	if (onConflictClause->action == ONCONFLICT_UPDATE &&
+		pstate->p_target_rangetblentry->relkind == RELKIND_PARTITIONED_TABLE)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ON CONFLICT ON UPDATE clause is not supported with partitioned tables")));
+
 	/* Process the arbiter clause, ON CONFLICT ON (...) */
 	transformOnConflictArbiter(pstate, onConflictClause, ,
 			   , );
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 8d005fddd4..d37f57d571 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -786,3 +786,13 @@ select * from selfconflict;
 (3 rows)
 
 drop table selfconflict;
+-- check that the following works:
+-- insert into partitioned_table on conflict do nothing
+create table parted_conflict_test (a int, b char) partition by list (a);
+create table parted_conflict_test_1 partition of parted_conflict_test for values in (1);
+insert into parted_conflict_test values (1, 'a') on conflict do nothing;
+insert into parted_conflict_test values (1, 'a') on conflict do nothing;
+-- however, on conflict do update not supported yet
+insert into parted_conflict_test values (1) on conflict (a) do update set b = excluded.b where excluded.a = 1;
+ERROR:  ON CONFLICT ON UPDATE clause is 

Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2017-02-16 Thread Prabakaran, Vaishnavi
On 22 November 2016 at 18:32, Craig Ringer wrote:
> On 22 November 2016 at 15:14, Haribabu Kommi
>  wrote:
> >
> > On Fri, Nov 18, 2016 at 7:18 PM, Craig Ringer 
> wrote:
> >>
> >> The latest is what's attached upthread and what's in the git repo
> >> also referenced upthread.
> >>
> >> I haven't been able to update in response to more recent review due
> >> to other development commitments. At this point I doubt I'll be able
> >> to get update it again in time for v10, so anyone who wants to adopt
> >> it is welcome.
> >
> >
> > Currently patch status is marked as "returned with feedback" in the
> > 11-2016 commitfest. Anyone who wants to work on it can submit the
> > updated patch by taking care of all review comments and change the
> > status of the patch at any time.
> >
> > Thanks for the patch.
>
> Thanks. Sorry I haven't had time to work on it. Priorities.
Hi,
I am interested in this patch and have addressed the various review
comments below.  I have also separated the code and the test module into
two patches, so that, if needed, the test patch can be enhanced further
while the code patch is committed.

>Renaming and refactoring new APIs
> +PQisInBatchMode   172
>+PQqueriesInBatch  173
>+PQbeginBatchMode  174
>+PQendBatchMode175
>+PQsendEndBatch176
>+PQgetNextQuery177
>+PQbatchIsAborted  178
>This set of routines is a bit inconsistent. Why not just prefixing them with 
>PQbatch? Like that for example:
> PQbatchStatus(): consists of disabled/inactive/none, active, error. This 
> covers both PQbatchIsAborted() and PQisInBatchMode().
>PQbatchBegin()
>PQbatchEnd()
>PQbatchQueueSync() or PQbatchQueueFlush, same as PQsendEndBatch() to add and 
>process a sync message into the queue.

Renamed and modified batch status APIs as below
PQisInBatchMode & PQbatchIsAborted ==> PQbatchStatus
PQqueriesInBatch ==> PQbatchQueueCount
PQbeginBatchMode ==> PQbatchBegin
PQendBatchMode ==> PQbatchEnd
PQsendEndBatch ==> PQbatchQueueSync
PQgetNextQuery ==> PQbatchQueueProcess
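
For illustration, here is a very rough usage sketch pieced together from
the renamed APIs above.  The signatures, return conventions and required
control flow shown here are assumptions for illustration only, not text
taken from the patch (PQsendQueryParams, PQgetResult and PQclear are the
pre-existing libpq functions):

#include <libpq-fe.h>

static void
batch_sketch(PGconn *conn)
{
    PGresult *res;

    PQbatchBegin(conn);             /* enter batch (pipelined) mode */

    /* Queue two queries without waiting for their results. */
    PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
    PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);
    PQbatchQueueSync(conn);         /* add a Sync message to the queue */

    /* Drain: advance to each queued query and consume its results. */
    while (PQbatchQueueCount(conn) > 0)
    {
        if (!PQbatchQueueProcess(conn))
            break;                  /* assumed failure/not-ready indication */
        while ((res = PQgetResult(conn)) != NULL)
            PQclear(res);
    }

    PQbatchEnd(conn);               /* leave batch mode */
}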

>PQbatchQueueCount(): returns N>0 if there are N entries, 0 if empty,-1 on 
>failure
>PQbatchQueueProcess(): returns 1 if process can begin, 0 if not, -1 on failure 
>(OOM)
I think it is still OK to keep the current behaviour, like the other
functions present in the same file, e.g. "PQsendPrepare" and
"PQsendQueryGuts".

>PQqueriesInBatch() (Newname(NN):PQbatchQueueCount)doesn't work as documented.
>It says:
>"Returns the number of queries still in the queue for this batch"
>but in fact it's implemented as a Boolean.
Modified the logic to count the number of entries in the pending queue
and return that count.

>The changes in src/test/examples/ are not necessary anymore. You moved all the 
>tests to test_libpq (for the best actually).
Removed these unnecessary changes from the src/test/examples folder and
corrected the path mentioned in the comments section of testlibpqbatch.c.

> +   while (queue != NULL)
>+  {
>   PGcommandQueueEntry *prev = queue;
>+   queue = queue->next;
>+   free(prev);
>+   }
>This should free prev->query.
Both prev->query and prev are freed.  This also applies to
"cmd_queue_recycle".

>Running directly make check in src/test/modules/test_libpq/ does not work
Modified "check" rule in makefile

>You could just remove the VERBOSE flag in the tests, having a test more 
>talkative is always better.
Removed ifdef VERBOSE checks.

>But with the libpq batch API, maybe this could be modernized
>with meta-commands like this:
>\startbatch
>...
>\endbatch
I think it is a separate patch candidate.

> It is possible to guess each one of those errors(occurred in 
> PQbatchQueueProcess API) with respectively
> PQgetResult == NULL, PQisInBatchMode() and PQqueriesInBatch().
> Definitely it should be mentioned in the docs that it is possible to make a 
> difference between all those states.
Updated documentation section of PQbatchQueueProcess() with these details.

> +   entry = PQmakePipelinedCommand(conn);
>+   entry->queryclass = PGQUERY_SYNC;
>+   entry->query = NULL;
>PQmakePipelinedCommand() returns NULL, and boom.
Corrected to return false if PQmakePipelinedCommand() returns NULL.

> +   boolin_batch;   /* connection is in batch (pipelined) mode */
>+   boolbatch_aborted;  /* current batch is aborted, discarding until 
>next Sync */
>Having only one flag would be fine. batch_aborted is set of used only
>when in_batch is used, so both have a strong link
Yes, I agree that tracking the batch status via one flag is cleaner.  So
I added a new enum:

typedef enum
{
    PQBATCH_MODE_OFF,
    PQBATCH_MODE_ON,
    PQBATCH_MODE_ABORTED
} PQBatchStatus;

and a "PQBatchStatus batch_status" member of pg_conn is used to track the
status of batch mode.

>/* OK, it's launched! */
>-   conn->asyncStatus = PGASYNC_BUSY;
>+   if (conn->in_batch)
>+   PQappendPipelinedCommand(conn, pipeCmd);
>+   else
>+   

Re: [HACKERS] Parallel Index-only scan

2017-02-16 Thread Rafia Sabih
On Thu, Feb 16, 2017 at 9:25 PM, Amit Kapila 
wrote:
>
> Few comments:
>
> 1.
> + * ioss_PscanLen   This is needed for parallel index scan
>   * 
>   */
>  typedef struct IndexOnlyScanState
> @@ -1427,6 +1428,7 @@ typedef struct IndexOnlyScanState
>   IndexScanDesc ioss_ScanDesc;
>   Buffer ioss_VMBuffer;
>   long ioss_HeapFetches;
> + Size ioss_PscanLen; /* This is needed for parallel index scan */
>
> No need to mention same comment at multiple places.  I think you keep
> it on top of structure.  The explanation could be some thing like
> "size of parallel index only scan descriptor"
>

Fixed.

>
> 2.
> + node->ioss_ScanDesc->xs_want_itup = true;
>
> I think wherever you are initializing xs_want_itup, you should
> initialize ioss_VMBuffer as well.  Is there a reason for not doing so?
>

Done.

>
>
> 3.
> explain (costs off)
>   select  sum(parallel_restricted(unique1)) from tenk1
>   group by(parallel_restricted(unique1));
> - QUERY PLAN
> -
> +QUERY PLAN
> +---
>   HashAggregate
> Group Key: parallel_restricted(unique1)
> -   ->  Index Only Scan using tenk1_unique1 on tenk1
> -(3 rows)
> +   ->  Gather
> + Workers Planned: 4
> + ->  Parallel Index Only Scan using tenk1_unique1 on tenk1
> +(5 rows)
>
> It doesn't look good that you want to test parallel index only scan
> for a test case that wants to test restricted function.  The comments
> atop test looks odd. I suggest add a separate test (both explain and
> actual execution of query) for parallel index only scan as we have for
> parallel plans for other queries and see if we can keep the output of
> existing test same.
>

Agreed, but the objective of this test case is actually met even with
this plan.  To prevent a parallel index-only scan here, the query or
other parameters would have to be modified.  However, for proper code
coverage and completeness, I have added a test case for parallel
index-only scan.

>
> 4.
> ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
> {
> ..
> + /*
> + * if we are here to just update the scan keys, then don't reset parallel
> + * scan
> + */
> + if (node->ioss_NumRuntimeKeys != 0 && !node->ioss_RuntimeKeysReady)
> + reset_parallel_scan = false;
> ..
> }
>
> I think here you can update the comment to indicate that for detailed
> reason refer ExecReScanIndexScan.
>

Done.
Please find attached the revised version of the patch.
Just as an FYI, in my recent tests on TPC-H at scale factor 300, Q16's
execution time improved from 830 seconds to 730 seconds with this patch
when used with the parallel merge-join patch [1].

[1]
https://www.postgresql.org/message-id/CAFiTN-tX3EzDw7zYvi97eNADG9PH-nmhLa24Y3uWdzy_Y4SkfQ%40mail.gmail.com
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


parallel_index_only_v8.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Peter Geoghegan
On Thu, Feb 16, 2017 at 8:21 PM, Amit Langote
 wrote:
> would be working on a leaf partition chosen by tuple-routing after an
> insert on a partitioned table.  The leaf partitions can very well have a
> unique index, which can be used for inference.  The problem however is
> that infer_arbiter_indexes() in the optimizer would be looking at the root
> partitioned, which cannot yet have any indexes defined on them, let alone
> unique indexes.  When we develop a feature where defining an index on the
> root partitioned table would create the same index on all the leaf
> partitions and then extend it to support unique indexes, then we can
> perhaps talk about supporting ON CONFLICT handing.  Does that make sense?

Yes, that makes sense, but I wasn't arguing that that should be
possible today. I was arguing that when you don't spell out an
arbiter, which ON CONFLICT DO NOTHING permits, then it should be
possible for it to just work today -- infer_arbiter_indexes() will
return immediately.

This should be just like the old approach involving inheritance, in
that that should be possible. No?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Amit Langote
On 2017/02/17 1:17, Peter Geoghegan wrote:
> But surely it should be possible to use DO NOTHING without inferring some
> particular unique index? That's possible with an approach based on
> inheritance.

Hmm.  Code after the following comment fragment in ExecInsert():

 * Do a non-conclusive check for conflicts first.

would be working on a leaf partition chosen by tuple-routing after an
insert on a partitioned table.  The leaf partitions can very well have a
unique index, which can be used for inference.  The problem however is
that infer_arbiter_indexes() in the optimizer would be looking at the root
partitioned table, which cannot yet have any indexes defined on it, let
alone unique indexes.  When we develop a feature where defining an index
on the root partitioned table creates the same index on all the leaf
partitions, and then extend it to support unique indexes, then we can
perhaps talk about supporting ON CONFLICT handling.  Does that make sense?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Implement custom join algorithm

2017-02-16 Thread Tom Lane
Amin Fallahi  writes:
> I want to implement my custom join algorithm into postgres. Where should I
> start? Where in the code base and how the current algorithms are currently
> implemented?

TBH, if this is your first Postgres programming project, you might
want to set your sights a bit lower than a new join algorithm.
Tackling something a little less far-reaching would be a good way
of starting to learn your way around the codebase.

But in any case, start here:
https://www.postgresql.org/docs/devel/static/overview.html
and after you've read that, start poking around the codebase.
There are many README files worth looking at, and a significant
part of what you'll need to do will amount to copying-and-pasting
existing code, so first you want to get familiar with the existing
code that does something like what you want.

You might for instance want to model your code on merge join,
src/backend/executor/nodeMergejoin.c
or hash join,
src/backend/executor/nodeHash.c
src/backend/executor/nodeHashjoin.c
(the separation between Hash and HashJoin nodes is a bit artificial
IMO, but it's been there since Berkeley days), or good ol' nestloop,
src/backend/executor/nodeNestloop.c

You'll also need planner support, the core of which would be additions
in these files:
src/backend/optimizer/path/joinpath.c
src/backend/optimizer/path/costsize.c
although depending on how outré your algorithm is and how smart you
want the planner to be, you could need large amounts of additional
work in that area.  (As an example, the desire to avoid extra sorts
for mergejoin input has consequences throughout the planner, starting
with the choice to label every path node with its sort order.  For
that matter, it's hardly accidental that mergejoin depends on btree
operator classes and hashjoin depends on hash operator classes ---
you might have to go as far as developing new types of operator classes
to embody whatever datatype and operator knowledge you need.)

And you'll need a whole bunch of boilerplate support code, such as
copyfuncs.c support for your plan node type.  Grepping for references to
the path and plan node types for one of the existing join methods should
find the places where you need to add that, or at least most of them.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SUBSCRIPTIONS and pg_upgrade

2017-02-16 Thread Stephen Frost
Peter,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Peter Eisentraut (pete...@gmx.net) wrote:
> > Logical replication
> > 
> > - Add PUBLICATION catalogs and DDL
> > - Add SUBSCRIPTION catalog and DDL
> > - Define logical replication protocol and output plugin
> > - Add logical replication workers
> 
> I'm not entirely sure about the reasoning behind requiring a flag to
> include subscriptions in pg_dump output, as the documentation doesn't
> actually provide one, but if we are going to require that, shouldn't
> pg_upgrade use it, to make sure that the subscriptions are pulled
> forward to the upgraded cluster?
> 
> Also, we should probably discuss that default in pg_dump to not include
> something in the database by default as that's not something we've ever
> done before.  We made a very deliberate point to make sure that RLS
> didn't work by default with pg_dump to avoid the risk that we might not
> dump include everything in the database in the pg_dump output.  I agree
> that it's not exactly the same, but even so, I was surprised to find
> out that subscriptions aren't included by default and I doubt I'd be
> alone.
> 
> If this was all already discussed, I'm happy to go review the relevant
> thread(s).  I'll admit that I didn't follow all of that discussion very
> closely, I'm just going through parts of pg_dump which are not being
> tested in our regression tests currently and discovered that dumping out
> subscriptions is not tested at all.

We should probably also throw an error if --include-subscriptions and
--data-only are used together, instead of just not dumping out the
subscriptions in that case, which is what happens now.

Surely, if the user asked for subscriptions to be included, we should
either throw an error or actually include them.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Partitioned tables and relfilenode

2017-02-16 Thread Amit Langote
On 2017/02/15 16:14, Michael Paquier wrote:
> On Fri, Feb 10, 2017 at 3:19 PM, Amit Langote
>  wrote:
>> The new partitioned tables do not contain any data by themselves.  Any
>> data inserted into a partitioned table is routed to and stored in one of
>> its partitions.  In fact, it is impossible to insert *any* data before a
>> partition (to be precise, a leaf partition) is created.  It seems wasteful
>> then to allocate physical storage (files) for partitioned tables.  If we
>> do not allocate the storage, then we must make sure that the right thing
>> happens when a command that is intended to manipulate a table's storage
>> encounters a partitioned table, the "right thing" here being that the
>> command's code either throws an error or warning (in some cases) if the
>> specified table is a partitioned table or ignores any partitioned tables
>> when it reads the list of relations to process from pg_class.  Commands
>> that need to be taught about this are vacuum, analyze, truncate, and alter
>> table.  Specifically:
> 
> Thanks. I have been looking a bit at this set of patches.

Thanks for reviewing!

>> - In case of vacuum, specifying a partitioned table causes a warning
>> - In case of analyze, we do not throw an error or warning but simply
>>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>>   acquire_inherited_sample_rows(), any partitioned tables in the list
>>   returned by find_all_inheritors() are skipped.
>> - In case of truncate, only the part which manipulates table's physical
>>   storage is skipped for partitioned tables.
> 
> I am wondering if instead those should not just be errors saying that
> such operations are just not support. This could be relaxed in the
> future to mean that a vacuum, truncate or analyze on the partitioned
> table triggers an operator in cascade.

As the discussion down-thread suggests, that might be a path worth
considering.  That is, vacuum or analyze on a partitioned table causes all
the (leaf) partitions to be vacuumed or analyzed.

Truncate already does that (that is, recurse to leaf partitions).  This
patch simply prevents ExecuteTruncate() from trying to manipulate the
partitioned table's relfilenode.

>> - Since we cannot create indexes on partitioned tables anyway, there is
>>   no need to handle cluster and reindex (they throw a meaningful error
>>   already due to the lack of indexes.)
> 
> Yep.

This one is still under discussion.

>> Patches 0001 and 0002 of the attached implement the above part.  0001
>> teaches the above mentioned commands to do the "right thing" as described
>> above and 0002 teaches heap_create() and heap_create_with_catalog() to not
>> create any physical storage (none of the forks) for partitioned tables.
> 
> -   else
> +   /*
> +* Although we cannot analyze partitioned tables themselves, we are
> +* still be able to do the recursive ANALYZE.
> +*/
> +   else if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
> {
> /* No need for a WARNING if we already complained during VACUUM */
> if (!(options & VACOPT_VACUUM))
> It seems to me that it is better style to use an empty else if with
> only the comment. Comments related to a condition that are outside
> this condition should be conditional in their formulations. Comments
> within a condition can be affirmations when they refer to a decided
> state of things.

Not sure if this will survive based on the decisions on what to do about
vacuum and analyze, but how about the following rewrite of the above:

    else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        /*
         * Although we cannot analyze partitioned tables themselves, we are
         * still able to do the recursive ANALYZE, which we do below.
         */
    }
    else
    {

>>From patch 2:
> @@ -1351,7 +1352,12 @@ heap_create_with_catalog(const char *relname,
>relkind == RELKIND_TOASTVALUE ||
>relkind == RELKIND_PARTITIONED_TABLE);
> 
> -   heap_create_init_fork(new_rel_desc);
> +   /*
> +* We do not want to create any storage objects for a partitioned
> +* table.
> +*/
> +   if (relkind != RELKIND_PARTITIONED_TABLE)
> +   heap_create_init_fork(new_rel_desc);
> }
> This does not make much sense with an assertion telling exactly the
> contrary a couple of lines before. I think that it would make more
> sense to move the assertion on relkind directly in
> heap_create_init_fork().

Hmm, perhaps the following is saner, with the Assert moved to the start
of heap_create_init_fork() as shown further below:

    /*
     * We do not want to create any storage objects for a partitioned
     * table, including the init fork.
     */
    if (relpersistence == RELPERSISTENCE_UNLOGGED &&
        relkind != RELKIND_PARTITIONED_TABLE)
        heap_create_init_fork(new_rel_desc);


@@ -1382,6 +1376,8 @@ heap_create_with_catalog(const char *relname,
 void
 heap_create_init_fork(Relation rel)
 {
+Assert(relkind == 

[HACKERS] SUBSCRIPTIONS and pg_upgrade

2017-02-16 Thread Stephen Frost
Peter,

* Peter Eisentraut (pete...@gmx.net) wrote:
> Logical replication
> 
> - Add PUBLICATION catalogs and DDL
> - Add SUBSCRIPTION catalog and DDL
> - Define logical replication protocol and output plugin
> - Add logical replication workers

I'm not entirely sure about the reasoning behind requiring a flag to
include subscriptions in pg_dump output, as the documentation doesn't
actually provide one, but if we are going to require that, shouldn't
pg_upgrade use it, to make sure that the subscriptions are pulled
forward to the upgraded cluster?

Also, we should probably discuss having pg_dump not include something in
the database by default, as that's not something we've ever done before.
We made a very deliberate point of making sure that RLS didn't work by
default with pg_dump, to avoid the risk that we might not include
everything in the database in the pg_dump output.  I agree that it's not
exactly the same, but even so, I was surprised to find out that
subscriptions aren't included by default and I doubt I'd be alone.

If this was all already discussed, I'm happy to go review the relevant
thread(s).  I'll admit that I didn't follow all of that discussion very
closely, I'm just going through parts of pg_dump which are not being
tested in our regression tests currently and discovered that dumping out
subscriptions is not tested at all.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Thomas Munro
On Fri, Feb 17, 2017 at 11:13 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane  wrote:
>>> No, it'd be the *most* common MCV, because we're concerned about the
>>> worst-case (largest) bucket size.  But that's good, really, because the
>>> highest MCV frequency will be the one we have most statistical
>>> confidence in.  There's generally a whole lot of noise in the tail-end
>>> MCV numbers.
>
>> Oh, right.  That's reassuring, as it seems like it has a much better
>> chance of actually being right.
>
> Here's a version that does it that way.  Unsurprisingly, it doesn't
> cause any regression test changes, but you can confirm it's having an
> effect with this test case:
>
> create table tt(f1 int);
> insert into tt select 1 from generate_series(1,100) g;
> insert into tt select g from generate_series(1,100) g;
> analyze tt;
> explain select * from tt a natural join tt b;
>
> Unpatched code will go for a hash join on this example.

+1

By strange coincidence, I was about to propose something along these
lines on theoretical grounds, having spent a bunch of time studying
the hash join code recently.  It makes a lot of sense to use
statistics to try to avoid the "fail" (ie fail to respect work_mem,
and maybe fail to complete: maybe better called "overflow" or
"explode") mode during planning.

I have been wondering about a couple of different worst case execution
strategies that would be better than throwing our hands up and
potentially exploding memory once we detect that further partitioning
is not going to help, if we still manage to reach that case despite
adding stats-based defences like this due to statistics being missing,
bad or confused by joins below us.

1.  We could probe the fraction of the hash table that we have managed
to load into work_mem so far and then rewind the outer batch and do it
again for the next work_mem-sized fraction of the inner batch and so
on.  For outer joins we'd need to scan for unmatched tuples after each
hash table refill.  If we detect this condition during the initial
hash build (as opposed to a later inner batch->hash table load), we'd
need to disable the so called 'hybrid' optimisation and fall back to
the so called 'Grace' hash join; that is, we'd need to pull in the
whole outer relation and write it out to batches before we even begin
probing batch 0, so that we have the ability to rewind outer batch 0
for another pass.  When doing extra passes of an outer batch file, we
have to make sure that we don't do the 'send this tuple to a future
batch' behaviour if the number of batches happens to have increased.
Modulo some details, and I may be missing something fundamental here
(maybe breaks in some semi/anti case?)...

2.  We could just abandon hash join for this batch.  "She cannae take
it, captain", so sort inner and outer batches and merge-join them
instead.  Same comment re switching to Grace hash join if this happens
while loading inner batch 0; we'll need a complete inner batch 0 and
outer batch 0, so we can't use the hybrid optimisation.

Obviously there are vanishing returns here as we add more defences
making it increasingly unlikely that we hit "fail" mode.  But it
bothers me that hash joins in general are not 100% guaranteed to be
able to complete unless you have infinite RAM.
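
To make idea (1) a bit more concrete, here is a toy, stand-alone
illustration of the multi-pass refill (plain C, integer keys, inner join
only; this is not PostgreSQL code, and the unmatched-tuple scans needed
for outer joins are omitted):

/*
 * Toy version of idea (1): when one batch's inner side doesn't fit in
 * memory, load it a "work_mem"-sized fraction at a time and rescan the
 * rewindable outer batch once per fraction.
 */
#include <stdio.h>
#include <stdlib.h>

#define HASH_CAPACITY 4         /* pretend work_mem holds only 4 inner rows */
#define NBUCKETS      8

typedef struct HashEntry { int key; struct HashEntry *next; } HashEntry;

static HashEntry *buckets[NBUCKETS];

static void hash_insert(int key)
{
    HashEntry *e = malloc(sizeof(HashEntry));
    e->key = key;
    e->next = buckets[key % NBUCKETS];
    buckets[key % NBUCKETS] = e;
}

static void hash_reset(void)
{
    for (int i = 0; i < NBUCKETS; i++)
    {
        HashEntry *e = buckets[i];
        while (e) { HashEntry *next = e->next; free(e); e = next; }
        buckets[i] = NULL;
    }
}

int main(void)
{
    int inner[] = {1, 1, 1, 1, 1, 1, 2, 3};     /* many duplicates of key 1 */
    int outer[] = {1, 2, 4};
    int ninner = 8, nouter = 3;

    /* Load the inner batch one memory-sized fraction at a time ... */
    for (int lo = 0; lo < ninner; lo += HASH_CAPACITY)
    {
        int hi = lo + HASH_CAPACITY < ninner ? lo + HASH_CAPACITY : ninner;

        hash_reset();
        for (int i = lo; i < hi; i++)
            hash_insert(inner[i]);

        /* ... and rewind and rescan the whole outer batch per fraction. */
        for (int j = 0; j < nouter; j++)
            for (HashEntry *e = buckets[outer[j] % NBUCKETS]; e; e = e->next)
                if (e->key == outer[j])
                    printf("match: outer=%d inner=%d\n", outer[j], e->key);
    }
    return 0;
}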

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Implement custom join algorithm

2017-02-16 Thread Amin Fallahi
Hi

I want to implement my custom join algorithm in Postgres. Where should I
start? Where in the code base are the current join algorithms implemented,
and how do they work?


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Tom Lane
Robert Haas  writes:
> On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane  wrote:
>> No, it'd be the *most* common MCV, because we're concerned about the
>> worst-case (largest) bucket size.  But that's good, really, because the
>> highest MCV frequency will be the one we have most statistical
>> confidence in.  There's generally a whole lot of noise in the tail-end
>> MCV numbers.

> Oh, right.  That's reassuring, as it seems like it has a much better
> chance of actually being right.

Here's a version that does it that way.  Unsurprisingly, it doesn't
cause any regression test changes, but you can confirm it's having an
effect with this test case:

create table tt(f1 int);
insert into tt select 1 from generate_series(1,100) g;
insert into tt select g from generate_series(1,100) g;
analyze tt;
explain select * from tt a natural join tt b;

Unpatched code will go for a hash join on this example.
 

For application to the back branches, we could do it just like this
(leaving the existing fields alone, and allowing sizeof(RestrictInfo)
to grow), or we could change the datatypes of the four fields involved
to float4 so that sizeof(RestrictInfo) stays the same.  I'm not entirely
sure which way is the more hazardous from an ABI standpoint.  If there
are any external modules doing their own palloc(sizeof(RestrictInfo))
calls, the first way would be bad, but really there shouldn't be; I'd
expect people to be using make_restrictinfo() and friends.  (Note that
palloc's power-of-2 padding wouldn't save us, because sizeof(RestrictInfo)
is currently exactly 128 on 32-bit machines in several of the back
branches.)  Conversely, if any non-core code is touching the bucketsize
fields, changing those field widths would break that; but that doesn't
seem all that likely either.  On balance I think I might go for the first
way, because it would remove doubt about whether reducing the precision
of the bucketsize estimates would cause any unexpected plan changes.
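
To illustrate the two options, roughly (a sketch only, not the actual
RestrictInfo declaration; Selectivity is a double in the real node tree):

typedef double Selectivity;

/* Option 1: append the new fields, letting sizeof() grow by 16 bytes. */
typedef struct OptionAppendFields
{
    Selectivity left_bucketsize;
    Selectivity right_bucketsize;
    Selectivity left_mcvfreq;       /* new */
    Selectivity right_mcvfreq;      /* new */
} OptionAppendFields;

/*
 * Option 2: narrow all four fields to float4, so 4 x 4 bytes replaces the
 * original 2 x 8 bytes and sizeof(RestrictInfo) stays the same.
 */
typedef struct OptionNarrowFields
{
    float       left_bucketsize;
    float       right_bucketsize;
    float       left_mcvfreq;
    float       right_mcvfreq;
} OptionNarrowFields;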

Or we could decide not to back-patch because the problem doesn't come
up often enough to justify taking any risk for.  But given that we've
gotten one confirmed field report, I'm not voting that way.

regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 05d8538..c2cdd79 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*** _copyRestrictInfo(const RestrictInfo *fr
*** 2070,2075 
--- 2070,2077 
  	COPY_SCALAR_FIELD(hashjoinoperator);
  	COPY_SCALAR_FIELD(left_bucketsize);
  	COPY_SCALAR_FIELD(right_bucketsize);
+ 	COPY_SCALAR_FIELD(left_mcvfreq);
+ 	COPY_SCALAR_FIELD(right_mcvfreq);
  
  	return newnode;
  }
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d01630f..f88c1c5 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2800,2805 
--- 2800,2806 
  	double		hashjointuples;
  	double		virtualbuckets;
  	Selectivity innerbucketsize;
+ 	Selectivity innermcvfreq;
  	ListCell   *hcl;
  
  	/* Mark the path with the correct row estimate */
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2827,2835 
  	virtualbuckets = (double) numbuckets *(double) numbatches;
  
  	/*
! 	 * Determine bucketsize fraction for inner relation.  We use the smallest
! 	 * bucketsize estimated for any individual hashclause; this is undoubtedly
! 	 * conservative.
  	 *
  	 * BUT: if inner relation has been unique-ified, we can assume it's good
  	 * for hashing.  This is important both because it's the right answer, and
--- 2828,2836 
  	virtualbuckets = (double) numbuckets *(double) numbatches;
  
  	/*
! 	 * Determine bucketsize fraction and MCV frequency for the inner relation.
! 	 * We use the smallest bucketsize or MCV frequency estimated for any
! 	 * individual hashclause; this is undoubtedly conservative.
  	 *
  	 * BUT: if inner relation has been unique-ified, we can assume it's good
  	 * for hashing.  This is important both because it's the right answer, and
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2837,2850 
--- 2838,2856 
  	 * non-unique-ified paths.
  	 */
  	if (IsA(inner_path, UniquePath))
+ 	{
  		innerbucketsize = 1.0 / virtualbuckets;
+ 		innermcvfreq = 0.0;
+ 	}
  	else
  	{
  		innerbucketsize = 1.0;
+ 		innermcvfreq = 1.0;
  		foreach(hcl, hashclauses)
  		{
  			RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(hcl);
  			Selectivity thisbucketsize;
+ 			Selectivity thismcvfreq;
  
  			Assert(IsA(restrictinfo, RestrictInfo));
  
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2853,2860 
  			 * is the inner side.
  			 *
  			 * Since we tend to visit the same clauses over and over when
! 			 * planning a large query, we cache the bucketsize estimate in the
! 			 * RestrictInfo node to 

[HACKERS] [PATCH] Add pg_disable_checksums() and supporting infrastructure

2017-02-16 Thread David Christensen
Extracted from a larger patch, this patch provides the basic infrastructure
for turning data checksums off in a cluster.  This also sets up the necessary
pg_control fields to support the multiple states needed for handling the
transitions.

We do a few things:

- Change "data_checksums" from a simple boolean to "data_checksum_state", an 
enum type for all of
  the potentially-required states for this feature (as well as enabling).

- Add pg_control support for parsing/displaying the new setting.

- Distinguish between the possible checksum states and be specific about 
whether we care about
  checksum read failures or write failures at all call sites, turning 
DataChecksumsEnabled() into two
  functions: DataChecksumsEnabledReads() and DataChecksumsEnabledWrites().

- Create a superuser function pg_disable_checksums() to perform the actual 
disabling of this in the
  cluster.

I have *not* changed the default in initdb to enable checksums, but this would 
be trivial.
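
To sketch the shape being described (names follow the text above; the exact
set of states and their semantics here are illustrative assumptions, not the
patch's definitions):

#include <stdbool.h>

typedef enum DataChecksumState
{
    DATA_CHECKSUMS_OFF,
    DATA_CHECKSUMS_INPROGRESS,      /* transitional, e.g. while enabling */
    DATA_CHECKSUMS_ON
} DataChecksumState;

/*
 * The read-side and write-side checks are split because, during a
 * transition, we may need to keep writing checksums while not yet
 * treating verification failures as errors.
 */
static inline bool
DataChecksumsEnabledWrites(DataChecksumState state)
{
    return state != DATA_CHECKSUMS_OFF;
}

static inline bool
DataChecksumsEnabledReads(DataChecksumState state)
{
    return state == DATA_CHECKSUMS_ON;
}

Disabling itself is then just "SELECT pg_disable_checksums();" as a superuser,
per the description above.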




disable-checksums.patch
Description: Binary data



--
David Christensen
End Point Corporation
da...@endpoint.com
785-727-1171




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane  wrote:
>>> I initially thought about driving the shutoff strictly from the estimate
>>> of the MCV frequency, without involving the more general ndistinct
>>> computation that estimate_hash_bucketsize does.  I'm not sure how much
>>> that would do for your concern, but at least the MCV frequency doesn't
>>> involve quite as much extrapolation as ndistinct.
>
>> Hmm, so we could do something like: if the estimated frequency of the
>> least-common MCV is enough to make one bucket overflow work_mem, then
>> don't use a hash join?  That would still be prone to some error (in
>> both directions, really) but it seems less likely to spit out
>> completely stupid results than relying on ndistinct, which never gets
>> very big even in a 10TB table.
>
> No, it'd be the *most* common MCV, because we're concerned about the
> worst-case (largest) bucket size.  But that's good, really, because the
> highest MCV frequency will be the one we have most statistical
> confidence in.  There's generally a whole lot of noise in the tail-end
> MCV numbers.

Oh, right.  That's reassuring, as it seems like it has a much better
chance of actually being right.

> Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
> stats.  That would be an expected case if the column is believed unique,
> and it's probably a better fallback behavior when we simply don't have
> stats.  With the ndistinct-based rule, we'd be shutting off hashjoin
> almost always when we don't have stats.  Given how long it took us
> to recognize this problem, that's probably the wrong default.

Right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Tom Lane
Robert Haas  writes:
> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane  wrote:
>> I initially thought about driving the shutoff strictly from the estimate
>> of the MCV frequency, without involving the more general ndistinct
>> computation that estimate_hash_bucketsize does.  I'm not sure how much
>> that would do for your concern, but at least the MCV frequency doesn't
>> involve quite as much extrapolation as ndistinct.

> Hmm, so we could do something like: if the estimated frequency of the
> least-common MCV is enough to make one bucket overflow work_mem, then
> don't use a hash join?  That would still be prone to some error (in
> both directions, really) but it seems less likely to spit out
> completely stupid results than relying on ndistinct, which never gets
> very big even in a 10TB table.

No, it'd be the *most* common MCV, because we're concerned about the
worst-case (largest) bucket size.  But that's good, really, because the
highest MCV frequency will be the one we have most statistical
confidence in.  There's generally a whole lot of noise in the tail-end
MCV numbers.

Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
stats.  That would be an expected case if the column is believed unique,
and it's probably a better fallback behavior when we simply don't have
stats.  With the ndistinct-based rule, we'd be shutting off hashjoin
almost always when we don't have stats.  Given how long it took us
to recognize this problem, that's probably the wrong default.
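
For what it's worth, the number such a rule would key on is easy to look at;
for the tt/f1 test case used elsewhere in this thread:

select most_common_vals, most_common_freqs
from pg_stats
where tablename = 'tt' and attname = 'f1';

When the column is believed unique, that query returns NULLs, which is
exactly the no-MCV case discussed above.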

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small issue in online devel documentation build

2017-02-16 Thread Fabien COELHO



My point is that some xsl file should be fixed somewhere, probably, to avoid
that.


Yeah, I noticed this too, and I think Ashutosh also complained about
it on another thread.  It's pretty annoying - I hope somebody can
figure out the cause and fix it.  It seems to be only affecting the
development docs.


My 0.02€:

After some random digging, it seems that the documentation files are 
loaded from provided tarballs by a script: "tools/docs/docload.py" 
maintained in "pgweb.git". Maybe this is run from some cron.


AFAICS from the script, the provided tarballs contain the generated html 
that someone/something has already generated and put somewhere. I have 
found no clear clue about who, where and when. However I would bet that 
"Magnus Hagander" and "Dave Page" know, given that their commits suggest 
that they are maintaining the site :-)


The load script passes the doc through "tidy" with some options, and 
registers the pages in a database, probably for Django. I doubt this 
would change the ul class.


--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 10:54 AM, Dilip Kumar  wrote:
> interface_dsa_allocate0.patch : Provides new interface dsa_allocate0,
> which is used by 0001

Committed.  I moved the function prototype for dsa_allocate0() next to
the existing prototype for dsa_allocate().  Please try to think about
things like this when you add functions or prototypes or structure
members: put them in a natural place where they are close to related
things, not just wherever happens to be convenient.

> gather_shutdown_childeren_first.patch : Processing the child node

This same problem came up on the Parallel Hash thread.  I mentioned
this proposed fix over there; let's see what he (or anyone else) has
to say about it.

> 0001: tidbimap change to provide the interfaces for shared iterator.

+ * in order to share relptrs of the chunk and the pages arrays and other
+ * TBM related information with other workers.

No relptrs any more.

+    dsa_pointer dsapagetable;     /* dsa_pointer to the element array */
+    dsa_pointer dsapagetableold;  /* dsa_pointer to the old element array */
+    dsa_area   *dsa;              /* reference to per-query dsa area */
+    dsa_pointer ptpages;          /* dsa_pointer to the page array */
+    dsa_pointer ptchunks;         /* dsa_pointer to the chunk array */

Let's put the DSA pointer first and then the other stuff after it.
That seems more logical.

+typedef struct TBMSharedIteratorState
+{
+    int         spageptr;       /* next spages index */
+    int         schunkptr;      /* next schunks index */
+    int         schunkbit;      /* next bit to check in current schunk */
+    LWLock      lock;           /* lock to protect above members */
+    dsa_pointer pagetable;      /* dsa pointers to head of pagetable data */
+    dsa_pointer spages;         /* dsa pointer to page array */
+    dsa_pointer schunks;        /* dsa pointer to chunk array */
+    int         nentries;       /* number of entries in pagetable */
+    int         maxentries;     /* limit on same to meet maxbytes */
+    int         npages;         /* number of exact entries in pagetable */
+    int         nchunks;        /* number of lossy entries in pagetable */
+} TBMSharedIteratorState;

I think you've got this largely backwards from the most logical order.
Let's order it like this: nentries, maxentries, npages, nchunks,
pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
chunkptr, schunkbit.
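
Spelled out, that suggested ordering for the struct quoted above would look
like this (same types and comments as the patch, just rearranged; it assumes
the PostgreSQL headers the patch already uses):

typedef struct TBMSharedIteratorState
{
    int         nentries;       /* number of entries in pagetable */
    int         maxentries;     /* limit on same to meet maxbytes */
    int         npages;         /* number of exact entries in pagetable */
    int         nchunks;        /* number of lossy entries in pagetable */
    dsa_pointer pagetable;      /* dsa pointers to head of pagetable data */
    dsa_pointer spages;         /* dsa pointer to page array */
    dsa_pointer schunks;        /* dsa pointer to chunk array */
    LWLock      lock;           /* lock to protect the members below */
    int         spageptr;       /* next spages index */
    int         schunkptr;      /* next schunks index */
    int         schunkbit;      /* next bit to check in current schunk */
} TBMSharedIteratorState;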

+struct TBMSharedIterator
+{
+    PagetableEntry *ptbase;     /* pointer to the pagetable element array */
+    dsa_area   *dsa;            /* reference to per-query dsa area */
+    int        *ptpages;        /* sorted exact page index list */
+    int        *ptchunks;       /* sorted lossy page index list */
+    TBMSharedIteratorState *state;  /* shared members */
+    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
+};

Do we really need "dsa" here when it's already present in the shared
state?  It doesn't seem like we even use it for anything.  It's needed
to create each backend's TBMSharedIterator, but not afterwards, I
think.

In terms of ordering, I'd move "state" to the top of the structure,
just like "tbm" comes first in a TBMIterator.

+ * total memory consumption.  If dsa passed to this function is not NULL
+ * then the memory for storing elements of the underlying page table will
+ * be allocated from the DSA.

Notice that you capitalized "DSA" in two different ways in the same
sentence; I'd go for the all-caps version.  Also, you need the word
"the" before the first one.

+if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))

Excess parentheses.

+ * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
+ * by multiple workers using shared iterator.

Prepare to iterate, not prepare to iterator.  I suggest rephrasing
this as "prepare shared iteration state for a TIDBitmap".

+ * The TBMSharedIteratorState will be allocated from DSA so that multiple
+ * worker can attach to it and iterate jointly.

Maybe change to "The necessary shared state will be allocated from the
DSA passed to tbm_create, so that multiple workers can attach to it
and iterate jointly".

+ * This will convert the pagetable hash into page and chunk array of the index
+ * into pagetable array so that multiple workers can use it to get the actual
+ * page entry.

I think you can leave off everything starting from "so that".  It's
basically redundant with what you already said.

+dsa_pointer iterator;
+TBMSharedIteratorState *iterator_state;

These aren't really great variable names, because the latter isn't the
state associated with the former.  They're both the same object.
Maybe just "dp" and "istate".

I think this function should Assert(tbm->dsa != NULL) and
Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
tbm_begin_iterate should Assert(tbm->iterating != 

Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane  wrote:
>>> This might be overly aggressive, because it will pretty much shut off
>>> any attempt to use hash joining on a large inner relation unless we
>>> have statistics for it (and those stats are favorable).  But having
>>> seen this example, I think we need to be worried.
>
>> I do think that's worrying, but on the other hand it seems like this
>> solution could disable many hash joins that would actually be fine.  I
>> don't think the largest ndistinct estimates we ever generate are very
>> large, and therefore this seems highly prone to worry even when
>> worrying isn't really justified.
>
> I initially thought about driving the shutoff strictly from the estimate
> of the MCV frequency, without involving the more general ndistinct
> computation that estimate_hash_bucketsize does.  I'm not sure how much
> that would do for your concern, but at least the MCV frequency doesn't
> involve quite as much extrapolation as ndistinct.

Hmm, so we could do something like: if the estimated frequency of the
least-common MCV is enough to make one bucket overflow work_mem, then
don't use a hash join?  That would still be prone to some error (in
both directions, really) but it seems less likely to spit out
completely stupid results than relying on ndistinct, which never gets
very big even in a 10TB table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Tom Lane
Robert Haas  writes:
> On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane  wrote:
>> This might be overly aggressive, because it will pretty much shut off
>> any attempt to use hash joining on a large inner relation unless we
>> have statistics for it (and those stats are favorable).  But having
>> seen this example, I think we need to be worried.

> I do think that's worrying, but on the other hand it seems like this
> solution could disable many hash joins that would actually be fine.  I
> don't think the largest ndistinct estimates we ever generate are very
> large, and therefore this seems highly prone to worry even when
> worrying isn't really justified.

I initially thought about driving the shutoff strictly from the estimate
of the MCV frequency, without involving the more general ndistinct
computation that estimate_hash_bucketsize does.  I'm not sure how much
that would do for your concern, but at least the MCV frequency doesn't
involve quite as much extrapolation as ndistinct.

There's a practical problem from final_cost_hashjoin's standpoint,
which is that it has noplace to cache the MCV frequency separately from
estimate_hash_bucketsize's output.  In HEAD we could just add some more
fields to RestrictInfo, but that would be an unacceptable ABI break in
the back branches.  Maybe we could get away with replacing the float8
bucketsize fields with two float4 fields --- it seems unlikely that we
need more than 6 digits of precision for these numbers, and I doubt any
extensions are touching the bucketsize fields.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Peter Geoghegan
On Thu, Feb 16, 2017 at 11:11 AM, Robert Haas  wrote:
> I do think that's worrying, but on the other hand it seems like this
> solution could disable many hash joins that would actually be fine.  I
> don't think the largest ndistinct estimates we ever generate are very
> large, and therefore this seems highly prone to worry even when
> worrying isn't really justified.

+1. ndistinct has a general tendency to be wrong, owing to how ANALYZE
works, which we see problems with from time to time.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 2:02 PM, Tom Lane  wrote:
> The planner doesn't currently worry about work_mem restrictions when
> planning a hash join, figuring that the executor should be able to
> subdivide the data arbitrarily finely by splitting buckets at runtime.
> However there's a thread here:
> https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com
> exhibiting a case where a hash join was chosen even though a single
> value accounts for three-quarters of the inner relation.  Bucket
> splitting obviously can never separate multiple instances of the
> same value, so this choice forced the executor to try to load
> three-quarters of the (very large) inner relation into memory at once;
> unsurprisingly, it failed.
>
> To fix this, I think we need to discourage use of hash joins whenever
> a single bucket is predicted to exceed work_mem, as in the attached
> draft patch.  The patch results in changing from hash to merge join
> in one regression test case, which is fine; that case only cares about
> the join order not the types of the joins.
>
> This might be overly aggressive, because it will pretty much shut off
> any attempt to use hash joining on a large inner relation unless we
> have statistics for it (and those stats are favorable).  But having
> seen this example, I think we need to be worried.

I do think that's worrying, but on the other hand it seems like this
solution could disable many hash joins that would actually be fine.  I
don't think the largest ndistinct estimates we ever generate are very
large, and therefore this seems highly prone to worry even when
worrying isn't really justified.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-02-16 Thread Tom Lane
The planner doesn't currently worry about work_mem restrictions when
planning a hash join, figuring that the executor should be able to
subdivide the data arbitrarily finely by splitting buckets at runtime.
However there's a thread here:
https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail.gmail.com
exhibiting a case where a hash join was chosen even though a single
value accounts for three-quarters of the inner relation.  Bucket
splitting obviously can never separate multiple instances of the
same value, so this choice forced the executor to try to load
three-quarters of the (very large) inner relation into memory at once;
unsurprisingly, it failed.
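
To put a made-up number on that (not figures from the referenced thread):
with 100 million inner rows of 50 bytes each and one value accounting for
75% of them, the bucket holding that value can never shrink below about

    0.75 * 100,000,000 * 50 bytes ~ 3.75 GB

no matter how many batches or buckets the executor splits the table into.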

To fix this, I think we need to discourage use of hash joins whenever
a single bucket is predicted to exceed work_mem, as in the attached
draft patch.  The patch results in changing from hash to merge join
in one regression test case, which is fine; that case only cares about
the join order not the types of the joins.

This might be overly aggressive, because it will pretty much shut off
any attempt to use hash joining on a large inner relation unless we
have statistics for it (and those stats are favorable).  But having
seen this example, I think we need to be worried.

I'm inclined to treat this as a bug and back-patch it, but I wonder if
anyone is concerned about possible plan destabilization in the back
branches.

regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d01630f..9170b92 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2800,2805 
--- 2800,2806 
  	double		hashjointuples;
  	double		virtualbuckets;
  	Selectivity innerbucketsize;
+ 	double		inner_bucket_rows;
  	ListCell   *hcl;
  
  	/* Mark the path with the correct row estimate */
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2894,2899 
--- 2895,2914 
  	}
  
  	/*
+ 	 * If even a single bucket would exceed work_mem, we don't want to hash
+ 	 * unless there is really no other alternative, so apply disable_cost.
+ 	 * (Because estimate_hash_bucketsize's estimate is mostly driven by the
+ 	 * MCV frequency, this condition suggests that the executor will be unable
+ 	 * to drive the batch size below work_mem no matter how much it splits
+ 	 * buckets: splitting cannot separate values that are equal.)
+ 	 */
+ 	inner_bucket_rows = clamp_row_est(inner_path_rows * innerbucketsize);
+ 	if (relation_byte_size(inner_bucket_rows,
+ 		   inner_path->pathtarget->width) >
+ 		(work_mem * 1024L))
+ 		startup_cost += disable_cost;
+ 
+ 	/*
  	 * Compute cost of the hashquals and qpquals (other restriction clauses)
  	 * separately.
  	 */
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2964,2970 
  		 */
  		startup_cost += hash_qual_cost.startup;
  		run_cost += hash_qual_cost.per_tuple * outer_path_rows *
! 			clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
  
  		/*
  		 * Get approx # tuples passing the hashquals.  We use
--- 2979,2985 
  		 */
  		startup_cost += hash_qual_cost.startup;
  		run_cost += hash_qual_cost.per_tuple * outer_path_rows *
! 			inner_bucket_rows * 0.5;
  
  		/*
  		 * Get approx # tuples passing the hashquals.  We use
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4992048..b2270ca 100644
*** a/src/test/regress/expected/join.out
--- b/src/test/regress/expected/join.out
*** where not exists (
*** 2429,2461 
) a1 on t3.c2 = a1.c1
where t1.c1 = t2.c2
  );
!QUERY PLAN
! -
!  Hash Anti Join
!Hash Cond: (t1.c1 = t2.c2)
!->  Seq Scan on tt4x t1
!->  Hash
!  ->  Merge Right Join
!Merge Cond: (t5.c1 = t3.c2)
!->  Merge Join
!  Merge Cond: (t4.c2 = t5.c1)
!  ->  Sort
!Sort Key: t4.c2
!->  Seq Scan on tt4x t4
!  ->  Sort
!Sort Key: t5.c1
!->  Seq Scan on tt4x t5
!->  Sort
!  Sort Key: t3.c2
!  ->  Merge Left Join
!Merge Cond: (t2.c3 = t3.c1)
 ->  Sort
!  Sort Key: t2.c3
!  ->  Seq Scan on tt4x t2
 ->  Sort
!  Sort Key: t3.c1
!  ->  Seq Scan on tt4x t3
! (24 rows)
  
  --
  -- regression test for problems of the sort depicted in bug #3494
--- 2429,2465 
) a1 on t3.c2 = a1.c1
where t1.c1 = t2.c2
  );

Re: [HACKERS] COPY IN/BOTH vs. extended query mode

2017-02-16 Thread Corey Huinker
On Mon, Feb 13, 2017 at 4:29 PM, Craig Ringer 
wrote:

>
>
> On 14 Feb. 2017 06:15, "Robert Haas"  wrote:
>
> On Mon, Jan 23, 2017 at 9:12 PM, Robert Haas 
> wrote:
> > According to the documentation for COPY IN mode, "If the COPY command
> > was issued via an extended-query message, the backend will now discard
> > frontend messages until a Sync message is received, then it will issue
> > ReadyForQuery and return to normal processing."  I added a similar
> > note to the documentation for COPY BOTH mode in
> > 91fa8532f4053468acc08534a6aac516ccde47b7, and the documentation
> > accurately describes the behavior of the server.  However, this seems
> > to make fully correct error handling for clients using libpq almost
> > impossible, because PQsendQueryGuts() sends
> > Parse-Bind-Describe-Execute-Sync in one shot without regard to whether
> > the command that was just sent invoked COPY mode (cf. the note in
> > CopyGetData about why we ignore Flush and Sync in that function).
> >
> > So imagine that the client uses libpq to send (via the extended query
> > protocol) a COPY IN command (or some hypothetical command that starts
> > COPY BOTH mode to begin).  If the server throws an error before the
> > Sync message is consumed, it will bounce back to PostgresMain which
> > will set doing_extended_query_message = true after which it will
> > consume messages, find the Sync, reset that flag, and send
> > ReadyForQuery.  On the other hand, if the server enters CopyBoth mode,
> > consumes the Sync message in CopyGetData (or a similar function), and
> > *then* throws an ERROR, the server will wait for a second Sync message
> > from the client before issuing ReadyForQuery.  There is no sensible
> > way of coping with this problem in libpq, because there is no way for
> > the client to know which part of the server code consumed the Sync
> > message that it already sent.  In short, from the client's point of
> > view, if it enters COPY IN or COPY BOTH mode via the extend query
> > protocol, and an error occurs on the server, the server MAY OR MAY NOT
> > expect a further Sync message before issuing ReadyForQuery, and the
> > client has no way of knowing -- except maybe waiting for a while to
> > see what happens.
> >
> > It does not appear to me that there is any good solution to this
> > problem.  Fixing it on the server side would require a wire protocol
> > change - e.g. one kind of Sync message that is used in a
> > Parse-Bind-Describe-Execute-Sync sequence that only terminates
> > non-COPY commands and another kind that is used to signal the end even
> > of COPY.  Fixing it on the client side would require all clients to
> > know prior to initiating an extended-query-protocol sequence whether
> > or not the command was going to initiate COPY, which is an awful API
> > even if didn't constitute an impossible-to-contemplate backward
> > compatibility break.  Perhaps we will have to be content to document
> > the fact that this part of the protocol is depressingly broken...
> >
> > ...unless of course somebody can see something that I'm missing here
> > and the situation isn't as bad as it currently appears to me to be.
>
> Anybody have any thoughts on this?
>
>
> I've been thinking on it a bit, but don't really have anything that can be
> done without a protocol version bump.
>
> We can't really disallow extended query protocol COPY, too much is likely
> to break. And we can't fix it without a protocol change.
>
> A warning in the docs for COPY would be appropriate, noting that clients
> should use the simple query protocol to issue COPY. It's kind of mixing
> layers, since many users won't see the protocol level or have any idea if
> their client driver uses ext or simple query, but we can at least advise
> libpq users.
>
> Also in the protocol docs, noting that clients sending COPY should prefer
> the simple query protocol due to error recovery issues with COPY and
> extended query protocol.
>


Forgive my ignorance, but is this issue related to the Catch-22 I had with
"COPY as a set returning function", wherein a function that invokes
BeginCopyFrom() basically starts a result set, but then ends it in order to
do the BeginCopyFrom() with NULL (meaning STDIN) as the file, so that when
the results from the copy come back, the 'T' record that was going to preface
the 'D' records emitted by the function is now gone?


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs  wrote:
> On 15 February 2017 at 08:07, Masahiko Sawada  wrote:
>> It's a bug. Attached latest version patch, which passed make check.
>
> In its current form, I'm not sure this is a good idea. Problems...
>
> 1. I'm pretty sure the world doesn't need another VACUUM parameter
>
> I suggest that we use the existing vacuum scale factor/4 to reflect
> that indexes are more sensitive to bloat.

I do not think it's a good idea to control multiple behaviors with a
single GUC.  We don't really know that dividing by 4 will be right for
everyone, or even for most people.  It's better to have another
parameter with a sensible default than to hardcode a ratio that might
work out poorly for some people.

> 2. The current btree vacuum code requires 2 vacuums to fully reuse
> half-dead pages. So skipping an index vacuum might mean that second
> index scan never happens at all, which would be bad.

Maybe.  If there are a tiny number of those half-dead pages in a huge
index, it probably doesn't matter.  Also, I don't think it would never
happen, unless the table just never gets any more updates or deletes -
but that case could also happen today.  It's just a matter of
happening less frequently.

I guess the question is whether the accumulation of half-dead pages in
the index could become a problem before the unsetting of all-visible
bits in the heap becomes a problem.  If the second one always happens
first, then we don't have an issue here, but if it's possible for the
first one to become a big problem before the second one gets to be a
serious issue, then we need something more sophisticated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


oldhtml-stamp (was Re: [HACKERS] duplicate "median" entry in doc)

2017-02-16 Thread Tom Lane
Fabien COELHO  writes:
> While testing with "oldhtml" I found that the "maintainer-clean" targets 
> in "sgml" does not clean enough: it lacks cleaning the "oldhtml-stamp".

Yeah, and .gitignore fails to ignore it, too.  I'm not sure if it's
worth fixing ... Peter, what's the expected half-life of the "oldhtml"
target?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small issue in online devel documentation build

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 1:05 PM, Fabien COELHO  wrote:
> When consulting the online "devel" documentation, for instance for
> "pgbench":
>
> https://www.postgresql.org/docs/devel/static/pgbench.html
>
> Unordered lists  are shown in bold, as can be seen in the -M option and
> in the description of random functions. This is because "ul" is in class
> "c2" in the html:
>
>  ...
>
> I cannot reproduce this when generating html for head, so I gather that the
> web-site doc generation must be based on a different transformations.
>
> My point is that some xsl file should be fixed somewhere, probably, to avoid
> that.

Yeah, I noticed this too, and I think Ashutosh also complained about
it on another thread.  It's pretty annoying - I hope somebody can
figure out the cause and fix it.  It seems to be only affecting the
development docs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Small issue in online devel documentation build

2017-02-16 Thread Fabien COELHO


When consulting the online "devel" documentation, for instance for 
"pgbench":


https://www.postgresql.org/docs/devel/static/pgbench.html

Unordered lists  are shown in bold, as can be seen in the -M option 
and in the description of random functions. This is because "ul" is in 
class "c2" in the html:


 ...

I cannot reproduce this when generating html for head, so I gather that 
the web-site doc generation must be based on a different transformations.


My point is that some xsl file should be fixed somewhere, probably, to 
avoid that.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] case_preservation_and_insensitivity = on

2017-02-16 Thread Joel Jacobson
On Thu, Feb 16, 2017 at 6:53 AM, Tom Lane  wrote:
> Have you read any of our innumerable previous discussions about this?

No, sorry, didn't see them, thanks for sharing the links.

> The short answer is that nobody can see a way to modify the identifier
> case-folding rules that isn't going to add more pain than it subtracts.
> And much of the added pain will be felt by people who aren't getting
> any benefit, who will therefore be vociferously against the whole thing.

I've read the discussion and have an idea:

When case preservation is turned on, simply enforce
UNIQUE(LOWER(object_name)) to prevent ambiguity.

If all objects' lowercase names are unique, but the casing is
preserved, then a user who later on suffers from problems with
external tools that work poorly with non-lowercase object names could
simply switch back to lowercase object names by changing the GUC.

OTOH, if we do not enforce lowercase uniqueness, there would be a risk
that two objects with different casing would have conflicting lowercase
names, and then a user who later runs into problems with some external
tool would have a serious problem, since switching to lowercase would
not be an option.
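
A user-level analogue of the proposed rule, just to illustrate the effect
(an expression index on an ordinary table, not an actual catalog change):

create table objects (name text);
create unique index objects_lower_name_key on objects (lower(name));

insert into objects values ('MyTable');   -- accepted, casing preserved
insert into objects values ('mytable');   -- rejected: lower(name) collides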


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] duplicate "median" entry in doc

2017-02-16 Thread Fabien COELHO



[ scratches head... ]


Magic:-)

While testing with "oldhtml" I found that the "maintainer-clean" targets 
in "sgml" does not clean enough: it lacks cleaning the "oldhtml-stamp".


See attached very minor fix.

--
Fabien.

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index fe7ca65..897c9c9 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -410,6 +410,6 @@ distclean: clean
 
 maintainer-clean: distclean
 # HTML
-	rm -fr html/ html-stamp
+	rm -fr html/ html-stamp oldhtml-stamp
 # man
 	rm -rf man1/ man3/ man7/ man-stamp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VOPS: vectorized executor for Postgres: how to speedup OLAP queries more than 10 times without changing anything in Postgres executor

2017-02-16 Thread Thom Brown
On 16 February 2017 at 17:00, Konstantin Knizhnik
 wrote:
> More progress in the vectorized Postgres extension (VOPS). It is no longer
> required to use special functions in queries.
> You can use vector operators in a query with standard SQL and still get a
> tenfold improvement on some queries.
> The VOPS extension now uses a post-parse-analyze hook to transform the query.
> I am really impressed by the flexibility and extensibility of the Postgres
> type system. User-defined types do most of the work.
>
> It is still the responsibility of the programmer or database administrator
> to create proper projections of the original table. These projections need
> to use tile types for some attributes (vops_float4, ...).
> Then you can query this table using standard SQL. And this query will be
> executed using vector operations!
>
> Example of such TPC-H queries:
>
> Q1:
> select
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
> sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> from
> vops_lineitem_projection
> where
> l_shipdate <= '1998-12-01'::date
> group by
> l_returnflag,
> l_linestatus
> order by
> l_returnflag,
> l_linestatus;
>
>
>
> Q6:
> select
> sum(l_extendedprice*l_discount) as revenue
> from
> lineitem_projection
> where
> l_shipdate between '1996-01-01'::date and '1997-01-01'::date
> and l_discount between 0.08 and 0.1
> and l_quantity < 24;

>
> On 13.02.2017 17:12, Konstantin Knizhnik wrote:
>>
>> Hello hackers,
>>
>> There were many discussions concerning possible ways of speeding up
>> Postgres. Different approaches were suggested:
>>
>> - JIT (now we have three different prototype implementations based on
>> LLVM)
>> - Chunked (vectorized) executor
>> - Replacing pull with push
>> - Columnar store (cstore_fdw, IMCS)
>> - Optimizing and improving current executor (reducing tuple deform
>> overhead, function call overhead,...)
>>
>> Obviously the best result can be achieved by combining all of these
>> approaches. But actually they are more or less interchangeable: vectorized
>> execution does not eliminate interpretation overhead, but the overhead is
>> divided by the vector size and becomes less critical.
>>
>> I decided to write a small prototype to estimate the possible speed
>> improvement of a vectorized executor. I created special types representing
>> a "tile" and implemented standard SQL operators for them. So neither the
>> Postgres planner, nor the Postgres executor, nor the Postgres heap manager
>> is changed. But I was able to reach more than a 10 times speed improvement
>> on the TPC-H Q1/Q6 queries!

Impressive work!

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 11:45 AM, Peter Geoghegan  wrote:
> Maybe I'm just being pedantic here, since we both actually want the
> code to do the same thing.

Pedantry from either of us?  Nah...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-16 Thread Peter Geoghegan
On Wed, Feb 15, 2017 at 6:05 PM, Thomas Munro
 wrote:
> Here are some results with your latest patch, using the same test as
> before but this time with SCALE=100 (= 100,000,000 rows).

Cool.

> To reduce the number of combinations I did only unique data and built
> only non-unique indexes with only 'wide' tuples (= key plus a text
> column that holds a 151-character wide string, rather than just the
> key), and also didn't bother with the 1MB memory size as suggested.
> Here are the results up to 4 workers (a results table going up to 8
> workers is attached, since it wouldn't format nicely if I pasted it
> here).

I think that you are still I/O bound in a way that is addressable by
adding more disks. The exception is the text cases, where the patch
does best. (I don't place too much emphasis on that because I know
that in the long term, we'll have abbreviated keys, which will take
some of the sheen off of that.)

> Again, the w = 0 time is seconds, the rest show relative
> speed-up.

I think it's worth pointing out that while there are cases where we
see no benefit from going from 4 to 8 workers, it tends to hardly hurt
at all, or hardly help at all. It's almost irrelevant that the number
of workers used is excessive, at least up until the point when all
cores have their own worker. That's a nice quality for this to have --
the only danger is that we use parallelism when we shouldn't have at
all, because the serial case could manage an internal sort, and the
sort was small enough that that could be a notable factor.

> "textwide" "asc" is nearly an order of magnitude faster than other
> initial orders without parallelism, but then parallelism doesn't seem
> to help it much.  Also, using more that 64MB doesn't ever seem to help
> very much; in the "desc" case it hinders.

Maybe it's CPU cache efficiency? There are edge cases where multiple
passes are faster than one pass. That's the only explanation I can
think of.

> I was curious to understand how performance changes if we become just
> a bit less correlated (rather than completely uncorrelated or
> perfectly inversely correlated), so I tried out a 'banana skin' case:
> I took the contents of the textwide asc table and copied it to a new
> table, and then moved the 900 words matching 'banana%' to the physical
> end of the heap by deleting and reinserting them in one transaction.

A likely problem with that is that most runs will actually not have
their own banana skin, so to speak. You only see a big drop in
performance when every quicksort operation has presorted input, but
with one or more out-of-order tuples at the end. In order to see a
really unfortunate case with parallel CREATE INDEX, you'd probably
have to have enough memory that workers don't need to do their own
merge (and so worker's work almost entirely consists of one big
quicksort operation), with enough "banana skin heap pages" that the
parallel heap scan is pretty much guaranteed to end up giving "banana
skin" (out of order) tuples to every worker, making all of them "have
a slip" (throw away a huge amount of work as the presorted
optimization is defeated right at the end of its sequential read
through).

A better approach would be to have several small localized areas
across input where input tuples are a little out of order. That would
probably show that the performance is pretty in line with random
cases.

> It's hard to speculate about this, but I guess that a significant
> number of indexes in real world databases might be uncorrelated to
> insert order.

That would certainly be true with text, where we see a risk of (small)
regressions.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VOPS: vectorized executor for Postgres: how to speedup OLAP queries more than 10 times without changing anything in Postgres executor

2017-02-16 Thread Konstantin Knizhnik
More progress in the vectorized Postgres extension (VOPS). It is no longer
required to use special functions in queries.
You can use vector operators in a query with standard SQL and still get a
tenfold improvement on some queries.

The VOPS extension now uses a post-parse-analyze hook to transform the query.
I am really impressed by the flexibility and extensibility of the Postgres
type system. User-defined types do most of the work.


It is still the responsibility of the programmer or database administrator
to create proper projections of the original table. These projections need
to use tile types for some attributes (vops_float4, ...).
Then you can query this table using standard SQL. And this query will be
executed using vector operations!
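
As an illustration, such a projection might be declared roughly like this
(the column list and every type name other than vops_float4, which is
mentioned above, are guesses based on the queries below rather than the
extension's documented schema):

create extension vops;

create table vops_lineitem_projection (
    l_shipdate      vops_date,      -- assumed tile type for the date filter
    l_quantity      vops_float4,
    l_extendedprice vops_float4,
    l_discount      vops_float4,
    l_tax           vops_float4,
    l_returnflag    "char",         -- group-by keys kept as ordinary scalars
    l_linestatus    "char"
);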


Example of such TPC-H queries:

Q1:
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
vops_lineitem_projection
where
l_shipdate <= '1998-12-01'::date
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;



Q6:
select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01'::date and '1997-01-01'::date
and l_discount between 0.08 and 0.1
and l_quantity < 24;



On 13.02.2017 17:12, Konstantin Knizhnik wrote:

Hello hackers,

There were many discussions concerning possible ways of speeding up
Postgres. Different approaches were suggested:


- JIT (now we have three different prototype implementations based on 
LLVM)

- Chunked (vectorized) executor
- Replacing pull with push
- Columnar store (cstore_fdw, IMCS)
- Optimizing and improving current executor (reducing tuple deform 
overhead, function call overhead,...)


Obviously the best result can be achieved by combining all of these
approaches. But actually they are more or less interchangeable: vectorized
execution does not eliminate interpretation overhead, but the overhead is
divided by the vector size and becomes less critical.


I decided to write a small prototype to estimate the possible speed
improvement of a vectorized executor. I created special types representing
a "tile" and implemented standard SQL operators for them. So neither the
Postgres planner, nor the Postgres executor, nor the Postgres heap manager
is changed. But I was able to reach more than a 10 times speed improvement
on the TPC-H Q1/Q6 queries!


Please find more information here: 
https://cdn.rawgit.com/postgrespro/vops/ddcbfbe6/vops.html
The sources of the project can be found here: 
https://github.com/postgrespro/vops.git




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-16 Thread Peter Geoghegan
On Thu, Feb 16, 2017 at 6:28 AM, Robert Haas  wrote:
> On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan  wrote:
>> At the risk of stating the obvious, ISTM that the right way to do
>> this, at a high level, is to err on the side of unneeded extra
>> unlink() calls, not leaking files. And, to make the window for problem
>> ("remaining hole that you haven't quite managed to plug") practically
>> indistinguishable from no hole at all, in a way that's kind of baked
>> into the API.
>
> I do not think there should be any reason why we can't get the
> resource accounting exactly correct here.  If a single backend manages
> to remove every temporary file that it creates exactly once (and
> that's currently true, modulo system crashes), a group of cooperating
> backends ought to be able to manage to remove every temporary file
> that any of them create exactly once (again, modulo system crashes).

I believe that we are fully in agreement here. In particular, I think
it's bad that there is an API that says "caller shouldn't throw an
elog error between these two points", and that will be fixed before
too long. I just think that it's worth acknowledging a certain nuance.

> I do agree that a duplicate unlink() call isn't as bad as a missing
> unlink() call, at least if there's no possibility that the filename
> could have been reused by some other process, or some other part of
> our own process, which doesn't want that new file unlinked.  But it's
> messy.  If the seatbelts in your car were to randomly unbuckle, that
> would be a safety hazard.  If they were to randomly refuse to
> unbuckle, you wouldn't say "that's OK because it's not a safety
> hazard", you'd say "these seatbelts are badly designed".  And I think
> the same is true of this mechanism.

If it happened in the lifetime of only one out of a million seatbelts
manufactured, and they were manufactured at a competitive price (not
over-engineered), I probably wouldn't say that. The fact that the
existing resource manger code only LOGs most temp file related
failures suggests to me that that's a "can't happen" condition, but we
still hedge. I would still like to hedge against even (theoretically)
impossible risks.

Maybe I'm just being pedantic here, since we both actually want the
code to do the same thing.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] duplicate "median" entry in doc

2017-02-16 Thread Tom Lane
Fabien COELHO  writes:
> I confirm that "oldhtml" does not generate the "see also" in head, 
> probably because the first termindex/primary is kept and overrides the 
> second somehow, which is consistent with the warning.

> However, after removing the duplicate, both "oldhtml" (openjade 1.3, 
> issues with 1.4) and "html" (xsltproc) generate a correct "median" index 
> with its "see also" subsection, in "bookindex.html":

[ scratches head... ]  Coulda sworn I tried that and it didn't do what
I wanted.  But testing now, it does, so pushed.  Thanks for pointing
out my error.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] kNN for btree

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 10:59 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Feb 16, 2017 at 8:05 AM, Alexander Korotkov
>>  wrote:
>>> My idea is that we need more general redesign of specifying ordering which
>>> index can produce.  Ideally, we should replace amcanorder, amcanbackward and
>>> amcanorderbyop with single callback.  Such callback should take a list of
>>> pathkeys and return number of leading pathkeys index could satisfy (with
>>> corresponding information for index scan).  I'm not sure that other hackers
>>> would agree with such design, but I'm very convinced that we need something
>>> of this level of extendability.  Otherwise we would have to hack our planner
>>> <-> index_access_method interface each time we decide to cover another index
>>> produced ordering.
>
>> Yeah.  I'm not sure if that's exactly the right idea.  But it seems
>> like we need something.
>
> That's definitely not exactly the right idea, because using it would
> require the core planner to play twenty-questions trying to guess which
> pathkeys the index can satisfy.  ("Can you satisfy some prefix of this
> pathkey list?  How about that one?")  It could be sensible to have a
> callback that's called once per index and hands back a list of pathkey
> lists that represent interesting orders the index could produce, which
> could be informed by looking aside at the PlannerInfo contents to see
> what is likely to be relevant to the query.
>
> But even so, I'm not convinced that that is a better design or more
> maintainable than the current approach.  I fear that it will lead to
> duplicating substantial amounts of code and knowledge into each index AM,
> which is not an improvement; and if anything, that increases the risk of
> breaking every index AM anytime you want to introduce some fundamentally
> new capability in the area.  Now that it's actually practical to have
> out-of-core index AMs, that's a bigger concern than it might once have
> been.

Yeah, that's all true.  But I think Alexander is right that just
adding amcandoblah flags ad infinitum doesn't feel good either.  The
interface isn't really arm's-length if every new thing somebody wants
to do requires another flag.

> Also see the discussion that led up to commit ed0097e4f.  Users objected
> the last time we tried to make index capabilities opaque at the SQL level,
> so they're not going to like a design that tries to hide that information
> even from the core C code.

Discoverability is definitely important, but first we have to figure
out how we're going to make it work, and then we can work out how to
let users see how it works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Peter Geoghegan
But surely it should be possible to use DO NOTHING without inferring some
particular unique index? That's possible with an approach based on
inheritance.

--
Peter Geoghegan
(Sent from my phone)


Re: [HACKERS] duplicate "median" entry in doc

2017-02-16 Thread Fabien COELHO



> > When trying to build the documentation there is a minor warning:
> >    collateindex.pl: duplicated index entry found: MEDIAN
>
> See
> https://www.postgresql.org/message-id/29262.1483053...@sss.pgh.pa.us
>
> I've been waiting for some clarification on that before attempting
> to fix this.  In any case, your proposed patch would remove the main
> index entry, which doesn't seem like what we want.


Hmmm. I do not get it.

I confirm that "oldhtml" does not generate the "see also" in head, 
probably because the first indexterm/primary is kept and overrides the 
second somehow, which is consistent with the warning.


However, after removing the duplicate, both "oldhtml" (openjade 1.3, 
issues with 1.4) and "html" (xsltproc) generate a correct "median" index 
with its "see also" subsection, in "bookindex.html":


  ...
  MD5, Password Authentication
  median, Aggregate Expressions
  (see also percentile)
  memory context
  ...

So this really seems OK to me... The only difference is that with the new 
toolchain the percentile is a link and there are parentheses, while with 
oldhtml it is just text.


Am I missing something?

--
Fabien.




Re: [HACKERS] drop support for Python 2.3

2017-02-16 Thread Devrim Gündüz

Hi Peter,

On Thu, 2017-02-16 at 08:21 -0500, Peter Eisentraut wrote:
> > I have CentOS 5 instances running on buildfarm. I'll register them via
> > buildfarm.pg.org soon.
> 
> I will wait for that before proceeding.

Sorry for the delay. The machines are ready; I think I can prepare the
buildfarm instances later on Saturday.

Regards,
-- 
Devrim Gündüz
EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR




Re: [HACKERS] [PATCH] kNN for btree

2017-02-16 Thread Tom Lane
Robert Haas  writes:
> On Thu, Feb 16, 2017 at 8:05 AM, Alexander Korotkov
>  wrote:
>> My idea is that we need more general redesign of specifying ordering which
>> index can produce.  Ideally, we should replace amcanorder, amcanbackward and
>> amcanorderbyop with single callback.  Such callback should take a list of
>> pathkeys and return number of leading pathkeys index could satisfy (with
>> corresponding information for index scan).  I'm not sure that other hackers
>> would agree with such design, but I'm very convinced that we need something
>> of this level of extendability.  Otherwise we would have to hack our planner
>> <-> index_access_method interface each time we decide to cover another index
>> produced ordering.

> Yeah.  I'm not sure if that's exactly the right idea.  But it seems
> like we need something.

That's definitely not exactly the right idea, because using it would
require the core planner to play twenty-questions trying to guess which
pathkeys the index can satisfy.  ("Can you satisfy some prefix of this
pathkey list?  How about that one?")  It could be sensible to have a
callback that's called once per index and hands back a list of pathkey
lists that represent interesting orders the index could produce, which
could be informed by looking aside at the PlannerInfo contents to see
what is likely to be relevant to the query.
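
As a rough illustration only (the name and exact signature here are
hypothetical, not an existing PostgreSQL API), such a per-index callback
might be shaped like this:

/* Hypothetical sketch: called once per index, it returns a List of pathkey
 * Lists describing orderings the index could produce, and may look at root
 * to skip orderings that are irrelevant to the current query. */
typedef List *(*aminterestingorders_function) (PlannerInfo *root,
                                               IndexOptInfo *index);

The planner would then only have to match the returned pathkey lists against
the query's pathkeys, rather than guessing which orderings to ask about.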

But even so, I'm not convinced that that is a better design or more
maintainable than the current approach.  I fear that it will lead to
duplicating substantial amounts of code and knowledge into each index AM,
which is not an improvement; and if anything, that increases the risk of
breaking every index AM anytime you want to introduce some fundamentally
new capability in the area.  Now that it's actually practical to have
out-of-core index AMs, that's a bigger concern than it might once have
been.

Also see the discussion that led up to commit ed0097e4f.  Users objected
the last time we tried to make index capabilities opaque at the SQL level,
so they're not going to like a design that tries to hide that information
even from the core C code.

regards, tom lane




Re: [HACKERS] Parallel Index-only scan

2017-02-16 Thread Amit Kapila
On Thu, Feb 16, 2017 at 3:57 PM, Rafia Sabih
 wrote:
>
>
> On Thu, Feb 16, 2017 at 3:40 PM, Rafia Sabih 
> wrote:
>>
>>
>> Please find the attached patch for rebased and cleaner version.
>>
> Please find the attached patch with a minor comment update.
>

Few comments:

1.
+ * ioss_PscanLen   This is needed for parallel index scan
  * 
  */
 typedef struct IndexOnlyScanState
@@ -1427,6 +1428,7 @@ typedef struct IndexOnlyScanState
  IndexScanDesc ioss_ScanDesc;
  Buffer ioss_VMBuffer;
  long ioss_HeapFetches;
+ Size ioss_PscanLen; /* This is needed for parallel index scan */

There is no need to repeat the same comment in multiple places.  I think
you can keep it on top of the structure.  The explanation could be
something like "size of parallel index-only scan descriptor".

2.
+ node->ioss_ScanDesc->xs_want_itup = true;

I think wherever you are initializing xs_want_itup, you should
initialize ioss_VMBuffer as well.  Is there a reason for not doing so?


3.
explain (costs off)
  select  sum(parallel_restricted(unique1)) from tenk1
  group by(parallel_restricted(unique1));
- QUERY PLAN
-
+QUERY PLAN
+---
  HashAggregate
Group Key: parallel_restricted(unique1)
-   ->  Index Only Scan using tenk1_unique1 on tenk1
-(3 rows)
+   ->  Gather
+ Workers Planned: 4
+ ->  Parallel Index Only Scan using tenk1_unique1 on tenk1
+(5 rows)

It doesn't look good to exercise parallel index-only scan in a test case
that is meant to test a restricted function.  The comments atop that test
look odd as a result.  I suggest adding a separate test (both EXPLAIN and
actual execution of the query) for parallel index-only scan, as we have
for parallel plans of other queries, and seeing whether we can keep the
output of the existing test the same.

4.
ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
{
..
+ /*
+ * if we are here to just update the scan keys, then don't reset parallel
+ * scan
+ */
+ if (node->ioss_NumRuntimeKeys != 0 && !node->ioss_RuntimeKeysReady)
+ reset_parallel_scan = false;
..
}

I think here you can update the comment to indicate that the detailed
reasoning can be found in ExecReScanIndexScan.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-02-16 Thread Dilip Kumar
On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas  wrote:
> What is the point of having a TBMSharedIterator contain a TIDBitmap
> pointer?  All the values in that TIDBitmap are populated from the
> TBMSharedInfo, but it seems to me that the fields that are copied over
> unchanged (like status and nchunks) could just be used directly from
> the TBMSharedInfo, and the fields that are computed (like relpages and
> relchunks) could be stored directly in the TBMSharedIterator.
> tbm_shared_iterate() is already separate code from tbm_iterate(), so
> it can be changed to refer to whichever data structure contains the
> data it needs.

Done

>
> Why is TBMSharedInfo separate from TBMSharedIteratorState?  It seems
> like you could merge those two into a single structure.

Done
>
> I think we can simplify things here a bit by having shared iterators
> not support single-page mode.  In the backend-private case,
> tbm_begin_iterate() really doesn't need to do anything with the
> pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
> got to anyway copy the pagetable into shared memory.  So it seems to
> me that it would be simpler just to transform it into a standard
> iteration array while we're at it, instead of copying it into entry1.
> In other words, I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

Done

>
> I think "dsa_data" is poorly named; it should be something like
> "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo.  I
> think you should should move the "base", "relpages", and "relchunks"
> into TBMSharedIterator and give them better names, like "ptbase",
> "ptpages" and "ptchunks".  "base" isn't clear that we're talking about
> the pagetable's base as opposed to anything else, and "relpages" and
> "relchunks" are named based on the fact that the pointers are relative
> rather than named based on what data they point at, which doesn't seem
> like the right choice.

Done
>
> I suggest putting the parallel functions next to each other in the
> file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
> tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
> tbm_end_shared_iterate().

Done
>
> +if (tbm->dsa == NULL)
> +return pfree(pointer);
>
> Don't try to return a value of type void.  The correct spelling of
> this is { pfree(pointer); return; }.  Formatted appropriately across 4
> lines, of course.

Fixed

>
> +/*
> + * If TBM is in iterating phase that means pagetable is already created
> + * and we have come here during tbm_free. By this time we are already
> + * detached from the DSA because the GatherNode would have been shutdown.
> + */
> +if (tbm->iterating)
> +return;
>
> This seems like something of a kludge, although not a real easy one to
> fix.  The problem here is that tidbitmap.c ideally shouldn't have to
> know that it's being used by the executor or that there's a Gather
> node involved, and certainly not the order of operations around
> shutdown.  It should really be the problem of 0002 to handle this kind
> of problem, without 0001 having to know anything about it.  It strikes
> me that it would probably be a good idea to have Gather clean up its
> children before it gets cleaned up itself.  So in ExecShutdownNode, we
> could do something like this:
>
> diff --git a/src/backend/executor/execProcnode.c
> b/src/backend/executor/execProcnode.c
> index 0dd95c6..5ccc2e8 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
>  if (node == NULL)
>  return false;
>
> +planstate_tree_walker(node, ExecShutdownNode, NULL);
> +
>  switch (nodeTag(node))
>  {
>  case T_GatherState:
> @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
>  break;
>  }
>
> -return planstate_tree_walker(node, ExecShutdownNode, NULL);
> +return false;
>  }
>
> Also, in ExecEndGather, something like this:
>
> diff --git a/src/backend/executor/nodeGather.c
> b/src/backend/executor/nodeGather.c
> index a1a3561..32c97d3 100644
> --- a/src/backend/executor/nodeGather.c
> +++ b/src/backend/executor/nodeGather.c
> @@ -229,10 +229,10 @@ ExecGather(GatherState *node)
>  void
>  ExecEndGather(GatherState *node)
>  {
> +ExecEndNode(outerPlanState(node));  /* let children clean up first */
>  ExecShutdownGather(node);
>  ExecFreeExprContext(&node->ps);
>  ExecClearTuple(node->ps.ps_ResultTupleSlot);
> -ExecEndNode(outerPlanState(node));
>  }
>
>  /*
>
> With those changes and an ExecShutdownBitmapHeapScan() called from
> ExecShutdownNode(), it should be possible (I think) for us to always
> have the bitmap heap scan node shut down before the Gather node shuts
> down, which I think would let you avoid having a special case for this
> inside the TBM code.

Done

Re: [HACKERS] duplicate "median" entry in doc

2017-02-16 Thread Tom Lane
Fabien COELHO  writes:
> When trying to build the documentation there is a minor warning:
>collateindex.pl: duplicated index entry found: MEDIAN

See
https://www.postgresql.org/message-id/29262.1483053...@sss.pgh.pa.us

I've been waiting for some clarification on that before attempting
to fix this.  In any case, your proposed patch would remove the main
index entry, which doesn't seem like what we want.

regards, tom lane




Re: [HACKERS] [PATCH] kNN for btree

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 8:05 AM, Alexander Korotkov
 wrote:
> My idea is that we need more general redesign of specifying ordering which
> index can produce.  Ideally, we should replace amcanorder, amcanbackward and
> amcanorderbyop with single callback.  Such callback should take a list of
> pathkeys and return number of leading pathkeys index could satisfy (with
> corresponding information for index scan).  I'm not sure that other hackers
> would agree with such design, but I'm very convinced that we need something
> of this level of extendability.  Otherwise we would have to hack our planner
> <-> index_access_method interface each time we decide to cover another index
> produced ordering.

Yeah.  I'm not sure if that's exactly the right idea.  But it seems
like we need something.

>> info->amcanorderbyop = (void (*)()) amroutine->amcanorderbyop;
>
> It's not necessary to use cast here.  For instance, we don't use cast for
> amcostestimate.

In fact, it's bad to use the cast here, because if in the future the
signature of amroutine->amcanorderbyop is changed and
info->amcanorderbyop is not changed to match, then the cast will
prevent a compiler warning, but at runtime you may crash.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] duplicate "median" entry in doc

2017-02-16 Thread Fabien COELHO


When trying to build the documentation there is a minor warning:

  collateindex.pl: duplicated index entry found: MEDIAN

Indeed, the "median" index term is specified twice in "syntax.sgml". The 
attached patch removes the warning.


--
Fabien.

diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 4ea667b..40f722c 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -1699,9 +1699,6 @@ SELECT string_agg(a ORDER BY a, ',') FROM table;  -- incorrect
 
     <indexterm>
      <primary>median</primary>
-    </indexterm>
-    <indexterm>
-     <primary>median</primary>
      <seealso>percentile</seealso>
     </indexterm>
 An example of an ordered-set aggregate call is:



Re: [HACKERS] UPDATE of partition key

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark  wrote:
> On 13 February 2017 at 12:01, Amit Khandekar  wrote:
>> There are a few things that can be discussed about :
>
> If you do a normal update the new tuple is linked to the old one using
> the ctid forming a chain of tuple versions. This tuple movement breaks
> that chain.  So the question I had reading this proposal is what
> behaviour depends on ctid and how is it affected by the ctid chain
> being broken.

I think this is a good question.

> I think the concurrent update case is just a symptom of this. If you
> try to update a row that's locked for a concurrent update you normally
> wait until the concurrent update finishes, then follow the ctid chain
> and recheck the where clause on the target of the link and if it still
> matches you perform the update there.

Right.  EvalPlanQual behavior, in short.

> At least you do that if you have isolation_level set to
> repeatable_read. If you have isolation level set to serializable then
> you just fail with a serialization failure. I think that's what you
> should do if you come across a row that's been updated with a broken
> ctid chain even in repeatable read mode. Just fail with a
> serialization failure and document that in partitioned tables if you
> perform updates that move tuples between partitions then you need to
> be ensure your updates are prepared for serialization failures.

Now, this part I'm not sure about.  What's pretty clear is that,
barring some redesign of the heap format, we can't keep the CTID chain
intact when the tuple moves to a different relfilenode.  What's less
clear is what to do about that.  We can either (1) give up on
EvalPlanQual behavior in this case and act just as we would if the row
had been deleted; no update happens or (2) throw a serialization
error.  You're advocating for #2, but I'm not sure that's right,
because:

1. It's a lot more work,

2. Your proposed implementation needs an on-disk format change that
uses up a scarce infomask bit, and

3. It's not obvious to me that it's clearly preferable from a user
experience standpoint.  I mean, either way the user doesn't get the
behavior that they want.  Either they're hoping for EPQ semantics and
they instead do a no-op update, or they're hoping for EPQ semantics
and they instead get an ERROR.  Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.

More opinions welcome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-02-16 Thread David Fetter
On Thu, Feb 16, 2017 at 03:39:30PM +0530, Amit Khandekar wrote:
>  and then run ExecFindPartition()
>  again using the root. Will check. I am not sure right now how involved
>  that would turn out to be, but I think that logic would not change the
>  existing code, so in that sense it is not invasive.
> >>>
> >>> I couldn't understand why run ExecFindPartition() again on the root
> >>> partitioned table, can you clarify?  ISTM, we just want to tell the user
> >>> in the HINT that trying the same update query with root partitioned table
> >>> might work. I'm not sure if it would work instead to find some
> >>> intermediate partitioned table (that is, between the root and the one that
> >>> update query was tried with) to include in the HINT.
> >>
> >> What I had in mind was : Give that hint only if there *was* a
> >> subpartition that could accommodate that row. And if found, we can
> >> only include the subpartition name.
> >
> > Asking to try the update query with the root table sounds like a good
> > enough hint.  Trying to find the exact sub-partition (I assume you mean to
> > imply sub-tree here) seems like an overkill, IMHO.
> Yeah ... I was thinking , anyways it's an error condition, so why not
> let the server spend a bit more CPU and get the right sub-partition
> for the message. If we decide to write code to find the root
> partition, then it's just a matter of another function
> ExecFindPartition().
> 
> Also, I was thinking : give the hint *only* if we know there is a
> right sub-partition. Otherwise, it might distract the user.

If this is relatively straight-forward, it'd be great.  More
actionable knowledge is better.

Thanks for taking this on.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] New CORRESPONDING clause design

2017-02-16 Thread Surafel Temsgen
Here is the implementation of the clause, with one slight change: instead of
doing column mapping for each side's leaf queries in the planner, I make the
projection nodes output only the corresponding column lists.

This patch compiles and passes the tests against the master branch on
ubuntu-15.10-desktop-amd64.  It also includes documentation and new
regression tests in union.sql.

Regards

Surafel Temesgen



On Tue, Jan 17, 2017 at 8:21 AM, Tom Lane  wrote:

> Surafel Temsgen  writes:
> >  My design is
> >  *In parsing stage*
> > 1. Check at least one common column name appear in queries
> > 2. If corresponding column list is not specified, then make corresponding
> > list from common column name in queries target lists in the order
> > that those column names appear in first query
> > 3. If corresponding column list is specified, then check that every
> column
> > name
> > in The corresponding column list appears in column names of both
> queries
> > *In planner*
> > 1. Take projection columen name from corresponding list
> > 2. Reorder left and right queries target lists according to corresponding
> > list
> > 3. For each query, project columens as many as number of corresponding
> > columen.
>
> FWIW, I think you need to do most of that work in the parser, not the
> planner.  The parser certainly has to determine the actual output column
> names and types of the setop construct, and we usually consider that the
> parsing stage is expected to fully determine the semantics of the query.
> So I'd envision the parser as emitting some explicit representation of
> the column mapping for each side, rather than expecting the planner to
> determine for itself what maps to what.
>
> It probably does make sense for the actual implementation to involve
> inserting projection nodes below the setop.  I'm pretty sure the executor
> code for setops expects to just pass input tuples through without
> projection.  You could imagine changing that, but it would add a penalty
> for non-CORRESPONDING setops, which IMO would be the wrong tradeoff.
>
> regards, tom lane
>


corresponding_clause_v1.patch
Description: Binary data



Re: [HACKERS] Patch to implement pg_current_logfile() function

2017-02-16 Thread Robert Haas
On Wed, Feb 15, 2017 at 4:03 PM, Alvaro Herrera
 wrote:
> So what is going on here is that SysLogger_Start() wants to unlink the
> current-logfile file if the collector is not enabled.  This should
> probably be split out into a separate new function, for two reasons:
> first, it doesn't seem good idea to have SysLogger_Start do something
> other than start the logger; and second, so that we don't have a syscall
> on each ServerLoop iteration.  That new function should be called from
> some other place -- perhaps reaper() and just before entering
> ServerLoop, so that the file is deleted if the syslogger goes away or is
> not started.

I think it's sufficient to just remove the file once on postmaster
startup before trying to launch the syslogger for the first time.
logging_collector is PGC_POSTMASTER, so if it's not turned on when the
cluster first starts, it can't be activated later.  If it dies, that
doesn't seem like a reason to remove the file.  We're going to restart
the syslogger, and when we do, it can update the file.
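
A minimal sketch of that approach (assuming a removal helper and a file-name
constant along the lines of what the patch introduces; the names here are
illustrative, not a definitive implementation):

/* Called once at postmaster startup, before the first SysLogger_Start():
 * if the collector is disabled, any current-logfile metadata left over
 * from a previous run is stale, so remove it and forget about it. */
static void
RemoveStaleLogMetainfoFile(void)
{
    if (!Logging_collector &&
        unlink(LOG_METAINFO_DATAFILE) < 0 && errno != ENOENT)
        ereport(LOG,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m",
                        LOG_METAINFO_DATAFILE)));
}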

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 1:34 AM, Amit Khandekar  wrote:
>> What I was thinking about is something like this:
>>
>> 1. First, take the maximum parallel_workers value from among all the 
>> children.
>>
>> 2. Second, compute log2(num_children)+1 and round up.  So, for 1
>> child, 1; for 2 children, 2; for 3-4 children, 3; for 5-8 children, 4;
>> for 9-16 children, 5, and so on.
>>
>> 3. Use as the number of parallel workers for the children the maximum
>> of the value computed in step 1 and the value computed in step 2.
>
> Ah, now that I closely look at compute_parallel_worker(), I see what
> you are getting at.
>
> For plain unpartitioned table, parallel_workers is calculated as
> roughly equal to log(num_pages) (actually it is log3). So if the table
> size is n, the workers will be log(n). So if it is partitioned into p
> partitions of size n/p each, still the number of workers should be
> log(n). Whereas, in the patch, it is calculated as (total of all the
> child workers) i.e. n * log(n/p) for this case. But log(n) != p *
> log(x/p). For e.g. log(1000) is much less than log(300) + log(300) +
> log(300).
>
> That means, the way it is calculated in the patch turns out to be much
> larger than if it were calculated using log(total of sizes of all
> children). So I think for the step 2 above, log(total_rel_size)
> formula seems to be appropriate. What do you think ? For
> compute_parallel_worker(), it is actually log3 by the way.
>
> BTW this formula is just an extension of how parallel_workers is
> calculated for an unpartitioned table.

log(total_rel_size) would be a reasonable way to estimate workers when
we're scanning an inheritance hierarchy, but I'm hoping Parallel
Append is also going to apply to UNION ALL queries, where there's no
concept of the total rel size.  For that we need something else, which
is why the algorithm that I proposed upthread doesn't rely on it.
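
For reference, a sketch of that upthread proposal (illustrative code only,
not the patch itself):

/* Choose the worker count for a Parallel Append: at least the largest
 * parallel_workers among the children (step 1), but also at least
 * ceil(log2(nchildren)) + 1 (step 2), taking the maximum of the two
 * values (step 3). */
static int
append_parallel_workers(int max_child_workers, int nchildren)
{
    int     min_by_fanout = (int) ceil(log2((double) Max(nchildren, 1))) + 1;

    return Max(max_child_workers, min_by_fanout);
}

With 1 child that gives 1 worker, with 2 children 2, with 3-4 children 3,
with 5-8 children 4, and so on, regardless of how small each child is.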

>> The decision to use fewer workers for a smaller scan isn't really
>> because we think that using more workers will cause a regression.
>> It's because we think it may not help very much, and because it's not
>> worth firing up a ton of workers for a relatively small scan given
>> that workers are a limited resource.  I think once we've got a bunch
>> of workers started, we might as well try to use them.
>
> One possible side-effect I see due to this is : Other sessions might
> not get a fair share of workers due to this. But again, there might be
> counter argument that, because Append is now focussing all the workers
> on a last subplan, it may finish faster, and release *all* of its
> workers earlier.

Right.  I think in general it's pretty clear that there are possible
fairness problems with parallel query.  The first process that comes
along seizes however many workers it thinks it should use, and
everybody else can use whatever (if anything) is left.  In the long
run, I think it would be cool to have a system where workers can leave
one parallel query in progress and join a different one (or exit and
spawn a new worker to join a different one), automatically rebalancing
as the number of parallel queries in flight fluctuates.  But that's
clearly way beyond anything we can do right now.  I think we should
assume that any parallel workers our process has obtained are ours to
use for the duration of the query, and use them as best we can.  Note
that even if the Parallel Append tells one of the workers that there
are no more tuples and it should go away, some higher level of the
query plan could make a different choice anyway; there might be
another Append elsewhere in the plan tree.

> BTW, there is going to be some logic change in the choose-next-subplan
> algorithm if we consider giving extra workers to subplans.

I'm not sure that it's going to be useful to make this logic very
complicated.  I think the most important thing is to give 1 worker to
each plan before we give a second worker to any plan.  In general I
think it's sufficient to assign a worker that becomes available to the
subplan with the fewest number of workers (or one of them, if there's
a tie) without worrying too much about the target number of workers
for that subplan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Simon Riggs
On 16 February 2017 at 14:54, Thom Brown  wrote:
> Hi,
>
> At the moment, partitioned tables have a restriction that prevents
> them allowing INSERT ... ON CONFLICT ... statements:
>
> postgres=# INSERT INTO cities SELECT 1, 'Crawley',105000 ON CONFLICT
> (city_id) DO NOTHING;
> ERROR:  ON CONFLICT clause is not supported with partitioned tables
>
> Why do we have such a restriction?  And what would it take to remove it?

Partitioned tables don't yet support a global unique constraint that
would be required for support of ON CONFLICT processing.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Thom Brown
Hi,

At the moment, partitioned tables have a restriction that prevents
them allowing INSERT ... ON CONFLICT ... statements:

postgres=# INSERT INTO cities SELECT 1, 'Crawley',105000 ON CONFLICT
(city_id) DO NOTHING;
ERROR:  ON CONFLICT clause is not supported with partitioned tables

Why do we have such a restriction?  And what would it take to remove it?

Thanks

Thom




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-16 Thread Amit Kapila
On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas  wrote:
> On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila  wrote:
>> As discussed, attached are refactoring patches and a patch to enable
>> WAL for the hash index on top of them.
>
> Thanks.  I think that the refactoring patches shouldn't add
> START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
> for the final patch.  Nor should their comments reference the idea of
> writing WAL; that should also be for the final patch.
>

Okay, changed as per suggestion.

> PageGetFreeSpaceForMulTups doesn't seem like a great name.
> PageGetFreeSpaceForMultipleTuples?

Okay, changed as per suggestion.

>  Or maybe just use
> PageGetExactFreeSpace and then do the account in the caller.  I'm not
> sure it's really worth having a function just to subtract a multiple
> of sizeof(ItemIdData), and it would actually be more efficient to have
> the caller take care of this, since you wouldn't need to keep
> recalculating the value for every iteration of the loop.
>

I have tried this one, but I think even if we track it outside, we still
need something like PageGetFreeSpaceForMultipleTuples for the cases where
we have to try the next write page(s) in which the data (the set of index
tuples) can be accommodated.
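
For clarity, the helper under discussion amounts to something like this
sketch (not the patch text itself):

/* Like PageGetFreeSpace(), but reserve room for 'ntups' new line pointers
 * instead of just one. */
static Size
PageGetFreeSpaceForMultipleTuples(Page page, int ntups)
{
    Size        space = PageGetExactFreeSpace(page);
    Size        needed = ntups * sizeof(ItemIdData);

    return (space > needed) ? space - needed : 0;
}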

> I think we ought to just rip (as a preliminary patch) out the support
> for freeing an overflow page in the middle of a bucket chain.  That
> code is impossible to hit, I believe, and instead of complicating the
> WAL machinery to cater to a nonexistent case, maybe we ought to just
> get rid of it.
>

I also could not think of a case where we need to care about the page
in the middle of a bucket chain as per its current usage.  In
particular, are you worried about the code related to nextblock number
in _hash_freeovflpage()?   Surely, we can remove it, but what
complexity are you referring to here?   There is some additional
book-keeping for primary bucket page, but there is nothing much about
maintaining a backward link.  One more thing to note is that this is
an exposed API, so to avoid getting it used in some wrong way, we
might want to make it static.

> +   /*
> +* We need to release and if required
> reacquire the lock on
> +* rbuf to ensure that standby
> shouldn't see an intermediate
> +* state of it.
> +*/
>
> How does releasing and reacquiring the lock on the master affect
> whether the standby can see an intermediate state?

I am talking about the intermediate state of the standby with respect to
the master.  We will release the lock on the standby after replaying the
WAL for moving tuples from the read page to the write page, whereas the
master will hold it for a much longer period (until we have moved all the
tuples from the read page and freed that page).  I understand that there
is no correctness issue here, but I was trying to keep the operations on
master and standby similar; it also helps us obey the coding rule of
releasing the lock on buffers only after writing the WAL record.  I see
no harm in maintaining the existing coding pattern; however, if you think
that is not important in this case, then I can change the code to avoid
releasing the lock on the read page.

>  That sounds
> totally wrong to me (and also doesn't belong in a refactoring patch
> anyway since we're not emitting any WAL here).
>

Agreed that even if we want to make such a change, it should be part of
the WAL patch.  So for now, I have reverted that change.

> -   Assert(PageIsNew(page));
>
> This worries me a bit.  What's going on here?
>

This is related to below change.

+ /*
+  * Initialise the freed overflow page.  Here we can't completely zero the
+  * page, as WAL replay routines expect pages to be initialized.  See the
+  * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+  */
+ _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));

Basically, we need this routine to initialize freed overflow pages.
Also, if you see the similar function in btree (_bt_pageinit(Page
page, Size size)), it doesn't have any such Assertion.

Attached are refactoring patches.  WAL patch needs some changes based
on above comments, so will post it later.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-Expose-a-new-API-_hash_pgaddmultitup.patch
Description: Binary data


0002-Expose-an-API-hashinitbitmapbuffer.patch
Description: Binary data


0003-Restructure-_hash_addovflpage.patch
Description: Binary data


0004-Restructure-split-bucket-code.patch
Description: Binary data


0005-Restructure-hash-index-creation.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-02-16 Thread Robert Haas
On Wed, Feb 15, 2017 at 11:15 PM, Ashutosh Bapat
 wrote:
> If the user is ready throw 200 workers and if the subplans can use
> them to speed up the query 200 times (obviously I am exaggerating),
> why not to use those? When the user set
> max_parallel_workers_per_gather to that high a number, he meant it to
> be used by a gather, and that's what we should be doing.

The reason is because of what Amit Khandekar wrote in his email -- you
get a result with a partitioned table that is wildly inconsistent with
the result you get for an unpartitioned table.  You could equally well
argue that if the user sets max_parallel_workers_per_gather to 200,
and there's a parallel sequential scan of an 8MB table to be
performed, we ought to use all 200 workers for that.  But the planner
in fact estimates a much lesser number of workers, because using 200
workers for that task wastes a lot of resources for no real
performance benefit.  If you partition that 8MB table into 100 tables
that are each 80kB, that shouldn't radically increase the number of
workers that get used.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] CREATE SUBSCRIPTION uninterruptable

2017-02-16 Thread Thom Brown
Hi,

I've noticed that when creating a subscription, it can't be
interrupted.  One must wait until it times out, which takes just over
2 minutes.  I'm guessing ALTER SUBSCRIPTION would have the same
problem.

Shouldn't we have an interrupt for this?

Thom




Re: [HACKERS] Partitioned tables and relfilenode

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 6:32 AM, Simon Riggs  wrote:
> Why not vacuum all partitions?
> Why not analyze all partitions?
> Truncate all partitions

I agree.  But, we need to be careful that a database-wide VACUUM or
ANALYZE doesn't hit the partitions multiple times, once for the parent
and again for each child.  Actually, a database-wide VACUUM should hit
each partition individually and do nothing for the parents, but a
database-wide ANALYZE should process the parents and do nothing for
the children, so that the inheritance statistics get updated.

>> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>>   tables, because there is nothing to be done.
>>
>> - Since we cannot create indexes on partitioned tables anyway, there is
>>   no need to handle cluster and reindex (they throw a meaningful error
>>   already due to the lack of indexes.)
>
> Create index on all partitions

That one's more complicated, per what I wrote in
https://www.postgresql.org/message-id/CA+TgmoZUwj=qynak+f7xef4w_e2g3xxdmnsnzmzjuinhrco...@mail.gmail.com

> (It also seems like wasted effort to try to remove the overhead caused
> by a parent table for partitioning. Why introduce a special case to
> save a few bytes? Premature optimization, surely?)

I don't think it's wasted effort, actually.  My concern isn't so much
the empty file on disk (which is stupid, but maybe harmless) as
eliminating the dummy scan from the query plan.  I believe the
do-nothing scan can actually be a noticeable drag on performance in
some cases - e.g. if the scan of the partitioned table is on the
inside of a nested loop, so that instead of repeatedly doing an index
scan on each of 4 partitions, you repeatedly do an index scan on each
of 4 partitions and a sequential scan of an empty table.  A zero-page
sequential scan is pretty fast, but not free.  An even bigger problem
is that the planner may think that always-empty parent can contain
some rows, throwing planner estimates off and messing up the whole
plan.  We've been living with that problem for a long time, but now
that we have an opportunity to fix it, it would be good to do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] 2 doc typos

2017-02-16 Thread Thom Brown
Hi,

Please find attached a patch to fix 2 typos.

1) s/mypubclication/mypublication/

2) Removed trailing comma from last column definition in example.

Thanks

Thom
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 59e5ad0..250806f 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -142,7 +142,7 @@ CREATE SUBSCRIPTION subscription_name
Create a subscription to a remote server that replicates tables in
-   the publications mypubclication and
+   the publications mypublication and
insert_only and starts replicating immediately on
commit:
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e0f7cd9..41c08bb 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1539,7 +1539,7 @@ CREATE TABLE measurement_year_month (
 CREATE TABLE cities (
 city_id  bigserial not null,
 name text not null,
-population   bigint,
+population   bigint
 ) PARTITION BY LIST (left(lower(name), 1));
 
 



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-16 Thread Robert Haas
On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan  wrote:
> At the risk of stating the obvious, ISTM that the right way to do
> this, at a high level, is to err on the side of unneeded extra
> unlink() calls, not leaking files. And, to make the window for problem
> ("remaining hole that you haven't quite managed to plug") practically
> indistinguishable from no hole at all, in a way that's kind of baked
> into the API.

I do not think there should be any reason why we can't get the
resource accounting exactly correct here.  If a single backend manages
to remove every temporary file that it creates exactly once (and
that's currently true, modulo system crashes), a group of cooperating
backends ought to be able to manage to remove every temporary file
that any of them create exactly once (again, modulo system crashes).

I do agree that a duplicate unlink() call isn't as bad as a missing
unlink() call, at least if there's no possibility that the filename
could have been reused by some other process, or some other part of
our own process, which doesn't want that new file unlinked.  But it's
messy.  If the seatbelts in your car were to randomly unbuckle, that
would be a safety hazard.  If they were to randomly refuse to
unbuckle, you wouldn't say "that's OK because it's not a safety
hazard", you'd say "these seatbelts are badly designed".  And I think
the same is true of this mechanism.

The way to make this 100% reliable is to set things up so that there
is joint ownership from the beginning and shared state that lets you
know whether the work has already been done.
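
One way to picture that (purely a sketch with made-up names, assuming the
participants already share a chunk of memory for this state) is a shared
flag consulted before unlinking:

typedef struct SharedFileCleanup
{
    slock_t     mutex;          /* protects 'unlinked' */
    bool        unlinked;       /* has some participant unlinked the file yet? */
} SharedFileCleanup;

/* Callable by leader and workers alike; exactly one caller does the unlink. */
static void
shared_unlink_once(SharedFileCleanup *state, const char *path)
{
    bool        do_unlink;

    SpinLockAcquire(&state->mutex);
    do_unlink = !state->unlinked;
    state->unlinked = true;
    SpinLockRelease(&state->mutex);

    if (do_unlink)
        unlink(path);
}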

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

2017-02-16 Thread Amit Kapila
On Thu, Feb 16, 2017 at 11:45 AM, Justin Workman  wrote:
>> It would help to know the data types of the columns involved in this
>> query; but just eyeballing it, it doesn't look like it involves any
>> array operations, so it's pretty hard to believe that the expanded-object
>> code could have gotten invoked intentionally.  (The mere presence of
>> an array column somewhere in the vicinity would not do that; you'd
>> need to invoke an array-ish operation, or at least pass the array into
>> a plpgsql function.)
>> If I had to bet on the basis of this much info, I would bet that the
>> parallel-query infrastructure is dropping the ball somewhere and
>> transmitting a corrupted datum that accidentally looks like it is
>> an expanded-object reference.
>> If $customer wants a quick fix, I'd suggest seeing whether disabling
>> parallel query makes the problem go away.  That might be a good first
>> step anyway, just to narrow down where the problem lies.
>> regards, tom lane
>
>
> Hi Tom,
>
> My name is Justin, and I am $customer as it were. As Peter explained, we
> haven't seen the segfaults anymore since disabling parallel queries. This
> works as a quick fix and is much appreciated! If you would still like to get
> to the bottom of this, I am willing to help out with more information as
> needed. My knowledge of PG internals is extremely limited so I don't know
> how much help I can be, but we'd like to see this resolved beyond the quick
> fix, or at least understand why it happened.
>

Thanks Justin for the confirmation.  Is it possible for you to get a
standalone test case with which we can reproduce the problem?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-16 Thread Robert Haas
On Wed, Feb 15, 2017 at 8:13 PM, Andres Freund  wrote:
> I don't think general INSERTs are safe, if you consider unique indexes
> and foreign keys (both setting xmax in the simple cases and multixacts
> are likely to be problematic).

There's no real problem with setting xmax or creating multixacts - I
think - but there's definitely a problem if an INSERT can lead to the
creation of a new combo CID, because we have no way at present of
transmitting the new combo CID mapping to the workers, and if a worker
sees a combo CID for which it doesn't have a mapping, the world blows
up.  Do you think an insert can trigger combo CID creation?

(Of course, now that we've got DSA, it wouldn't be nearly as hard to
fix the combo CID synchronization problem.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 5:07 AM, Alexander Korotkov
 wrote:
> On Wed, Feb 15, 2017 at 8:49 PM, Alvaro Herrera 
> wrote:
>> Alexander Korotkov wrote:
>>
>> > Difference between master, pgxact-align-2 and pgxact-align-3 doesn't
>> > exceed
>> > per run variation.
>>
>> FWIW this would be more visible if you added error bars to each data
>> point.  Should be simple enough in gnuplot ...
>
> Good point.
> Please find graph of mean and errors in attachment.

So ... no difference?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-16 Thread Robert Haas
On Wed, Feb 15, 2017 at 9:36 PM, Andres Freund  wrote:
> 0002-hj-add-dtrace-probes-v5.patch
>
> Hm. I'm personally very unenthusiastic about addming more of these, and
> would rather rip all of them out.  I tend to believe that static
> problems simply aren't a good approach for anything requiring a lot of
> detail.  But whatever.

I'm not a big fan of either static problems or static probes, myself.

> Isn't it kinda weird to do this from within dense_alloc()?  I mean that
> dumps a lot of data to disk, frees a bunch of memory and so on - not
> exactly what "dense_alloc" implies.  Isn't the free()ing part also
> dangerous, because the caller might actually use some of that memory,
> like e.g. in ExecHashRemoveNextSkewBucket() or such.  I haven't looked
> deeply enough to check whether that's an active bug, but it seems like
> inviting one if not.

I haven't looked at this, but one idea might be to just rename
dense_alloc() to ExecHashBlahBlahSomething().  If there's a real
abstraction layer problem here then we should definitely fix it, but
maybe it's just the angle at which you hold your head.

> To me that is a weakness in the ExecShutdownNode() API - imo child nodes
> should get the chance to shutdown before the upper-level node.
> ExecInitNode/ExecEndNode etc give individual nodes the freedom to do
> things in the right order, but ExecShutdownNode() doesn't.  I don't
> quite see why we'd want to invent a separate ExecDetachNode() that'd be
> called immediately before ExecShutdownNode().

Interestingly, the same point came up on the Parallel Bitmap Heap Scan thread.

> An easy way to change that would be to return in the
> ExecShutdownNode()'s T_GatherState case, and delegate the responsibility
> of calling it on Gather's children to ExecShutdownGather().
> Alternatively we could make it a full-blown thing like ExecInitNode()
> that every node needs to implement, but that seems a bit painful.

I was thinking we should just switch things so that ExecShutdownNode()
recurses first, and then does the current node.  There's no real
excuse for a node terminating the shutdown scan early, I think.
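
Concretely, that reordering is just the following (a sketch mirroring the
snippet quoted on the Parallel Bitmap Heap Scan thread):

bool
ExecShutdownNode(PlanState *node)
{
    if (node == NULL)
        return false;

    /* Shut the children down before their parent. */
    planstate_tree_walker(node, ExecShutdownNode, NULL);

    switch (nodeTag(node))
    {
        case T_GatherState:
            ExecShutdownGather((GatherState *) node);
            break;
        default:
            break;
    }

    return false;
}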

> Or have I missed something here?
>
> Random aside: Wondered before if having to provide all executor
> callbacks is a weakness of our executor integration, and whether it
> shouldn't be a struct of callbacks instead...

I honestly have no idea whether that would be better or worse from the
CPU's point of view.

> I think it's a bit weird that the parallel path now has an exit path
> that the non-parallel path doesn't have.

I'm not sure about this particular one, but in general those are
pretty common.  For example, look at the changes
569174f1be92be93f5366212cc46960d28a5c5cd made to _bt_first().  When
you get there, you can discover that you aren't actually the first,
and that in fact all the work is already complete, and there's nothing
left for you to do but give up.

> Hm. That seems a bit on the detailed side - if we're going that way it
> seems likely that we'll end up with hundreds of wait events. I don't
> think gradually evolving wait events into something like a query
> progress framework is a good idea.

I'm pretty strongly of the opinion that we should not reuse multiple
wait events for the same purpose.  The whole point of the wait event
system is to identify what caused the wait.  Having relatively
recently done a ton of work to separate all of the waits in the system
and identify them individually, I'm loathe to see us start melding
things back together again.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

2017-02-16 Thread Justin Workman
>
> It would help to know the data types of the columns involved in this
> query; but just eyeballing it, it doesn't look like it involves any
> array operations, so it's pretty hard to believe that the expanded-object
> code could have gotten invoked intentionally.  (The mere presence of
> an array column somewhere in the vicinity would not do that; you'd
> need to invoke an array-ish operation, or at least pass the array into
> a plpgsql function.)
> If I had to bet on the basis of this much info, I would bet that the
> parallel-query infrastructure is dropping the ball somewhere and
> transmitting a corrupted datum that accidentally looks like it is
> an expanded-object reference.
> If $customer wants a quick fix, I'd suggest seeing whether disabling
> parallel query makes the problem go away.  That might be a good first
> step anyway, just to narrow down where the problem lies.
> regards, tom lane


Hi Tom,

My name is Justin, and I am $customer as it were. As Peter explained, we
haven't seen the segfaults anymore since disabling parallel queries. This
works as a quick fix and is much appreciated! If you would still like to
get to the bottom of this, I am willing to help out with more information
as needed. My knowledge of PG internals is extremely limited so I don't
know how much help I can be, but we'd like to see this resolved beyond the
quick fix, or at least understand why it happened.

album_photo_assignments.id, album_photo_assignments.album_id,
album_photo_assignments.photo_id and albums.id are all UUID columns.
albums.deleted_at is a timestamp.

Thanks so much for your time,

Justin Workman


Re: [HACKERS] AT detach partition is broken

2017-02-16 Thread Robert Haas
On Wed, Feb 15, 2017 at 7:45 PM, Amit Langote
 wrote:
> You're right.  Took this one out (except slightly tweaking the comment) in
> the attached updated patch.

Committed after tweaking the other comment in that function.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] drop support for Python 2.3

2017-02-16 Thread Peter Eisentraut
On 2/8/17 10:35, Devrim Gündüz wrote:
> On Wed, 2017-02-08 at 09:16 -0500, Peter Eisentraut wrote:
>> It appears that we don't have anything running 2.4.  A RHEL/CentOS 5
>> system with standard components would be a good addition to the build farm.
> 
> I have CentOS 5 instances running on buildfarm. I'll register them via
> buildfarm.pg.org soon.

I will wait for that before proceeding.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] ICU integration

2017-02-16 Thread Peter Eisentraut
On 2/16/17 01:13, Peter Geoghegan wrote:
> On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
>  wrote:
>> I have changed it to use ucol_nextSortKeyPart() when appropriate.
> 
> I think it would be fine if the second last argument was
> "sizeof(Datum)", not "Min(sizeof(Datum), sss->buflen2)" here:

I don't see anything that guarantees that the buffer length is always at
least sizeof(Datum).

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Passing query string to workers

2017-02-16 Thread Kuntal Ghosh
On Thu, Feb 16, 2017 at 5:47 PM, Rafia Sabih
 wrote:
> Other that that I updated some comments and other cleanup things. Please
> find the attached patch for the revised version.
Looks good.

It has passed the regression tests (with and without regress mode).
The query now gets displayed for parallel workers in pg_stat_activity,
log statements, and failed-query statements.  Moved the status to Ready
for Committer.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [PATCH] kNN for btree

2017-02-16 Thread Alexander Korotkov
Hi, Nikita!

I have signed up as a reviewer for this patchset.  I took a first look at
these patches.
First, I'd like to note that it's very cool that you picked up this work.
I frequently hear people complain about the lack of this feature.

>        k |  kNN-btree   |  kNN-GiST    |  Opt. query   |  Seq. scan
>          |              | (btree_gist) |  with UNION   |  with sort
> ---------+--------------+--------------+---------------+------------
>        1 |  0.041     4 |  0.079     4 |   0.060     8 |  41.1 1824
>       10 |  0.048     7 |  0.091     9 |   0.097    17 |  41.8 1824
>      100 |  0.107    47 |  0.192    52 |   0.342   104 |  42.3 1824
>     1000 |  0.735   573 |  0.913   650 |   2.970  1160 |  43.5 1824
>    10000 |  5.070  5622 |  6.240  6760 |  36.300 11031 |  54.1 1824
>   100000 | 49.600 51608 | 61.900 64194 | 295.100 94980 | 115.0 1824


These results look as expected.  kNN-btree uses about half the blocks of
the UNION query, and is more than twice as fast.  Compared with kNN-GiST
there is still some win.

> 1. Introduce amcanorderbyop() function
>
> This patch transforms existing boolean AM property amcanorderbyop into a
> method
> (function pointer).  This is necessary because, unlike GiST, kNN for btree
> supports only a one ordering operator on the first index column and we
> need a
> different pathkey matching logic for btree (there was a corresponding
> comment
> in match_pathkeys_to_index()).  GiST-specific logic has been moved from
> match_pathkeys_to_index() to gistcanorderbyop().


I'm not very excited about this design of the amcanorderbyop callback.
Introducing a new callback from the index access method to the planner
should imply quite good flexibility for the future.  With this particular
callback signature I see no potential future use-cases other than your
implementation for btree.  We could just add an amcanorderbyonlyfirstop
property and that would give us the same level of flexibility, I think.
With the existing index types, we could cover many more orderings than we
currently do.  Some possible cases:
1) "ORDER BY col" for btree_gist, SP-GiST text_ops
2) "ORDER BY col1, col2 <-> const" for btree_gist
3) "ORDER BY col1, col2 <-> const" for btree

I understand that #3 is quite a hard task and I don't ask you to implement
it now.  But it would be nice if, should we some day decide to add #3, we
didn't have to change the IndexAmRoutine definition.

My idea is that we need a more general redesign of how an index specifies the
orderings it can produce.  Ideally, we should replace amcanorder, amcanbackward
and amcanorderbyop with a single callback.  Such a callback would take a list
of pathkeys and return the number of leading pathkeys the index can satisfy
(with corresponding information for the index scan).  I'm not sure that other
hackers would agree with such a design, but I'm quite convinced that we need
something with this level of extensibility.  Otherwise we will have to hack our
planner <-> index access method interface each time we decide to cover
another index-produced ordering.
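
Just to illustrate the kind of interface I have in mind (a sketch only, with
made-up names -- not a concrete proposal):

    /*
     * Return the number of leading entries in 'pathkeys' that this index can
     * produce, filling in the per-column ORDER BY information the planner
     * needs to build the index path.
     */
    typedef int (*ammatchorderby_function) (struct PlannerInfo *root,
                                            struct IndexOptInfo *index,
                                            List *pathkeys,
                                            List **orderbyclauses,
                                            List **orderbyclausecols);

The point is only that the decision moves into the access method, so covering
a new ordering wouldn't require touching the planner again.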

6. Remove duplicate distance operators from contrib/btree_gist.
>
> References to their own distance operators in btree_gist opclasses are
> replaced with references to the built-in operators and than duplicate
> operators are dropped.  But if the user is using somewhere these operators,
> upgrade of btree_gist from 1.3 to 1.4 would fail.


The query in "btree_gist--1.3--1.4.sql" which directly touches system
catalogue to update opfamilies looks too hackery.  I think we shouldn't use
such queries until we have no other choice.
In this particular case we can update opfamilies using legal mechanism
"ALTER OPERATOR FAMILY name USING index_method ADD/DROP ... " (note that
operator name could be schema-qualified if needed).  This way wouldn't be
that brief, but it is much more correct.

Also, this line caught my eye.

> info->amcanorderbyop = (void (*)()) amroutine->amcanorderbyop;

It's not necessary to use a cast here.  For instance, we don't use a cast
for amcostestimate.
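
I.e. a plain assignment should be enough here (assuming the field types
already match, as they do for amcostestimate):

    info->amcanorderbyop = amroutine->amcanorderbyop;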

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] SERIALIZABLE with parallel query

2017-02-16 Thread Thomas Munro
On Thu, Feb 16, 2017 at 2:58 AM, Robert Haas  wrote:
> On Tue, Nov 8, 2016 at 4:51 PM, Thomas Munro
>  wrote:
>> Currently we don't generate parallel plans in SERIALIZABLE.  What
>> problems need to be solved to be able to do that?  I'm probably
>> steamrolling over a ton of subtleties and assumptions here, but it
>> occurred to me that a first step might be something like this:
>>
>> 1.  Hand the leader's SERIALIZABLEXACT to workers.
>> 2.  Make sure that workers don't release predicate locks.
>> 3.  Make sure that the leader doesn't release predicate locks until
>> after the workers have finished accessing the leader's
>> SERIALIZABLEXACT.  I think this is already the case.
>
> What happens if the workers exit at the end of the query and the
> leader then goes on and executes more queries?  Will the
> worker-acquired predicate locks be retained or will they go away when
> the leader exits?

All predicate locks last at least until ReleasePredicateLocks() is run
after ProcReleaseLocks(), and sometimes longer.  Although
ReleasePredicateLocks() runs in workers too, this patch makes it
return without doing anything.  I suppose someone could say that
ReleasePredicateLocks() should at least run
hash_destroy(LocalPredicateLockHash) and set LocalPredicateLockHash to
NULL in workers.  This sort of thing could be important if we start
reusing worker processes.  I'll do that in the next version.
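
Roughly like this, just as a sketch (exact placement inside
ReleasePredicateLocks() still to be decided):

    if (IsParallelWorker())
    {
        /*
         * Leave the shared SERIALIZABLEXACT for the leader to release, but
         * drop this worker's backend-local cache so a reused worker can't
         * see stale entries.
         */
        hash_destroy(LocalPredicateLockHash);
        LocalPredicateLockHash = NULL;
        return;
    }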

The predicate locks themselves consist of state in shared memory, and
those created by workers are indistinguishable from those created by
the leader process.  Having multiple workers and the leader all
creating predicate locks linked to the same SERIALIZABLEXACT is
*almost* OK, because most relevant shmem state is protected by locks
already in all paths (with the exception of the DOOMED flag already
mentioned, which seems to follow a "notice me as soon as possible"
philosophy, to avoid putting locking into the
CheckForSerializableConflict(In|Out) paths, with a definitive check at
commit time).

But... there is a problem with the management of the linked list of
predicate locks held by a transactions.  The locks themselves are
covered by partition locks, but the links are not, and that previous
patch broke the assumption that they could never be changed by another
process.

Specifically, DeleteChildTargetLocks() assumes it can walk
MySerializableXact->predicateLocks and throw away locks that are
covered by a new lock (ie throw away tuple locks because a covering
page lock has been acquired) without let or hindrance until it needs
to modify the locks themselves.  That assumption doesn't hold up with
that last patch and will require a new kind of mutual exclusion.  I
wonder if the solution is to introduce an LWLock into each
SERIALIZABLEXACT object, so DeleteChildTargetLocks() can prevent
workers from stepping on each others' toes during lock cleanup.  An
alternative would be to start taking SerializablePredicateLockListLock
in exclusive rather than shared mode, but that seems unnecessarily
coarse.
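
To sketch what I mean (the field name is invented and this isn't a finished
design): SERIALIZABLEXACT would grow a new lock,

    LWLock      predicateLockListLock;  /* protects predicateLocks list */

and DeleteChildTargetLocks() would wrap its walk of
MySerializableXact->predicateLocks in it:

    LWLockAcquire(&MySerializableXact->predicateLockListLock, LW_EXCLUSIVE);
    /* ... existing loop dropping tuple locks covered by the new page lock ... */
    LWLockRelease(&MySerializableXact->predicateLockListLock);

That would keep workers sharing the leader's SERIALIZABLEXACT from modifying
the list while another backend is traversing it.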

I have a patch that implements the above but I'm still figuring out
how to test it, and I'll need to do a bit more poking around for other
similar assumptions before I post a new version.

I tried to find any way that LocalPredicateLockHash could create
problems, but it's effectively a cache with
false-negatives-but-never-false-positives semantics.  In cache-miss
scenarios it we look in shmem data structures and are prepared to find
that our SERIALIZABLEXACT already has the predicate lock even though
there was a cache miss in LocalPredicateLockHash.  That works because
our SERIALIZABLEXACT's address is part of the tag, and it's stable
across backends.

Random observation:  The global variable MyXactDidWrite would probably
need to move into shared memory when parallel workers eventually learn
to write.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-02-16 Thread Robert Haas
On Wed, Feb 15, 2017 at 3:33 PM, David G. Johnston
 wrote:
> The description in the OP basically distinguishes between NORMALIZE and
> ALIGN in that ALIGN, as described above, affects an INTERSECTION on the two
> ranges - discarding the non-overlapping data - while NORMALIZE performs the
> alignment while also retaining the non-overlapping data.

Hmm.  So is ALIGN like an inner join and NORMALIZE like a full outer
join?  Couldn't you have left and right outer joins, too, like you
null-extend the parts of the lefthand range that don't match but throw
away parts of the righthand range that don't match?

Also, it sounds like all of this is intended to work with ranges that
are stored in different columns rather than with PostgreSQL's built-in
range types.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Passing query string to workers

2017-02-16 Thread Rafia Sabih
On Thu, Feb 16, 2017 at 5:06 PM, Kuntal Ghosh 
wrote:
>
> >>
> >> Another question is don't we need to set debug_query_string in worker?
> >
> > In the updated version I am setting it in ParallelQueryMain.
> >
> Ahh, I missed that. debug_query_string is used to show the log
> statements. Hence, it should be set.
>
> +   queryString = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT);
> +   debug_query_string = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT);
> +   pgstat_report_activity(STATE_RUNNING, shm_toc_lookup(toc,
> PARALLEL_KEY_QUERY_TEXT));
> Just one lookup is sufficient.
>
> Fixed.

Other than that, I updated some comments and other cleanup things.  Please
find attached the revised version of the patch.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


pass_queryText_to_workers_v5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous execution

2017-02-16 Thread Kyotaro HORIGUCHI
Thank you very much for testing this!

At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote  
wrote in <9058d70b-a6b0-8b3c-091a-fe77ed0df...@lab.ntt.co.jp>
> Horiguchi-san,
> 
> On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> > I noticed that this patch is conflicting with 665d1fa (Logical
> > replication) so I rebased this. Only executor/Makefile
> > conflicted.
> 
> With the latest set of patches, I observe a crash due to an Assert failure:
> 
> #3  0x006883ed in ExecAsyncEventWait (estate=0x13c01b8,
> timeout=-1) at execAsync.c:345

This means that no pending FDW scan ever let itself go to the waiting
stage, which makes the whole thing get stuck.  This can happen if
nothing is actually waiting for a result.  I suppose that all of the
foreign scans ran on the same connection.  Anyway, it must be a
mistake in the state transition.  I'll look into it.

> I was running a query whose plan looked like:
> 
> explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
> group by 1,2 order by 1;
>   QUERY PLAN
> --
>  Sort
>Sort Key: ((ptab.tableoid)::regclass)
>->  HashAggregate
>  Group Key: (ptab.tableoid)::regclass, ptab.a
>  ->  Result
>->  Append
>  ->  Foreign Scan on ptab_1
>  ->  Foreign Scan on ptab_2
>  ->  Foreign Scan on ptab_3
>  ->  Foreign Scan on ptab_4
>  ->  Foreign Scan on ptab_5
>  ->  Foreign Scan on ptab_6
>  ->  Foreign Scan on ptab_7
>  ->  Foreign Scan on ptab_8
>  ->  Foreign Scan on ptab_9
>  ->  Foreign Scan on ptab_00010
>
> 
> The snipped part contains Foreign Scans on 90 more foreign partitions (in
> fact, I could see the crash even with 10 foreign table partitions for the
> same query).

Yeah, it seems to me to be unrelated to how many there are.

> There is a crash in one more case, which seems related to how WaitEventSet
> objects are manipulated during resource-owner-mediated cleanup of a failed
> query, such as after the FDW returned an error like below:
> 
> ERROR:  relation "public.ptab_00010" does not exist
> CONTEXT:  Remote SQL command: SELECT a, b FROM public.ptab_00010
> 
> The backtrace in this looks like below:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> 301   lastidx = resarr->lastidx;
> (gdb)
> (gdb) bt
> #0  0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> #1  0x009c6578 in ResourceOwnerForgetWES
> (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
> #2  0x00806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> #3  0x009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
> at resowner.c:566
> #4  0x009c5155 in ResourceOwnerRelease (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
> '\001') at resowner.c:485
> #5  0x00524172 in AbortTransaction () at xact.c:2588
> #6  0x00524854 in AbortCurrentTransaction () at xact.c:3016
> #7  0x00836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
> dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
> #8  0x007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
> #9  0x007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
> #10 0x007a0885 in ServerLoop () at postmaster.c:1722
> #11 0x0079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
> postmaster.c:1330
> #12 0x006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228
> 
> There is a segfault when accessing the events variable, whose members seem
> to be pfreed:
> 
> (gdb) f 2
> #2  0x00806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> 600   ResourceOwnerForgetWES(set->resowner, set);
> (gdb) p *set
> $5 = {
>   nevents = 2139062143,
>   nevents_space = 2139062143,
>   resowner = 0x7f7f7f7f7f7f7f7f,
>   events = 0x7f7f7f7f7f7f7f7f,
>   latch = 0x7f7f7f7f7f7f7f7f,
>   latch_pos = 2139062143,
>   epoll_fd = 2139062143,
>   epoll_ret_events = 0x7f7f7f7f7f7f7f7f
> }

Mmm, I reproduced it quite easily.  A silly bug.

Something bad is happening between the freeing of the ExecutorState memory
context and the resource owner.  Perhaps the ExecutorState is freed by the
resowner (as a part of its ancestors) before the memory for the
WaitEventSet is freed.  It was careless of me.  I'll reconsider it.

Great thanks for the report.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Simon Riggs
On 14 February 2017 at 11:48, Thomas Munro
 wrote:
> On Wed, Feb 1, 2017 at 5:21 PM, Michael Paquier
>  wrote:
>> On Sat, Jan 21, 2017 at 10:49 AM, Thomas Munro
>>  wrote:
>>> Ok.  I see that there is a new compelling reason to move the ring
>>> buffer to the sender side: then I think lag tracking will work
>>> automatically for the new logical replication that just landed on
>>> master.  I will try it that way.  Thanks for the feedback!
>>
>> Seeing no new patches, marked as returned with feedback. Feel free of
>> course to refresh the CF entry once you have a new patch!
>
> Here is a new version with the buffer on the sender side as requested.
> Since it now shows write, flush and replay lag, not just replay, I
> decide to rename it and start counting versions at 1 again.
> replication-lag-v1.patch is less than half the size of
> replay-lag-v16.patch and considerably simpler.  There is no more GUC
> and no more protocol change.
>
> While the write and flush locations are sent back at the right times
> already, I had to figure out how to get replies to be sent at the
> right time when WAL was replayed too.  Without doing anything special
> for that, you get the following cases:
>
> 1.  A busy system: replies flow regularly due to write and flush
> feedback, and those replies include replay position, so there is no
> problem.
>
> 2.  A system that has just streamed a lot of WAL causing the standby
> to fall behind in replaying, but the primary is now idle:  there will
> only be replies every 10 seconds (wal_receiver_status_interval), so
> pg_stat_replication.replay_lag only updates with that frequency.
> (That was already the case for replay_location).
>
> 3.  An idle system that has just replayed some WAL and is now fully
> caught up.  There is no reply until the next
> wal_receiver_status_interval; so now replay_lag shows a bogus number
> over 10 seconds.  Oops.
>
> Case 1 is good, and I suppose that 2 is OK, but I needed to do
> something about 3.  The solution I came up with was to force one reply
> to be sent whenever recovery runs out of WAL to replay and enters
> WaitForWALToBecomeAvailable().  This seems to work pretty well in
> initial testing.
>
> Thoughts?

Feeling happier about this for now at least.

I think we need to document how this works more in README or header
comments. That way I can review it against what it aims to do rather
than what I think it might do.

e.g. We need to document what replay_lag represents. Does it include
write_lag and flush_lag, or is it the time since the flush_lag. i.e.
do I add all 3 together to get the full lag, or would that cause me to
double count?

How sensitive is this? Does the lag spike quickly and then disappear
again quickly? If we're sampling this every N seconds, will we get a
realistic viewpoint or just a random sample? Should we smooth the
value, or present peak info?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Passing query string to workers

2017-02-16 Thread Kuntal Ghosh
On Sat, Feb 11, 2017 at 8:38 AM, Rafia Sabih
 wrote:
>
>>
>> Another question is don't we need to set debug_query_string in worker?
>
> In the updated version I am setting it in ParallelQueryMain.
>
Ahh, I missed that. debug_query_string is used to show the log
statements. Hence, it should be set.

> Please find the attached file for the revised version.
>
+   queryString = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT);
+   debug_query_string = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT);
+   pgstat_report_activity(STATE_RUNNING, shm_toc_lookup(toc,
PARALLEL_KEY_QUERY_TEXT));
Just one lookup is sufficient.
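
That is, something like this (just a sketch):

    char   *queryString = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT);

    debug_query_string = queryString;
    pgstat_report_activity(STATE_RUNNING, queryString);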

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioned tables and relfilenode

2017-02-16 Thread Simon Riggs
On 10 February 2017 at 06:19, Amit Langote
 wrote:

>  the "right thing" here being that the
> command's code either throws an error or warning (in some cases) if the
> specified table is a partitioned table or ignores any partitioned tables
> when it reads the list of relations to process from pg_class.

This is a massive assumption and deserves major discussion.

My expectation is that "partitioned tables" are "tables". Anything
else seems to fly in the face of both the SQL Standard and the POLA
principle for users coming from other database systems.

IMHO all the main actions should "just work", not throw errors.

> Commands
> that need to be taught about this are vacuum, analyze, truncate, and alter
> table.  Specifically:
>
> - In case of vacuum, specifying a partitioned table causes a warning

Why not vacuum all partitions?

> - In case of analyze, we do not throw an error or warning but simply
>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>   acquire_inherited_sample_rows(), any partitioned tables in the list
>   returned by find_all_inheritors() are skipped.

Why not analyze all partitions?

> - In case of truncate, only the part which manipulates table's physical
>   storage is skipped for partitioned tables.

Truncate all partitions

> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned
>   tables, because there is nothing to be done.
>
> - Since we cannot create indexes on partitioned tables anyway, there is
>   no need to handle cluster and reindex (they throw a meaningful error
>   already due to the lack of indexes.)

Create index on all partitions

> Thoughts, comments?

We already have Inheritance. Anybody that likes that behaviour can use it.

Most people don't like that behaviour and wish to see change. They
want the same behaviour as they get from other database products where
a partitioned table is a table and you can do stuff to it just like
other tables. We have the opportunity to change things here and we
should do so.

(It also seems like wasted effort to try to remove the overhead caused
by a parent table for partitioning. Why introduce a special case to
save a few bytes? Premature optimization, surely?)

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-16 Thread Simon Riggs
On 15 February 2017 at 08:07, Masahiko Sawada  wrote:
>
> It's a bug. Attached latest version patch, which passed make check.

In its current form, I'm not sure this is a good idea. Problems...

1. I'm pretty sure the world doesn't need another VACUUM parameter

I suggest that we use the existing vacuum scale factor/4 to reflect
that indexes are more sensitive to bloat.

2. The current btree vacuum code requires 2 vacuums to fully reuse
half-dead pages. So skipping an index vacuum might mean that second
index scan never happens at all, which would be bad.

I suggest that we store the number of half-dead pages in the metapage
after each VACUUM, so we can decide whether to skip the scan or not.
And we use some math like each half-dead page that needs to be reused
is worth 250 index entries, so the decision to skip is based upon rows
and empty pages, not just recently vacuumed rows.
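
As a sketch of the arithmetic (the names and the 250-entry weighting are only
illustrative, nothing here is settled):

    #define HALF_DEAD_PAGE_WORTH    250     /* index entries per half-dead page */

    static bool
    btvacuum_can_skip_cleanup(double reltuples, double scale_factor,
                              BlockNumber half_dead_pages, double dead_tuples)
    {
        double  threshold = (scale_factor / 4) * reltuples;
        double  pending = dead_tuples +
                          (double) half_dead_pages * HALF_DEAD_PAGE_WORTH;

        return pending < threshold;
    }

So the scan would only be skipped while the recently dead rows plus the
backlog of half-dead pages recorded in the metapage stay below the
(scaled-down) vacuum threshold.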

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-02-16 Thread Greg Stark
On 13 February 2017 at 12:01, Amit Khandekar  wrote:
> There are a few things that can be discussed about :

If you do a normal update the new tuple is linked to the old one using
the ctid forming a chain of tuple versions. This tuple movement breaks
that chain.  So the question I had reading this proposal is what
behaviour depends on ctid and how is it affected by the ctid chain
being broken.

I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.

At least you do that if you have isolation_level set to
repeatable_read. If you have isolation level set to serializable then
you just fail with a serialization failure. I think that's what you
should do if you come across a row that's been updated with a broken
ctid chain even in repeatable read mode. Just fail with a
serialization failure and document that in partitioned tables if you
perform updates that move tuples between partitions then you need to
ensure your updates are prepared for serialization failures.

I think this would require another bit in the tuple info mask
indicating that this tuple is the last version before a broken ctid
chain -- i.e. that it was updated by moving it to another partition.
Maybe there's some combination of bits you could use though since this
is only needed in a particular situation.
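
Purely as an illustration (the bit value and the name are invented here; a
real patch would need to find a genuinely spare bit or reuse a combination of
bits as suggested above):

    #define HEAP_MOVED_OFF_PARTITION    0x0800  /* hypothetical t_infomask2 flag */

    if (oldtup->t_data->t_infomask2 & HEAP_MOVED_OFF_PARTITION)
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("tuple to be updated was moved to another partition")));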

Offhand I don't know what other behaviours are dependent on the ctid
chain. I think you need to go search the docs -- and probably the code
just to be sure -- for any references to ctid to ensure you catch
every impact of breaking the ctid chain.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Index-only scan

2017-02-16 Thread Rafia Sabih
On Thu, Feb 16, 2017 at 3:40 PM, Rafia Sabih 
wrote:

>
> On Thu, Feb 16, 2017 at 1:26 PM, Rahila Syed 
> wrote:
>
>> I  reviewed the patch. Overall it looks fine to me.
>>
>> One comment,
>>
>> >-   if (index->amcanparallel &&
>> >-   !index_only_scan &&
>> >+   if ((index->amcanparallel ||
>> >+   index_only_scan) &&
>>
>> Why do we need to check for index_only_scan in the above condition. IIUC,
>> index->amcanparallel is necessary for
>> parallel index scan to be chosen. We cannot chose parallel index only
>> scan if index_only_scan is happening
>> without worrying about index->amcanparallel value. So OR-ing
>> index->amcanparallel with index_only_scan is probably not
>> correct.
>>
>> True, we do not need this, only removing !index_only_scan should work.
> Fixed
>
>>
>>
>> On Thu, Feb 16, 2017 at 1:06 AM, Robert Haas 
>> wrote:
>>>
>>>
>>> This again needs minor rebasing but basically looks fine.  It's a
>>> pretty straightforward extension of the parallel index scan work.
>>>
>>> Please make sure that this is pgindent-clean - i.e. that when you
>>> pgindent the files that it touches, pgindent doesn't change anything
>>> of the same parts of the file that you've changed in the patch.  Also,
>>> I believe Amit may have made some adjustments to the logic in
>>> nodeIndexScan.c; if so, it would be good to make sure that the
>>> nodeIndexOnlyScan.c changes match what was done there.  In particular,
>>> he's got this:
>>>
>>> if (reset_parallel_scan && node->iss_ScanDesc->parallel_s
>>> can)
>>> index_parallelrescan(node->iss_ScanDesc);
>>>
>>> And you've got this:
>>>
>>> +   if (reset_parallel_scan)
>>> +   index_parallelrescan(node->ioss_ScanDesc);
>>>
>>
> Fixed.
> Please find the attached patch for rebased and cleaner version.
>
Please find the attached patch with a minor comment update.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


parallel_index_only_v7.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Abhijit Menon-Sen
Hi Thomas.

At 2017-02-15 00:48:41 +1300, thomas.mu...@enterprisedb.com wrote:
>
> Here is a new version with the buffer on the sender side as requested.

This looks good.

> + write_lag
> + interval
> + Estimated time taken for recent WAL records to be written on this
> +  standby server

I think I would find a slightly more detailed explanation helpful here.

A few tiny nits:

> +  * If the lsn hasn't advanced since last time, then do nothing.  This 
> way
> +  * we only record a new sample when new WAL has been written, which is
> +  * simple proxy for the time at which the log was written.

"which is simple" → "which is a simple"

> +  * If the buffer if full, for now we just rewind by one slot and 
> overwrite
> +  * the last sample, as a simple if somewhat uneven way to lower the
> +  * sampling rate.  There may be better adaptive compaction algorithms.

"buffer if" → "buffer is"

> + * Find out how much time has elapsed since WAL position 'lsn' or earlier was
> + * written to the lag tracking buffer and 'now'.  Return -1 if no time is
> + * available, and otherwise the elapsed time in microseconds.

Find out how much time has elapsed "between X and 'now'", or "since X".
(I prefer the former, i.e., s/since/between/.)

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Index-only scan

2017-02-16 Thread Rafia Sabih
On Thu, Feb 16, 2017 at 1:26 PM, Rahila Syed  wrote:

> I  reviewed the patch. Overall it looks fine to me.
>
> One comment,
>
> >-   if (index->amcanparallel &&
> >-   !index_only_scan &&
> >+   if ((index->amcanparallel ||
> >+   index_only_scan) &&
>
> Why do we need to check for index_only_scan in the above condition. IIUC,
> index->amcanparallel is necessary for
> parallel index scan to be chosen. We cannot chose parallel index only scan
> if index_only_scan is happening
> without worrying about index->amcanparallel value. So OR-ing
> index->amcanparallel with index_only_scan is probably not
> correct.
>
> True, we do not need this, only removing !index_only_scan should work.
Fixed

>
>
> On Thu, Feb 16, 2017 at 1:06 AM, Robert Haas 
> wrote:
>>
>>
>> This again needs minor rebasing but basically looks fine.  It's a
>> pretty straightforward extension of the parallel index scan work.
>>
>> Please make sure that this is pgindent-clean - i.e. that when you
>> pgindent the files that it touches, pgindent doesn't change anything
>> of the same parts of the file that you've changed in the patch.  Also,
>> I believe Amit may have made some adjustments to the logic in
>> nodeIndexScan.c; if so, it would be good to make sure that the
>> nodeIndexOnlyScan.c changes match what was done there.  In particular,
>> he's got this:
>>
>> if (reset_parallel_scan && node->iss_ScanDesc->parallel_s
>> can)
>> index_parallelrescan(node->iss_ScanDesc);
>>
>> And you've got this:
>>
>> +   if (reset_parallel_scan)
>> +   index_parallelrescan(node->ioss_ScanDesc);
>>
>
Fixed.
Please find the attached patch for rebased and cleaner version.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


parallel_index_only_v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-02-16 Thread Amit Khandekar
On 16 February 2017 at 14:42, Amit Langote
 wrote:
> On 2017/02/16 17:55, Amit Khandekar wrote:
>> On 16 February 2017 at 12:57, Amit Langote wrote:
>>> On 2017/02/16 15:50, Amit Khandekar wrote:
 On 15 February 2017 at 20:26, David Fetter  wrote:
> Does that make sense, and if so, is it super invasive to HINT that?

 Yeah, I think it should be possible to find the root partition with
>>>
>>> I assume you mean root *partitioned* table.
>>>
 the help of pg_partitioned_table,
>>>
>>> The pg_partitioned_table catalog does not store parent-child
>>> relationships, just information about the partition key of a table.  To
>>> get the root partitioned table, you might want to create a recursive
>>> version of get_partition_parent(), maybe called
>>> get_partition_root_parent().  By the way, get_partition_parent() scans
>>> pg_inherits to find the inheritance parent.
>>
>> Yeah. But we also want to make sure that it's a part of declarative
>> partition tree, and not just an inheritance tree ? I am not sure
>> whether it is currently possible to have a mix of these two. May be it
>> is easy to prevent that from happening.
>
> It is not possible to mix declarative partitioning and regular
> inheritance.  So, you cannot have a table in a declarative partitioning
> tree that is not a (sub-) partition of the root table.

Ok, then that makes things easy.

>
 and then run ExecFindPartition()
 again using the root. Will check. I am not sure right now how involved
 that would turn out to be, but I think that logic would not change the
 existing code, so in that sense it is not invasive.
>>>
>>> I couldn't understand why run ExecFindPartition() again on the root
>>> partitioned table, can you clarify?  ISTM, we just want to tell the user
>>> in the HINT that trying the same update query with root partitioned table
>>> might work. I'm not sure if it would work instead to find some
>>> intermediate partitioned table (that is, between the root and the one that
>>> update query was tried with) to include in the HINT.
>>
>> What I had in mind was : Give that hint only if there *was* a
>> subpartition that could accommodate that row. And if found, we can
>> only include the subpartition name.
>
> Asking to try the update query with the root table sounds like a good
> enough hint.  Trying to find the exact sub-partition (I assume you mean to
> imply sub-tree here) seems like an overkill, IMHO.
Yeah ... I was thinking, anyway it's an error condition, so why not
let the server spend a bit more CPU and get the right sub-partition
for the message.  If we decide to write code to find the root
partition, then it's just a matter of another ExecFindPartition()
call.
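
Finding the root could look roughly like this (only a sketch;
rel_is_partition() stands in for whatever check we settle on, e.g. looking at
pg_class.relispartition):

    static Oid
    get_partition_root_parent(Oid relid)
    {
        Oid     current = relid;

        while (rel_is_partition(current))
            current = get_partition_parent(current);

        return current;
    }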

Also, I was thinking : give the hint *only* if we know there is a
right sub-partition. Otherwise, it might distract the user.

>
> Thanks,
> Amit
>
>



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-02-16 Thread Kuntal Ghosh
On Mon, Feb 6, 2017 at 11:09 PM, Beena Emerson  wrote:
>
> Hello,
> PFA the updated patches.
I've started reviewing the patches.
01-add-XLogSegmentOffset-macro.patch looks clean to me. I'll post my
detailed review after looking into the second patch. But both
patches need a rebase on top of commit 85c11324cabaddcfaf3347df7
(Rename user-facing tools with "xlog" in the name to say "wal").

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-16 Thread Bernd Helmle
Am Dienstag, den 14.02.2017, 15:53 +0300 schrieb Alexander Korotkov:
> +1
> And you could try to use pg_wait_sampling
>  to sampling of wait
> events.

Okay, I'm going to try this. Currently Tomas' scripts are still
running; I'll provide updates as soon as they are finished.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-16 Thread Erik Rijkers

On 2017-02-16 00:43, Petr Jelinek wrote:

On 13/02/17 14:51, Erik Rijkers wrote:

On 2017-02-11 11:16, Erik Rijkers wrote:

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


This often works but it also fails far too often (in my hands).  I


Could you periodically dump contents of the pg_subscription_rel on
subscriber (ideally when dumping the md5 of the data) and attach that 
as

well?


I attach a bash script (and its output) that polls the 4 pgbench tables' 
md5s and the pg_subscription_rel table, each second, while I run the 
pgbench_derail2.sh (for that see my earlier mail).


pgbench_derail2.sh writes a 'header' into the same output stream (search 
for '^===' ).


The .out file reflects a session where I started pgbench_derail2.sh 
twice (it removes the publication and subscription at startup).  So 
there are 2 headers in the attached cb_20170216_10_04_47.out.  The first 
run ended in a successful replication (= all 4 pgbench tables 
md5-identical).  The second run does not end correctly: it has (one of) 
the typical faulty end-states: pgbench_accounts, the copy, has a few 
fewer rows than the master table.


Other typical endstates are:
same number of rows but content not identical (for some, typically < 20 
rows).

mostly pgbench_accounts and pgbench_history are affected.

(I see now that I made some mistakes in generating the timestamps in the 
.out file but I suppose it doesn't matter too much)


I hope it helps; let me know if I can do any other test(s).

20170216_10_04_49_1487  6972 a,b,t,h: 10  1 10776   24be8c7be  
cf860f1f2  aed87334f  f2bfaa587   master
20170216_10_04_50_1487  6973 a,b,t,h:  6  1 10776   74cd7528c  
cf860f1f2  aed87334f  f2bfaa587   replica NOK
  now  | srsubid | srrelid | srsubstate | srsublsn 
---+-+-++--
 2017-02-16 10:04:50.242616+01 |   25398 |   25375 | r  | 
 2017-02-16 10:04:50.242616+01 |   25398 |   25378 | r  | 
 2017-02-16 10:04:50.242616+01 |   25398 |   25381 | r  | 
 2017-02-16 10:04:50.242616+01 |   25398 |   25386 | r  | 
(4 rows)

20170216_10_04_51_1487  6972 a,b,t,h: 10  1 10776   24be8c7be  
cf860f1f2  aed87334f  f2bfaa587   master
20170216_10_04_51_1487  6973 a,b,t,h:  6  1 10776   74cd7528c  
cf860f1f2  aed87334f  f2bfaa587   replica NOK
  now  | srsubid | srrelid | srsubstate | srsublsn 
---+-+-++--
 2017-02-16 10:04:51.945931+01 |   25398 |   25375 | r  | 
 2017-02-16 10:04:51.945931+01 |   25398 |   25378 | r  | 
 2017-02-16 10:04:51.945931+01 |   25398 |   25381 | r  | 
 2017-02-16 10:04:51.945931+01 |   25398 |   25386 | r  | 
(4 rows)


--     20170216 10:04:S
-- scale  1 clients  1   INIT_WAIT  0
-- 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin.fast/postgres
20170216_10_04_53_1487  6972 a,b,t,h: 10  1 10776   24be8c7be  
cf860f1f2  aed87334f  f2bfaa587   master
20170216_10_04_53_1487  6973 a,b,t,h:  6  1 10776   74cd7528c  
cf860f1f2  aed87334f  f2bfaa587   replica NOK
  now  | srsubid | srrelid | srsubstate | srsublsn 
---+-+-++--
 2017-02-16 10:04:53.635163+01 |   25398 |   25375 | r  | 
 2017-02-16 10:04:53.635163+01 |   25398 |   25378 | r  | 
 2017-02-16 10:04:53.635163+01 |   25398 |   25381 | r  | 
 2017-02-16 10:04:53.635163+01 |   25398 |   25386 | r  | 
(4 rows)

20170216_10_04_54_1487  6972 a,b,t,h:  0  0  0  0   24be8c7be  
d41d8cd98  d41d8cd98  d41d8cd98   master
20170216_10_04_55_1487  6973 a,b,t,h:  0  0  0  0   d41d8cd98  
d41d8cd98  d41d8cd98  d41d8cd98   replica NOK
 now | srsubid | srrelid | srsubstate | srsublsn 
-+-+-++--
(0 rows)

20170216_10_04_56_1487  6972 a,b,t,h: 10  1 10  0   68d91d95b  
6c4f8b9aa  92162c9b8  d41d8cd98   master
20170216_10_04_56_1487  6973 a,b,t,h:  0  0  0  0   d41d8cd98  
d41d8cd98  d41d8cd98  d41d8cd98   replica NOK
 now | srsubid | srrelid | srsubstate | srsublsn 
-+-+-++--
(0 rows)

20170216_10_04_57_1487  6972 a,b,t,h: 10  1 10  1   68d91d95b  
6c4f8b9aa  92162c9b8  d41d8cd98   master
20170216_10_04_58_1487  6973 a,b,t,h:  0  0  0  0   d41d8c

Re: [HACKERS] UPDATE of partition key

2017-02-16 Thread Amit Langote
On 2017/02/16 17:55, Amit Khandekar wrote:
> On 16 February 2017 at 12:57, Amit Langote wrote:
>> On 2017/02/16 15:50, Amit Khandekar wrote:
>>> On 15 February 2017 at 20:26, David Fetter  wrote:
 Does that make sense, and if so, is it super invasive to HINT that?
>>>
>>> Yeah, I think it should be possible to find the root partition with
>>
>> I assume you mean root *partitioned* table.
>>
>>> the help of pg_partitioned_table,
>>
>> The pg_partitioned_table catalog does not store parent-child
>> relationships, just information about the partition key of a table.  To
>> get the root partitioned table, you might want to create a recursive
>> version of get_partition_parent(), maybe called
>> get_partition_root_parent().  By the way, get_partition_parent() scans
>> pg_inherits to find the inheritance parent.
> 
> Yeah. But we also want to make sure that it's a part of declarative
> partition tree, and not just an inheritance tree ? I am not sure
> whether it is currently possible to have a mix of these two. May be it
> is easy to prevent that from happening.

It is not possible to mix declarative partitioning and regular
inheritance.  So, you cannot have a table in a declarative partitioning
tree that is not a (sub-) partition of the root table.

>>> and then run ExecFindPartition()
>>> again using the root. Will check. I am not sure right now how involved
>>> that would turn out to be, but I think that logic would not change the
>>> existing code, so in that sense it is not invasive.
>>
>> I couldn't understand why run ExecFindPartition() again on the root
>> partitioned table, can you clarify?  ISTM, we just want to tell the user
>> in the HINT that trying the same update query with root partitioned table
>> might work. I'm not sure if it would work instead to find some
>> intermediate partitioned table (that is, between the root and the one that
>> update query was tried with) to include in the HINT.
> 
> What I had in mind was : Give that hint only if there *was* a
> subpartition that could accommodate that row. And if found, we can
> only include the subpartition name.

Asking to try the update query with the root table sounds like a good
enough hint.  Trying to find the exact sub-partition (I assume you mean to
imply sub-tree here) seems like an overkill, IMHO.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-02-16 Thread Amit Khandekar
On 16 February 2017 at 12:57, Amit Langote
 wrote:
> On 2017/02/16 15:50, Amit Khandekar wrote:
>> On 15 February 2017 at 20:26, David Fetter  wrote:
>>> When an UPDATE can't happen, there are often ways to hint at
>>> what went wrong and how to correct it.  Violating a uniqueness
>>> constraint would be one example.
>>>
>>> When an UPDATE can't happen and the depth of the subtree is a
>>> plausible candidate for what prevents it, there might be a way to say
>>> so.
>>>
>>> Let's imagine a table called log with partitions on "stamp" log_
>>> and subpartitions, also on "stamp", log_MM.  If you do something
>>> like
>>>
>>> UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>>>
>>> it's possible to know that it might have worked had the UPDATE taken
>>> place on log rather than on log_2017.
>>>
>>> Does that make sense, and if so, is it super invasive to HINT that?
>>
>> Yeah, I think it should be possible to find the root partition with
>
> I assume you mean root *partitioned* table.
>
>> the help of pg_partitioned_table,
>
> The pg_partitioned_table catalog does not store parent-child
> relationships, just information about the partition key of a table.  To
> get the root partitioned table, you might want to create a recursive
> version of get_partition_parent(), maybe called
> get_partition_root_parent().  By the way, get_partition_parent() scans
> pg_inherits to find the inheritance parent.

Yeah. But we also want to make sure that it's part of a declarative
partition tree, and not just an inheritance tree?  I am not sure
whether it is currently possible to have a mix of these two.  Maybe it
is easy to prevent that from happening.

>
>> and then run ExecFindPartition()
>> again using the root. Will check. I am not sure right now how involved
>> that would turn out to be, but I think that logic would not change the
>> existing code, so in that sense it is not invasive.
>
> I couldn't understand why run ExecFindPartition() again on the root
> partitioned table, can you clarify?  ISTM, we just want to tell the user
> in the HINT that trying the same update query with root partitioned table
> might work. I'm not sure if it would work instead to find some
> intermediate partitioned table (that is, between the root and the one that
> update query was tried with) to include in the HINT.

What I had in mind was: give that hint only if there *was* a
subpartition that could accommodate that row.  And if one is found, we
can include just the subpartition name.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-16 Thread Kuntal Ghosh
On Mon, Feb 13, 2017 at 1:01 PM, Masahiko Sawada  wrote:

Thanks for the explanation. I've looked into the referred code. I'm
still in doubt. vacuumed_pages is incremented only when there are no
indexes, i.e. nindexes=0. Now, look at the following part in the
patch.

+   /*
+* Do post-vacuum cleanup and statistics update for each index if
+* the number of vacuumed page exceeds threshold.
+*/
+   cleanupidx_thresh = (float4) nblocks * vacuum_cleanup_index_scale;
+
+   elog(DEBUG3, "%s: vac: %d (threshold %0.f)",
+RelationGetRelationName(onerel), nblocks, cleanupidx_thresh);
+   if (vacuumed_pages >= cleanupidx_thresh)
+   {
+   for (i = 0; i < nindexes; i++)
+   lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
+   }
So, unless vacuum_cleanup_index_scale_thresh is zero,
lazy_cleanup_index will never be called.  IMO, this seems to be
incorrect.  Besides, I've tested with a non-zero (0.5)
vacuum_cleanup_index_scale_thresh and the regression tests for brin
and gin fail. (make installcheck)

+   {"vacuum_cleanup_index_scale_factor", PGC_USERSET,
CLIENT_CONN_STATEMENT,
+gettext_noop("Number of pages containing dead tuple
prior to vacuum as a fraction of relpages."),
+   NULL
+   },
+   &vacuum_cleanup_index_scale,
+   0.0, 0.0, 100.0,
+   NULL, NULL, NULL
+   },
Maximum value for vacuum_cleanup_index_scale_factor should be 1
instead of 100. As the code indicates, it is certainly not treated as
a percentage fraction of relpages.
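
I.e. the guc.c entry would look more like this (sketch, with only the upper
bound changed):

    {
        {"vacuum_cleanup_index_scale_factor", PGC_USERSET, CLIENT_CONN_STATEMENT,
            gettext_noop("Number of pages containing dead tuples prior to vacuum as a fraction of relpages."),
            NULL
        },
        &vacuum_cleanup_index_scale,
        0.0, 0.0, 1.0,
        NULL, NULL, NULL
    },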

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2017-02-16 Thread Alvaro Herrera
Peter Eisentraut wrote:

> - Naming of collations:  Are we happy with the "de%icu" naming?  I might
> have come up with that while reviewing the IPv6 zone index patch. ;-)
> An alternative might be "de$icu" for more Oracle vibe and avoiding the
> need for double quotes in some cases.  (But we have mixed-case names
> like "de_AT%icu", so further changes would be necessary to fully get rid
> of the need for quoting.)  A more radical alternative would be to
> install ICU locales in a different schema and use the search_path, or
> even have a separate search path setting for collations only.  Which
> leads to ...
> 
> - Selecting default collation provider:  Maybe we want a setting, say in
> initdb, to determine which provider's collations get the "good" names?
> Maybe not necessary for this release, but something to think about.

I'm not sure I like "default collations" to depend on either search_path
or some hypothetical future setting.  That seems to lead to unproductive
discussions on mailing list questions,

User: hey, my sort doesn't work sanely
Pg person: okay, but what collation are you using?  Look for the collate
clause
User: Ah, it's de_DE
Pg person: oh, okay, but is it ICU's de_DE or the libc's?
User: ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

