date:20180730

RE: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-30 Thread Tsunakawa, Takayuki

From: 'Andres Freund' [mailto:and...@anarazel.de]
> I'm continuing to work on it, but unfortunately there's a couple
> projects that have higher priority atm :(.  I'm doubtful I can have a
> patchset in a committable shape for v12, but I'm pretty sure I'll have
> it in a shape good enough to make progress towards v13.  Sorry :(

We'd like to do it for PG 12, if Kirk's current proposal doesn't find a good 
landing point.  Could you tell us about how many lines of code you predict, and 
what part would be difficult?  Is there any software you're referring to (e.g. 
Linux's fsync, MySQL, etc.)?

Regards
Takayuki Tsunakawa

RE: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-30 Thread Tsunakawa, Takayuki

From: Robert Haas [mailto:robertmh...@gmail.com]
> It's not clear to me whether it would be worth the overhead of doing
> something like this.

Quite frankly, not really to me, too.

> Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.

The purpose is not making relation drops faster (on the primary), but keeping 
failover time within 10 seconds.  I don't really know how crucial that 
requirement is, but I'm feeling it would be good for PostgreSQL to be able to 
guarantee shorter failover time.


Regards
Takayuki Tsunakawa

Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-07-30 Thread Kyotaro HORIGUCHI

At Mon, 30 Jul 2018 16:59:16 +0900, Michael Paquier  wrote 
in <20180730075916.gb2...@paquier.xyz>
> On Fri, Jul 27, 2018 at 08:27:26AM +, Tsunakawa, Takayuki wrote:
> > I don't have a strong opinion, but I wonder which of namespace.c or
> > autovacuum.c is suitable, because isTempNamespaceInUse is specific to
> > autovacuum.
> 
> I think that there is also a point in allowing other backends to use it
> as well, so I left it in namespace.c.  I have been doing more testing
> with this patch today.  In order to catch code paths where this is

+1 for namespace.c from me.

> triggered I have for example added an infinite loop in do_autovacuum
> when finding a temp table which exits once a given on-disk file is
> found.  This lets plenty of time to attach autovacuum to a debugger, and
> play with other sessions in parallel, so I have checked the
> transactional assumption this patch relied on, and tested down to v10 as
> that's where removal of orphaned temp relations has been made more
> aggressive. This patch can also be applied cleanly there.
> 
> The check on MyDatabaseId does not actually matter much as the same
> temporary namespace OID would get reused only after an OID wraparound
> with a backend using a different database but I left it anyway as that's
> more consistent to me to check for database and namespace, and added a
> sanity check to make sure that the routine gets called only by a process
> connected to a database.
> 
> I am going to be out of town for a couple of days, and the next minor
> release planned is close by, so I am not pushing that today, but I'd
> like to do so after the next minor release if there are no objections.
> Attached is a patch with a proposal of commit message.

"int backendID" is left in autovacuum.c as an unused variable.

+  * decide if a temporary file entry can be marked as orphaned or not. Even
+  * if this is an atomic operation, this can be safely done lock-less as no

"Even if this is *not* an atomic operation" ?

+  * Mark the proc entry as owning this namespace which autovacuum uses to
+  * decide if a temporary file entry can be marked as orphaned or not. Even
+  * if this is an atomic operation, this can be safely done lock-less as no
+  * temporary relations associated to it can be seen yet by autovacuum when
+  * scanning pg_class.
+  */
+ MyProc->tempNamespaceId = namespaceId;

The comment looks wrong. After a crash having temp tables,
pg_class has orphaned temp relations and the namespace can be of
reconnected backends especially for low-numbered backends. So
autovacuum may look the shared variable while the reconnected
backend is writing it. The only harmful misbehavior is false
"unused" judgement. This happens only when reusing the same
namespace ID but the temp namespace is already cleaned up by
InitTempTableNamespace in the case. In the case of false "used"
judgement, the later autovacuum rounds will do the clean up
correctly.

In short, I think it is safely done lock-less but the reason is
not no concurrent reads but transient reads do not harm
(currently..).

| * Mark the proc entry as owning this namespace which autovacuum uses to
| * decide if a temporary file entry can be marked as orphaned or not. Even
| * if this is not an atomic operation, this can be safely done lock-less as
| * a transient read doesn't let autovacuum go wrong.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Reduce amount of WAL generated by CREATE INDEX for gist, gin and sp-gist

2018-07-30 Thread Andrey Lepikhov


With the consent of Anastasia I will improving this patch further.
Attachment contains next version of the patch set.

11.07.2018 00:03, Heikki Linnakangas пишет:

On 28/02/18 18:03, Anastasia Lubennikova wrote:



Implementation is based on generic_xlog.


Why? I think we should just add a log_relation() function in 
xloginsert.c directly, alongside log_newpage_buffer().



I have some arguments to stay this functionality at generic_xlog module:
1. xloginsert.c functions work on low level of abstraction, use buffers 
and pages.
2. Code size using generic_xlog service functions looks more compact and 
safe.
This makes the assumption that all the pages in these indexes used the 
standard page layout. I think that's a valid assumption, but needs at 
least a comment on that. And perhaps an Assert, to check that 
pd_lower/upper look sane.

Done


As a further optimization, would it be a win to WAL-log multiple pages 
in each record?
In this version of the patch we use simple optimization: pack 
XLR_NORMAL_MAX_BLOCK_ID blocks pieces into each WAL-record.


This leaves the XLOG_*_CREATE_INDEX WAL record types unused, BTW.


Done

- Heikki



Benchmarks:
---

Test: pgbench -f gin-WAL-test.sql -t 5:
---
master:
Latency average: 27696.299 ms
WAL size: 2.66 GB

patched
Latency average: 22812.103 ms
WAL size: 1.23 GB


Test: pgbench -f gist-WAL-test.sql -t 5:

master:
Latency average: 19928.284 ms
WAL size: 1.25 GB

patched
Latency average: 18175.064 ms
WAL size: 0.63 GB


Test: pgbench -f spgist-WAL-test.sql -t 5:
--
master:
Latency average: 11529.384 ms
WAL size: 1.07 GB

patched
Latency average: 9330.828 ms
WAL size: 0.6 GB

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
>From db400ce9532536da36812dbf0456e756a0ea4724 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Tue, 31 Jul 2018 07:22:17 +0500
Subject: [PATCH 1/4] Relation-into-WAL-function

---
 src/backend/access/transam/generic_xlog.c | 62 +++
 src/include/access/generic_xlog.h |  3 ++
 2 files changed, 65 insertions(+)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index ce023548ae..8397b58ee7 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -80,6 +80,7 @@ static void computeRegionDelta(PageData *pageData,
 static void computeDelta(PageData *pageData, Page curpage, Page targetpage);
 static void applyPageRedo(Page page, const char *delta, Size deltaSize);
 
+static void standard_page_layout_check(Buffer buf);
 
 /*
  * Write next fragment into pageData's delta.
@@ -545,3 +546,64 @@ generic_mask(char *page, BlockNumber blkno)
 
 	mask_unused_space(page);
 }
+
+/*
+ * Check page layout.
+ * Caller must lock the buffer
+ */
+static void
+standard_page_layout_check(Buffer buf)
+{
+	PageHeader ph = (PageHeader) BufferGetPage(buf);
+
+	Assert((ph->pd_lower >= SizeOfPageHeaderData) &&
+		   (ph->pd_lower <= ph->pd_upper) &&
+		   (ph->pd_upper <= ph->pd_special) &&
+		   (ph->pd_special <= BLCKSZ) &&
+		   (ph->pd_special == MAXALIGN(ph->pd_special)));
+}
+
+/*
+ * Function to write generic xlog for every existing block of a relation.
+ * Caller is responsible for locking the relation exclusively.
+ */
+void
+generic_log_relation(Relation rel)
+{
+	BlockNumber blkno;
+	BlockNumber nblocks;
+
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	elog(DEBUG2, "generic_log_relation '%s', nblocks %u BEGIN.",
+		 RelationGetRelationName(rel), nblocks);
+
+	for (blkno = 0; blkno < nblocks; )
+	{
+		GenericXLogState	*state;
+		Bufferbuffer[MAX_GENERIC_XLOG_PAGES];
+		int	counter,
+			blocks_pack;
+
+		CHECK_FOR_INTERRUPTS();
+
+		blocks_pack = ((nblocks-blkno) < MAX_GENERIC_XLOG_PAGES) ?
+		(nblocks-blkno) : MAX_GENERIC_XLOG_PAGES;
+
+		state = GenericXLogStart(rel);
+
+		for (counter = 0 ; counter < blocks_pack; counter++)
+		{
+			buffer[counter] = ReadBuffer(rel, blkno++);
+			standard_page_layout_check(buffer[counter]);
+			LockBuffer(buffer[counter], BUFFER_LOCK_EXCLUSIVE);
+			GenericXLogRegisterBuffer(state, buffer[counter], GENERIC_XLOG_FULL_IMAGE);
+		}
+
+		GenericXLogFinish(state);
+
+		for (counter = 0 ; counter < blocks_pack; counter++)
+			UnlockReleaseBuffer(buffer[counter]);
+	}
+	elog(DEBUG2, "generic_log_relation '%s' END.", RelationGetRelationName(rel));
+}
diff --git a/src/include/access/generic_xlog.h b/src/include/access/generic_xlog.h
index b23e1f684b..1f4b3b7030 100644
--- a/src/include/access/generic_xlog.h
+++ b/src/include/access/generic_xlog.h
@@ -42,4 +42,7 @@ extern const char *generic_identify(uint8 info);
 extern void generic_desc(StringInfo buf, XLogReaderState *record);
 extern void generic_mask(char *pagedata, BlockNumber blkno);
 
+/* other utils */
+extern void generic_log_relation(Relation rel);
+
 #endif

Re: Explain buffers wrong counter with parallel plans

2018-07-30 Thread Amit Kapila

On Tue, Jul 31, 2018 at 12:58 AM, Andres Freund  wrote:
> Hi,
>
> I'm not an expert in the area of the code, but here's a review anyway. I
> did not read through the entire thread.
>
>
> I think we should try to get this fixed soon, to make some progress
> towards release-ability. Or just declare it to be entirely unrelated to
> the release, and remove it from the open items list; not an unreasonable
> choice either. This has been an open item for quite a while...
>

I am okay either way, but I can work on getting it committed for this release.

>
>> diff --git a/src/backend/executor/execParallel.c 
>> b/src/backend/executor/execParallel.c
>> index 60aaa822b7e..ac22bedf0e2 100644
>> --- a/src/backend/executor/execParallel.c
>> +++ b/src/backend/executor/execParallel.c
>> @@ -979,9 +979,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>>   /* Report workers' query for monitoring purposes */
>>   pgstat_report_activity(STATE_RUNNING, debug_query_string);
>>
>> - /* Prepare to track buffer usage during query execution. */
>> - InstrStartParallelQuery();
>> -
>>   /* Attach to the dynamic shared memory area. */
>>   area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false);
>>   area = dsa_attach_in_place(area_space, seg);
>> @@ -993,6 +990,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>>   queryDesc->planstate->state->es_query_dsa = area;
>>   ExecParallelInitializeWorker(queryDesc->planstate, toc);
>>
>> + /* Prepare to track buffer usage during query execution. */
>> + InstrStartParallelQuery();
>> +
>>   /* Run the plan */
>>   ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);
>
> Isn't this an independent change?  And one with potentially negative
> side effects? I think there's some arguments for changing this (and some
> against),
>

This is to ensure that buffer usage stats are tracked consistently in
master and worker backends.   In the master backend, we don't track
the similar stats for ExecutorStart phase.  Also, it would help us
give the same stats for buffer usage in parallel and non-parallel
version of the query.

> but I think it's a bad idea to do so in the same patch.
>

Agreed, we can do this part separately.

>
>> diff --git a/src/backend/executor/execProcnode.c 
>> b/src/backend/executor/execProcnode.c
>> index 36d2914249c..a0d49ec0fba 100644
>> --- a/src/backend/executor/execProcnode.c
>> +++ b/src/backend/executor/execProcnode.c
>> @@ -737,6 +737,13 @@ ExecShutdownNode(PlanState *node)
>>
>>   planstate_tree_walker(node, ExecShutdownNode, NULL);
>>
>> + /*
>> +  * Allow instrumentation to count stats collected during shutdown for
>> +  * nodes that are executed atleast once.
>> +  */
>> + if (node->instrument && node->instrument->running)
>> + InstrStartNode(node->instrument);
>> +
>>   switch (nodeTag(node))
>>   {
>>   case T_GatherState:
>> @@ -755,5 +762,12 @@ ExecShutdownNode(PlanState *node)
>>   break;
>>   }
>>
>> + /*
>> +  * Allow instrumentation to count stats collected during shutdown for
>> +  * nodes that are executed atleast once.
>> +  */
>> + if (node->instrument && node->instrument->running)
>> + InstrStopNode(node->instrument, 0);
>> +
>>   return false;
>>  }
>
> Duplicating the comment doesn't seem like a good idea. Just say
> something like "/* see comment for InstrStartNode above */".
>

Okay, will change in the next version.

>
>> diff --git a/src/backend/executor/nodeLimit.c 
>> b/src/backend/executor/nodeLimit.c
>> index ac5a2ff0e60..cf1851e235f 100644
>> --- a/src/backend/executor/nodeLimit.c
>> +++ b/src/backend/executor/nodeLimit.c
>> @@ -134,6 +134,8 @@ ExecLimit(PlanState *pstate)
>>   node->position - node->offset >= 
>> node->count)
>>   {
>>   node->lstate = LIMIT_WINDOWEND;
>> + /* Allow nodes to release or shut down 
>> resources. */
>> + (void) ExecShutdownNode(outerPlan);
>>   return NULL;
>>   }
>>
>
> Um, is this actually correct?  What if somebody rewinds afterwards?
> That won't happen for parallel query currently, but architecturally I
> don't think we can do so unconditionally?
>

Currently,  this is done for the forward scans (basically under
ScanDirectionIsForward), is that enough, if not, what kind of
additional check do you think we need here?   I am not sure at this
stage how we will design it for backward scans, but I think we already
do this at other places in the code, see below code in ExecutePlan.

ExecutePlan()
{
..
if (numberTuples && numberTuples == current_tuple_count)
{
/* Allow nodes to release or shut down resources. */
(void) ExecShutdownNode(planstate);
break;
}
..
}

-- 
With Regards,
Amit Kapila.
Enterprise

pg_default_acl missing 'n' case in doc

2018-07-30 Thread Fabien COELHO



While looking at ACL prettyprinting, I noticed that "pg_default_acl" 
documentation does not say anything about type 'n' for schema (namespace), 
which seems to be supported according to "\ddp" and the catalog code.


Here is a small addition to add that 'n' is allowed for schema.

--
Fabien.diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index fffb79f713..3bb48d4ccf 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2850,7 +2850,8 @@ SCRAM-SHA-256$:&l
r = relation (table, view),
S = sequence,
f = function,
-   T = type
+   T = type,
+   n = schema

Re: partition tree inspection functions

2018-07-30 Thread Amit Langote

Hi Jesper,

On 2018/07/30 22:21, Jesper Pedersen wrote:
> On 07/30/2018 05:21 AM, Amit Langote wrote:
>>> As 0 is a valid return value for root nodes I think we should use -1
>>> instead for these cases.
>>
>> Makes sense, changed to be that way.
>>
> 
> Looks good, the documentation for pg_partition_level could be expanded to
> describe the -1 scenario.

Done.

> New status: Ready for Committer

Thanks.  Updated patch attached.

Regards,
Amit
>From 71ab5dbcbc1985a3b6450c2bce41639e99f232da Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 16 Jan 2018 19:02:13 +0900
Subject: [PATCH v8] Add assorted partition reporting functions

---
 doc/src/sgml/func.sgml   |  89 ++
 src/backend/catalog/partition.c  | 244 ++-
 src/backend/utils/cache/lsyscache.c  |  22 +++
 src/include/catalog/pg_proc.dat  |  48 ++
 src/include/utils/lsyscache.h|   1 +
 src/test/regress/expected/partition_info.out | 204 ++
 src/test/regress/parallel_schedule   |   2 +-
 src/test/regress/serial_schedule |   1 +
 src/test/regress/sql/partition_info.sql  |  91 ++
 9 files changed, 697 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regress/expected/partition_info.out
 create mode 100644 src/test/regress/sql/partition_info.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..326f4dbe58 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19995,6 +19995,95 @@ postgres=# SELECT * FROM 
pg_walfile_name_offset(pg_stop_backup());
 The function returns the number of new collation objects it created.

 
+   
+Partitioning Information Functions
+
+ 
+  Name Return Type 
Description
+ 
+
+  
+  
+   
pg_partition_parent(regclass)
+   regclass
+   get parent if table is a partition, NULL 
otherwise
+  
+  
+   
pg_partition_root_parent(regclass)
+   regclass
+   get topmost parent of a partition within partition tree
+  
+  
+   pg_partition_level(regclass, 
regclass)
+   regclass
+   get level of a partition within partition tree with respect to 
given parent
+  
+  
+   
pg_partition_level(regclass)
+   regclass
+   get level of a partition within partition tree with respect to 
topmost root parent
+  
+  
+   pg_partition_children(regclass, 
bool)
+   setof regclass
+   
+get partitions of a table; only immediate partitions are returned,
+unless all tables in the partition tree, including itself and
+partitions of lower levels, are requested by passing
+true for second argument
+   
+  
+  
+   
pg_partition_children(regclass)
+   setof regclass
+   Shorthand for pg_partition_children(..., 
false)
+  
+  
+   
pg_partition_leaf_children(regclass)
+   setof regclass
+   get all leaf partitions of a given table
+  
+ 
+
+   
+
+   
+If the table passed to pg_partition_root_parent is not
+a partition, it is returned as the its own root parent.  If the table
+passed for the second argument of pg_partition_level
+is not same as the first argument or not an ancestor of the table in the
+first argument, -1 is returned as the result.
+   
+
+   
+To check the total size of the data contained in
+measurement table described in
+, use the following
+query:
+   
+
+
+select pg_size_pretty(sum(pg_relation_size(p))) as total_size from 
pg_partition_children('measurement', true) p;
+ total_size 
+
+ 24 kB
+(1 row)
+
+
+   
+One could have used pg_partition_leaf_children in
+this case and got the same result as shown below:
+   
+
+
+select pg_size_pretty(sum(pg_relation_size(p))) as total_size from 
pg_partition_leaf_children('measurement') p;
+ total_size 
+
+ 24 kB
+(1 row)
+
+
+
   
 
   
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 558022647c..d90aaf8bb8 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -23,13 +23,16 @@
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_partitioned_table.h"
+#include "funcapi.h"
 #include "nodes/makefuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/prep.h"
 #include "optimizer/var.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/partcache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -38,16 +41,14 @@
 static Oid get_partition_parent_worker(Relation inhRel, Oid relid);
 static void get_partition_ancestors_worker(Relation inhRel, Oid relid,
   List **ancestors);
+static Oid get_partition_root_parent(Oid relid);
+st

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-30 Thread Etsuro Fujita


(2018/07/31 4:06), Andres Freund wrote:

On 2018-07-20 08:38:09 -0400, Robert Haas wrote:

I'm going to study this some more now, but I really think this is
going in the wrong direction.


We're going to have to get somewhere on this topic soon. This thread has
been open for nearly half a year, and we're late in the beta phase now.
What can we do to get to an agreement soon?


I'd like to propose an updated patch as proposed in [1].

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/5B59BA55.30200%40lab.ntt.co.jp

Re: documentation about explicit locking

2018-07-30 Thread Amit Langote

On 2018/07/31 5:27, Peter Eisentraut wrote:
> On 19/07/2018 06:17, Amit Langote wrote:
>> When playing with this, I observed that a less user-friendly error message
>> is emitted if multiple sessions race to create the same collation.
>>
>> Session 1:
>> begin;
>> create collation collname (...);
>>
>> Session 2:
>> create collation collname (...);
>> 
>>
>> Session 1:
>> commit;
>>
>> Session 2:
>> ERROR:  duplicate key value violates unique constraint
>> "pg_collation_name_enc_nsp_index"
>> DETAIL:  Key (collname, collencoding, collnamespace)=(collname, 6, 2200)
>> already exists.
> 
> You get the same behavior with for example CREATE FUNCTION or CREATE
> TYPE.  I don't think we need to fix this specifically for CREATE COLLATION.

Hmm, yeah.  Although fixing the race for CREATE COLLATION seems easier
than other cases due to the self-exclusive lock on the catalog, that
doesn't necessarily mean we have to.

Thanks,
Amit

Re: documentation about explicit locking

2018-07-30 Thread Amit Langote

On 2018/07/31 5:25, Peter Eisentraut wrote:
> On 20/07/2018 02:30, Amit Langote wrote:
>> We don't explicitly mention what locks we take on system catalogs
>> elsewhere, but CREATE COLLATION is different from other commands, so I'm
>> fine with adding more details as you suggest, so updated the text.
> 
> committed the documentation change

Thank you.

Regards,
Amit

Re: [HACKERS] Parallel Append implementation

2018-07-30 Thread Thomas Munro

On Tue, Jul 31, 2018 at 5:05 AM, Robert Haas  wrote:
> New version attached.

Looks good to me.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: Making "COPY partitioned_table FROM" faster

2018-07-30 Thread David Rowley

On 31 July 2018 at 06:21, Peter Eisentraut
 wrote:
> On 30/07/2018 15:26, David Rowley wrote:
>>> - Add some tests.  The if (nBufferedTuples > 0) that flushes the tuples
>>> when the partition changes is not currently exercised.
>>
>> That seems like a good idea. In fact, it uncovered a bug around
>> ConvertPartitionTupleSlot() freeing the previously stored tuple out
>> the slot which resulted in a crash. I didn't notice before because my
>> test had previously not required any tuple conversions.
>
> I think we need to think of a better place to put that temporary file,
> and clean it up properly afterwards.  I'm not sure whether we have
> existing uses like that.

Ideally, I could have written the test in copy2.sql, but since I had
to insert over 1000 rows to trigger the multi-insert code, copy from
stdin was not practical in the .sql file. Instead, I just followed the
lead in copy.source and used the same temp file style as the previous
test. It also leaves that file laying around, it just happens to be
smaller.  I've updated the test to truncate the table again and
perform another COPY TO to empty the file.  I didn't see any way to
remove the file due to lack of standard way of removing files in the
OS. \! rm ... is not going to work on windows, for example.

I also failed to realise how the .source files work previously. I see
there are input and output directories. I'd missed updating the output
one in the v5 delta patch.

> Also, maybe the test should check afterwards that the right count of
> rows ended up in each partition?

Agreed. I've added that.

The attached v6 delta replaces the v5 delta and should be applied on
top of the full v4 patch. I'm using deltas in the hope they're easier
to review the few changes than reviewing the entire patch again.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v6_multiinsert_copy_delta.patch
Description: Binary data

Re: make installcheck-world in a clean environment

2018-07-30 Thread Tom Lane

I wrote:
> The original complaint about ecpg remains; I'm not terribly excited
> about messing with that.

Well, I got curious as to why we were seeing such a weird error,
and eventually traced it to this stuff in ecpg/test/Makefile.regress:

# Standard way to invoke the ecpg preprocessor
ECPG = ../../preproc/ecpg --regression -I$(srcdir)/../../include -I$(srcdir)

# Files that most or all ecpg preprocessor test outputs depend on
ECPG_TEST_DEPENDENCIES = ../../preproc/ecpg$(X) \
$(srcdir)/../regression.h \
$(srcdir)/../../include/sqlca.h \
$(srcdir)/../../include/sqlda.h \
$(srcdir)/../../include/sqltypes.h \
$(srcdir)/../../include/sql3types.h

%.c: %.pgc $(ECPG_TEST_DEPENDENCIES)
$(ECPG) -o $@ $<

Now, the fine gmake manual quoth:

`%' in a prerequisite of a pattern rule stands for the same stem
that was matched by the `%' in the target.  In order for the pattern
rule to apply, its target pattern must match the file name under
consideration and all of its prerequisites (after pattern substitution)
must name files that exist or can be made.

So the problem is that (after make clean) ../../preproc/ecpg doesn't
exist, and the Makefiles under test/ have no idea how to build it,
and thus this pattern rule is inapplicable.  Thus you end up getting
"No rule to make target" errors.

While it seems straightforward to fix this for "make check" scenarios ---
just go make ecpg before trying to make the tests --- I think these rules
are very surprising for "make installcheck" cases.  You'd expect "make
installcheck" to test the installed ecpg, as indeed Alexander pointed out
at the start of the thread.  But it's not doing that.

Alexander's USE_INSTALLED_ASSETS patch attempted to fix that, and I now
see the point of it, but it still seems pretty hacky and invasive (why
should an ecpg-only problem be touching stuff outside ecpg)?  Also,
unless I'm still missing something, it doesn't fix the problem with
"make clean check": ecpg won't have been built before we try to build the
test programs.

I'd suggest trying to simplify the USE_INSTALLED_ASSETS patch so it
doesn't muck with the rules for building pg_regress.  I don't find
that to be very relevant to the problem.

regards, tom lane

Re: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-30 Thread Thomas Munro

On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas  wrote:
> On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk  
> wrote:
>> 1. Because the multiple scans of the whole shared buffer per concurrent 
>> truncate/drop table was the cause of the time-consuming behavior, DURING the 
>> failover process while WAL is being applied, we temporary delay the scanning 
>> and invalidating of shared buffers. At the same time, we remember the 
>> relations/relfilenodes (of dropped/truncated tables) by adding them in a 
>> hash table called "skip list".
>> 2. After WAL is applied, the checkpoint(or bg writer) scans the shared 
>> buffer only ONCE, compare the pages against the skip list, and invalidates 
>> the relevant pages. After deleting the relevant pages on the shared memory, 
>> it will not be written back to the disk.
>>
>> Assuming the theory works, this design will only affect the behavior of 
>> checkpointer (or maybe bg writer) during recovery process / failover. Any 
>> feedback, thoughts?
>
> How would this work if a relfilenode number that belonged to an old
> relation got recycled for a new relation?
>
> I think something like this could be made to work -- both on the
> master and the standby, and not just while waiting for a failover --
> if we did something like this:
>
> (1) Limit the number of deferred drops to a reasonably small number
> (one cache line?  1kB?).
> (2) Background writer regularly does scans do clean out everything
> from the deferred-drop list.
> (3) If a foreground process finds the deferred-drop list is full when
> it needs to add to it, it forces a clean-out of the list contents +
> whatever new stuff it has in a single pass.
> (4) If we are about generate a relfilenode that's in the list, we
> either force a clean out or generate a different relfilenode instead.
> (5) Every buffer cleaning operation checks the deferred-drop list and
> invalidates without write-out if the buffer is found.
>
> It's not clear to me whether it would be worth the overhead of doing
> something like this.  Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.

Perhaps you could make it a bit more incremental, avoiding the full
clean-out scans (your points 2 and 3) while still allowing the
background writer to bear the brunt in the best case?  Something like
this:

The zombie relfilenode tracker could be fixed-size and self-cleaning:
entries could be dropped when the CLOCK hand completes a full rotation
since the entry was created, and if it's full when you're trying to
insert a new entry you can scan the buffer pool until you've caused
one entry (the oldest, by clock hand-at-time-of-insertion) to be
remove to make space (so the worst case is a full pass, but hopefully
we can drop one entry before then).  The main argument I can think of
against that idea is that it is tied to the current buffer eviction
strategy, using its buffer index-based clock hand as a horizon.
Extracting a useful horizon for algorithms like ARC or CAR would
probably need more thought, because their 'hands' jump around
following next pointers.

Anyway, it's a lot of complexity, and it falls back to a worst cases
like today, and can also transfer work to innocent foreground
processes.  I see why Andres says we should just get a better data
structure so we can make the guy doing the dropping pay for it up
front, but more efficiently.  I suspect we may also want an ordered
data structure for write clustering purposes one day.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: FailedAssertion on partprune

2018-07-30 Thread Alvaro Herrera

On 2018-Jul-25, David Rowley wrote:

> Thinking again about the patch I submitted upthread; I wonder if it's
> actually possible to support pruning with Jamie's query. Without
> looking at the code, I don't quite see the reason that the
> sub-partitioned table wouldn't be correctly pruned by the run-time
> pruning code.  It could just be a matter of removing the failing
> Assert(). I'll do a bit more testing and confirm.

Not looking at the code right now either, but removing that assert and
then removing the TABLESAMPLE clause, the query returns identical
results with and without pruning, so maybe you're right.  No time for
further looking now.

(SELECT 'Jaime' <> 'Jamie' COLLATE es_EC)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Avoid extra Sort nodes between WindowAggs when sorting can be reused

2018-07-30 Thread Daniel Gustafsson

> On 27 Jul 2018, at 21:12, Alexander Kuzmenkov  
> wrote:

> Thanks for the update.

Thank you for reviewing and hacking!

> On 07/25/2018 01:37 AM, Daniel Gustafsson wrote:
>> 
>>> Hmm, this is missing the eqop fields of SortGroupClause. I haven't
>>> tested yet but does the similar degradation happen if two
>>> SortGroupCaluses have different eqop and the same other values?
>> I wasn’t able to construct a case showing this, but I also think you’re 
>> right.
>> Do you have an idea of a query that can trigger a regression?  The attached
>> patch adds a stab at this, but I’m not sure if it’s the right approach.
> 
> To trigger that, in your test example you could order by empno::int8 for one 
> window, and by empno::int2 for another. But don't I think you have to compare 
> eqop here, because if eqop differs, sortop will differ too. I removed the 
> comparison from the patch.

Right, that makes sense.

> I also clarified (I hope) the comments, and did the optimization I mentioned 
> earlier: using array instead of list for active clauses. Please see the 
> attached v6.

Thanks, looks good.

> Otherwise I think the patch is good.

Cool, thanks for reviewing!

cheers ./daniel

Re: negative bitmapset member not allowed Error with partition pruning

2018-07-30 Thread Alvaro Herrera

On 2018-Jul-27, David Rowley wrote:

> On 27 July 2018 at 15:14, Tom Lane  wrote:

> > Well, my thinking is that it helps nobody if call sites have to have
> > explicit workarounds for a totally-arbitrary refusal to handle edge
> > cases in the primitive functions.  I do not think that is good software
> > design.  If you want to have assertions that particular call sites are
> > specifying nonempty ranges, put those in the call sites where it's
> > important.  But as-is, this seems like, say, defining foreach() to
> > blow up on an empty list.
> 
> Okay, that's a fair point. I agree,  adding Asserts at the current
> call sites seems better.

Given the discussion, I pushed two commits: first, bms_add_range returns
the input bms if the range is empty, also adding Rajkumar's test case,
which I also verified to reproduce the bug, and passes (for me) with the
bms_add_range change.

The second commit includes the proposed asserts, but not the change to
avoid calling bms_add_range when the range is empty.

Thanks!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: make installcheck-world in a clean environment

2018-07-30 Thread Tom Lane

Alexander Lakhin  writes:
> 14.07.2018 13:57, Peter Eisentraut wrote:
>> On 06.07.18 09:45, Alexander Lakhin wrote:
>>> ./configure --enable-tap-tests
>>> make install
>>> make install -C contrib
>>> chown -R postgres:postgres /usr/local/pgsql/
>>> /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
>>> /usr/local/pgsql/bin/pg_ctl start -l logfile -D /usr/local/pgsql/data
>>> /make clean/
>>> # Also you can just install binary packages to get the same state.
>>> 
>>> make installcheck-world
>>> # This check fails.

I remain pretty skeptical that this is a sensible way to proceed,
especially not if what you're testing is installed binary packages.
You're creating yet *another* hazard for version-skew-like problems,
namely that there's no certainty that you gave configure arguments
that're compatible with what the installed packages used.

But having said that ...

>> For me, this fails at
>> In file included from ../../src/include/postgres.h:47,
>> from chklocale.c:17:
>> ../../src/include/utils/elog.h:71:10: fatal error: utils/errcodes.h: No
>> such file or directory
>> That's probably fixable.  But it's different from what you had initially
>> reported.

> This error wasn't present in master at the time of initial report (in
> April). As "git bisect" shows such errors appeared after 372728b.
> So in REL_10_STABLE (or in master before 372728b) you could "make
> installcheck" (but not installcheck-world) in a clean environment.

... I think this was actually induced by 3b8f6e75f, for which we already
had to make some adjustments to ensure that the generated headers got
built when needed.  This seems like another such case, so I stuck in a
couple more submake-generated-headers dependencies to fix it.

The original complaint about ecpg remains; I'm not terribly excited
about messing with that.

regards, tom lane

RE: GSOC 2018 Project - A New Sorting Routine

2018-07-30 Thread Kefan Yang

Hey Tomas! 

Sorry to bother but it would be great if we can get the test results this week.

Regards,
Kefan

From: Tomas Vondra
Sent: July 24, 2018 8:16 AM
To: Kefan Yang
Cc: Andrey Borodin; Peter Geoghegan; alvhe...@2ndquadrant.com; PostgreSQL 
Hackers
Subject: Re: GSOC 2018 Project - A New Sorting Routine



On 07/24/2018 12:21 AM, Kefan Yang wrote:
> Hi Tomas!
> 
> I did a few tests on my own Linux machine, but the problem is that my 
> resources on AWS(CPU, RAM and even Disk space) are very limited. I 
> considered establishing virtual machine on my own PC but the performance 
> is even worse.
> 
> My original patch has two main optimizations: (1) switch to heap sort 
> when depth limit exceeded (2) check whether the array is presorted only 
> once at the beginning. Now I want to test these optimizations 
> separately. On AWS EC2 instance, regressions on CREATE INDEX cases seems 
> to be less significant if we use (1) only, but I can only test up to 
> 10 records and 512MB memory using your scripts.
> 
> So would you mind re-running the tests using the two patches I provided 
> in the attachment? That will be very helpful
> 

I can do that, but it'll have to wait a couple of days. I'm currently 
using the boxes for some other tests.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Segfault logical replication PG 10.4

2018-07-30 Thread Mai Peng

Thank you Alvaro :)

> Le 30 juil. 2018 à 22:33, Alvaro Herrera  a écrit :
> 
> On 2018-Jul-28, Alvaro Herrera wrote:
> 
>> Aha, I see, thanks.  Here's a complete fix with included testcase.  In
>> an unpatched assert-enabled build, this crashes this
>> 
>> TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: 
>> "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788)
>> 
>> Will push on Monday.
> 
> Pushed after changing the constraint in the test case to be less silly.
> Thanks for the report and diagnosis.
> 
> -- 
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Making "COPY partitioned_table FROM" faster

2018-07-30 Thread Melanie Plageman

On Mon, Jul 30, 2018 at 11:21 AM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 30/07/2018 15:26, David Rowley wrote:
> >> - Add some tests.  The if (nBufferedTuples > 0) that flushes the tuples
> >> when the partition changes is not currently exercised.
> >
> > That seems like a good idea. In fact, it uncovered a bug around
> > ConvertPartitionTupleSlot() freeing the previously stored tuple out
> > the slot which resulted in a crash. I didn't notice before because my
> > test had previously not required any tuple conversions.
>
> I think we need to think of a better place to put that temporary file,
> and clean it up properly afterwards.  I'm not sure whether we have
> existing uses like that.
>
> Also, maybe the test should check afterwards that the right count of
> rows ended up in each partition?
>
> Yea, I actually would suggest changing the data inserted in the third
insert statement to have 'Three' in the third column:
insert into parted_copytest select x,1,'One' from
generate_series(1011,1020) x;
And then this check:
select count(*) from parted_copytest group by a, b, c;

Re: Segfault logical replication PG 10.4

2018-07-30 Thread Alvaro Herrera

On 2018-Jul-28, Alvaro Herrera wrote:

> Aha, I see, thanks.  Here's a complete fix with included testcase.  In
> an unpatched assert-enabled build, this crashes this
> 
> TRAP: FailedAssertion("!(ActiveSnapshotSet())", File: 
> "/pgsql/source/REL_10_STABLE/src/backend/tcop/postgres.c", Line: 788)
> 
> Will push on Monday.

Pushed after changing the constraint in the test case to be less silly.
Thanks for the report and diagnosis.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: documentation about explicit locking

2018-07-30 Thread Peter Eisentraut

On 19/07/2018 06:17, Amit Langote wrote:
> When playing with this, I observed that a less user-friendly error message
> is emitted if multiple sessions race to create the same collation.
> 
> Session 1:
> begin;
> create collation collname (...);
> 
> Session 2:
> create collation collname (...);
> 
> 
> Session 1:
> commit;
> 
> Session 2:
> ERROR:  duplicate key value violates unique constraint
> "pg_collation_name_enc_nsp_index"
> DETAIL:  Key (collname, collencoding, collnamespace)=(collname, 6, 2200)
> already exists.

You get the same behavior with for example CREATE FUNCTION or CREATE
TYPE.  I don't think we need to fix this specifically for CREATE COLLATION.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: documentation about explicit locking

2018-07-30 Thread Peter Eisentraut

On 20/07/2018 02:30, Amit Langote wrote:
> We don't explicitly mention what locks we take on system catalogs
> elsewhere, but CREATE COLLATION is different from other commands, so I'm
> fine with adding more details as you suggest, so updated the text.

committed the documentation change

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-30 Thread 'Andres Freund'

Hi,

On 2018-07-30 05:22:48 +, Jamison, Kirk wrote:
> BTW, are there any updates whether the community will push through
> anytime soon regarding the buffer mapping implementation you
> mentioned?

I'm continuing to work on it, but unfortunately there's a couple
projects that have higher priority atm :(.  I'm doubtful I can have a
patchset in a committable shape for v12, but I'm pretty sure I'll have
it in a shape good enough to make progress towards v13.  Sorry :(

Greetings,

Andres Freund

Re: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-30 Thread Andres Freund

Hi,

On 2018-07-30 16:01:53 -0400, Robert Haas wrote:
> (1) Limit the number of deferred drops to a reasonably small number
> (one cache line?  1kB?).

Yea, you'd have to, because we'd frequently need to check it, and it'd
need to be in shared memory. But that'd still leave us to regress to
O(n^2) as soon as a transaction goes over that limit - which I don't
think is that infrequent...

Greetings,

Andres Freund

Re: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-30 Thread Robert Haas

On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk  wrote:
> 1. Because the multiple scans of the whole shared buffer per concurrent 
> truncate/drop table was the cause of the time-consuming behavior, DURING the 
> failover process while WAL is being applied, we temporary delay the scanning 
> and invalidating of shared buffers. At the same time, we remember the 
> relations/relfilenodes (of dropped/truncated tables) by adding them in a hash 
> table called "skip list".
> 2. After WAL is applied, the checkpoint(or bg writer) scans the shared buffer 
> only ONCE, compare the pages against the skip list, and invalidates the 
> relevant pages. After deleting the relevant pages on the shared memory, it 
> will not be written back to the disk.
>
> Assuming the theory works, this design will only affect the behavior of 
> checkpointer (or maybe bg writer) during recovery process / failover. Any 
> feedback, thoughts?

How would this work if a relfilenode number that belonged to an old
relation got recycled for a new relation?

I think something like this could be made to work -- both on the
master and the standby, and not just while waiting for a failover --
if we did something like this:

(1) Limit the number of deferred drops to a reasonably small number
(one cache line?  1kB?).
(2) Background writer regularly does scans do clean out everything
from the deferred-drop list.
(3) If a foreground process finds the deferred-drop list is full when
it needs to add to it, it forces a clean-out of the list contents +
whatever new stuff it has in a single pass.
(4) If we are about generate a relfilenode that's in the list, we
either force a clean out or generate a different relfilenode instead.
(5) Every buffer cleaning operation checks the deferred-drop list and
invalidates without write-out if the buffer is found.

It's not clear to me whether it would be worth the overhead of doing
something like this.  Making relation drops faster at the cost of
making buffer cleaning slower could be a loser.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Explain buffers wrong counter with parallel plans

2018-07-30 Thread Andres Freund

Hi,

I'm not an expert in the area of the code, but here's a review anyway. I
did not read through the entire thread.


I think we should try to get this fixed soon, to make some progress
towards release-ability. Or just declare it to be entirely unrelated to
the release, and remove it from the open items list; not an unreasonable
choice either. This has been an open item for quite a while...


> diff --git a/src/backend/executor/execParallel.c 
> b/src/backend/executor/execParallel.c
> index 60aaa822b7e..ac22bedf0e2 100644
> --- a/src/backend/executor/execParallel.c
> +++ b/src/backend/executor/execParallel.c
> @@ -979,9 +979,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>   /* Report workers' query for monitoring purposes */
>   pgstat_report_activity(STATE_RUNNING, debug_query_string);
>  
> - /* Prepare to track buffer usage during query execution. */
> - InstrStartParallelQuery();
> -
>   /* Attach to the dynamic shared memory area. */
>   area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false);
>   area = dsa_attach_in_place(area_space, seg);
> @@ -993,6 +990,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
>   queryDesc->planstate->state->es_query_dsa = area;
>   ExecParallelInitializeWorker(queryDesc->planstate, toc);
>  
> + /* Prepare to track buffer usage during query execution. */
> + InstrStartParallelQuery();
> +
>   /* Run the plan */
>   ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);

Isn't this an independent change?  And one with potentially negative
side effects? I think there's some arguments for changing this (and some
against), but I think it's a bad idea to do so in the same patch.


> diff --git a/src/backend/executor/execProcnode.c 
> b/src/backend/executor/execProcnode.c
> index 36d2914249c..a0d49ec0fba 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -737,6 +737,13 @@ ExecShutdownNode(PlanState *node)
>  
>   planstate_tree_walker(node, ExecShutdownNode, NULL);
>  
> + /*
> +  * Allow instrumentation to count stats collected during shutdown for
> +  * nodes that are executed atleast once.
> +  */
> + if (node->instrument && node->instrument->running)
> + InstrStartNode(node->instrument);
> +
>   switch (nodeTag(node))
>   {
>   case T_GatherState:
> @@ -755,5 +762,12 @@ ExecShutdownNode(PlanState *node)
>   break;
>   }
>  
> + /*
> +  * Allow instrumentation to count stats collected during shutdown for
> +  * nodes that are executed atleast once.
> +  */
> + if (node->instrument && node->instrument->running)
> + InstrStopNode(node->instrument, 0);
> +
>   return false;
>  }

Duplicating the comment doesn't seem like a good idea. Just say
something like "/* see comment for InstrStartNode above */".


> diff --git a/src/backend/executor/nodeLimit.c 
> b/src/backend/executor/nodeLimit.c
> index ac5a2ff0e60..cf1851e235f 100644
> --- a/src/backend/executor/nodeLimit.c
> +++ b/src/backend/executor/nodeLimit.c
> @@ -134,6 +134,8 @@ ExecLimit(PlanState *pstate)
>   node->position - node->offset >= 
> node->count)
>   {
>   node->lstate = LIMIT_WINDOWEND;
> + /* Allow nodes to release or shut down 
> resources. */
> + (void) ExecShutdownNode(outerPlan);
>   return NULL;
>   }
>  

Um, is this actually correct?  What if somebody rewinds afterwards?
That won't happen for parallel query currently, but architecturally I
don't think we can do so unconditionally?


Greetings,

Andres Freund

Re: Documenting that queries can be run over replication protocol

2018-07-30 Thread Peter Eisentraut

On 24/07/2018 09:54, Chris Travers wrote:
> In the process of building some tooling based on replication we
> discovered that PostgreSQL 10 uses COPY to populate the initial table on
> logical replication, see commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
> 
> This is not currently documented.  Attached is a patch which adds that
> documentation to the logical replication part of the protocol section.

It already says earlier in that same section:

"In logical replication walsender mode, the replication commands shown
below as well as normal SQL commands can be issued."

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Getting rid of "accept incoming network connections" prompts on OS X

2018-07-30 Thread Peter Eisentraut

On 26/07/2018 23:45, Tom Lane wrote:
> This came up again today, and I've confirmed that the issue still exists
> in current macOS.  Did you get any response to your bug report, and if
> so what did they say?

There hasn't been any response to the radar.  I think our analysis is
correct, it's an OS bug.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-30 Thread Andres Freund

Hi,

On 2018-07-20 08:38:09 -0400, Robert Haas wrote:
> I'm going to study this some more now, but I really think this is
> going in the wrong direction.

We're going to have to get somewhere on this topic soon. This thread has
been open for nearly half a year, and we're late in the beta phase now.
What can we do to get to an agreement soon?

Greetings,

Andres Freund

Re: Making "COPY partitioned_table FROM" faster

2018-07-30 Thread Peter Eisentraut

On 30/07/2018 15:26, David Rowley wrote:
>> - Add some tests.  The if (nBufferedTuples > 0) that flushes the tuples
>> when the partition changes is not currently exercised.
> 
> That seems like a good idea. In fact, it uncovered a bug around
> ConvertPartitionTupleSlot() freeing the previously stored tuple out
> the slot which resulted in a crash. I didn't notice before because my
> test had previously not required any tuple conversions.

I think we need to think of a better place to put that temporary file,
and clean it up properly afterwards.  I'm not sure whether we have
existing uses like that.

Also, maybe the test should check afterwards that the right count of
rows ended up in each partition?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Adding a note to protocol.sgml regarding CopyData

2018-07-30 Thread Fabien COELHO




Hello Tatsuo-san,

Minor suggestions, although I'm not a native English speaker.


In libpq.sgml following is stated:

   Before PostgreSQL protocol 3.0, it was 
necessary
   for the application to explicitly send the two characters
   \. as a final line to indicate to the server that it


ISTM That "it" may refer to the server... Put "the application" instead?


had
   finished sending COPY data.  While this still works, 
it is deprecated and the
   special meaning of \. can be expected to be removed 
in a
   future release.


Maybe be more assertive, "will be removed"?


 It is sufficient to call PQendcopy after


"It is now sufficient ..."?


   having sent the actual data.

I think this should be mentioned in protocol.sgml as well. Developers
who wish to develop programs that understand frontend/backend protocol
should be able to focus on protocol.sgml. Attached is a patch for
this.



--
Fabien.

Re: [HACKERS] PoC: full merge join on comparison clause

2018-07-30 Thread Alexander Kuzmenkov


El 18/07/18 a las 16:58, Ashutosh Bapat escribió:


Thanks for the commit messages. I would use word "in-equality" instead
of "comparison" since equality is also a comparison.


Fixed.


Comparing this with the original code, I think, is_mj_equality should be true
if restrictinfo->mergeopfamilies is not NIL.


My mistake, fixed.


With this work the meaning of oprcanmerge (See pg_operator catalog and also
CREATE OPERATOR syntax) changes. Every btree operator can now be used to
perform a merge join. oprcanmerge however only indicates whether an operator is
an equality or not. Have you thought about that? Do we require to re-define
oprcanmerge?


For now we can test with old oprcanmerge meaning, not to bump the 
catalog version. Merge join needs only BTORDER_PROC function, which is 
required for btree opfamilies. This means that it should be always 
possible to merge join on operators that correspond to standard btree 
strategies. We could set oprcanmerge to true for all built-in btree 
comparison operators, and leave the possibility to disable it for custom 
operators.



I think, it should be possible to use this technique with more than one
inequality clauses as long as all the operators require the input to be ordered
in the same direction and the clauses are ANDed. In that case the for a given
outer tuple the matching inner tuples form a contiguous interval.


Consider a table "t(a int, b int)", the value of each column can be 1, 
2, 3, 4 and the table contains all possible combinations. If merge 
condition is "a < 2 and b < 2", for each of the four possible sorting 
directions, the result set won't be contiguous. Generally speaking, this 
happens when we have several groups with the same value of first column, 
and the first column matches the join condition. But inside each group, 
for some rows the second column doesn't match.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From c817ac1f93b83bcf43afac4af2dbaed37403a4a2 Mon Sep 17 00:00:00 2001
From: Alexander Kuzmenkov 
Date: Tue, 10 Apr 2018 12:31:21 +0300
Subject: [PATCH 2/2] Inequality merge join.

Perform merge joins on inequality clause. The current merge join
algorithm requires minimal modification to support one inequality clause
at the final position. This has performance benefits in some cases, and also
allows to perform full joins on inequality, which was not possible
before.
This commit modifies the merge join path generation logic and cost functions to
account for inequality clauses, and adds some tests.
---
 src/backend/executor/nodeMergejoin.c | 136 +--
 src/backend/optimizer/path/costsize.c|  27 ++-
 src/backend/optimizer/path/joinpath.c|  27 ++-
 src/backend/optimizer/path/pathkeys.c| 218 ---
 src/backend/optimizer/plan/initsplan.c   |  28 ++-
 src/backend/utils/adt/selfuncs.c |  40 +++--
 src/backend/utils/cache/lsyscache.c  |  40 +
 src/include/nodes/execnodes.h|   5 +
 src/include/optimizer/paths.h|   3 +-
 src/include/utils/lsyscache.h|   1 +
 src/test/regress/expected/join.out   | 250 +++
 src/test/regress/expected/partition_join.out |   4 +
 src/test/regress/sql/join.sql|  66 ++-
 src/test/regress/sql/partition_join.sql  |   5 +
 14 files changed, 750 insertions(+), 100 deletions(-)

diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 7298e1c..d6e5556 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -89,6 +89,45 @@
  *		proceed to another state.  This state is stored in the node's
  *		execution state information and is preserved across calls to
  *		ExecMergeJoin. -cim 10/31/89
+ *
+ * 		This algorithm can work almost as-is when the last join clause
+ * 		is an inequality. This introduces an additional restriction to
+ * 		the ordering of the inputs: when moving to the next outer tuple,
+ * 		the beginning of the matching interval of inner tuples must not
+ * 		change. For example, if the join operator is ">=", inputs must
+ * 		be in ascending order.
+ *
+ * 		Consider this example:
+ * 			outer  >= innner
+ * 1		0 - first match for outer = 1, 2
+ * 2		1 - last match for outer = 1
+ * 		2 - last match for outer = 2
+ *
+ * 		And if the inputs were sorted in descending order:
+ * 			outer  >= inner
+ * 2		2 - first match for outer = 2
+ * 1		1 - first match for outer = 1
+ * 		0 - last match for outer = 1, 2
+ *
+ * 		It can be seen that the beginning of the matching interval of
+ * 		inner tuples changes when we move to the next outer tuple.
+ * 		Supporting this, i.e. testing and advancing the marked tuple,
+ * 		would complicate the join algorithm. Instead of that, we have
+ * 		the planner ensure that the inputs are suitably ordered, and
+ * 		recheck this on in

Re: request for new parameter for disable promote (slave only mode)

2018-07-30 Thread Robert Haas

On Fri, Jul 27, 2018 at 12:05 PM, Ioseph Kim  wrote:
> I want to build one master & multi slave environments to use physical 
> replication.
> Slave nodes have low hardware spec, so I changed max_connection server 
> parameters, and try start slave node.
> But I could not start slave nodes,
> because CheckRequiredParameterValues function (in 
> src/backend/access/transam/xlog.c) reject to set less values then master’s 
> values.
>
> It is maybe,
> This concept is base on some slave node can be promte master.

No, it's because the machinery that computes MVCC snapshots on slaves
relies on max_connections being set high enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Parallel Append implementation

2018-07-30 Thread Robert Haas

On Sun, Jul 29, 2018 at 5:49 PM, Thomas Munro
 wrote:
> On Thu, May 10, 2018 at 7:08 AM, Robert Haas  wrote:
>>  [parallel-append-doc-v2.patch]
>
> +plans just as they can in any other plan.  However, in a parallel plan,
> +it is also possible that the planner may choose to substitute a
> +Parallel Append node.
>
> Maybe drop "it is also possible that "?  It seems a bit unnecessary
> and sounds a bit odd followed by "may ", but maybe it's just me.

Changed.

> +Also, unlike a regular Append node, which can only 
> have
> +partial children when used within a parallel plan, Parallel
> +Append node can have both partial and non-partial child plans.
>
> Missing "a" before "Parallel".

Fixed.

> +Non-partial children will be scanned by only a single worker, since
>
> Are we using "worker" in a more general sense that possibly includes
> the leader?  Hmm, yes, other text on this page does that too.  Ho hum.

Tried to be more careful about this.

New version attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


parallel-append-doc-v3.patch
Description: Binary data

Re: Would like to help with documentation for Postgres 11

2018-07-30 Thread Michael Goldshteyn

Thanks for the replies, I'll investigate further..

On Sun, Jul 29, 2018 at 7:11 PM Tatsuo Ishii  wrote:

> > Justin Pryzby  writes:
> >> On Sun, Jul 29, 2018 at 11:50:40AM -0500, Michael Goldshteyn wrote:
> >>> I would like to offer some help writing and improving the English
> >>> documentation for some of the new features and changes in Postgres 11.
> If I
> >>> can get an email of where such help would be appreciated, so I can
> choose a
> >>> feature I am familiar with, I would be glad to help.
> >
> >> The documentation is expected to be commited with the feature, so what's
> >> currently in place is expected to be adequate and accuate.
> >
> > There is, however, a fair amount of stuff that was written by people
> whose
> > first language isn't English, and it shows.  So plain old copy-editing
> > is surely welcome, if you have a mind to do that.
>
> BTW, the CommitFest application seems not to pick up threads other
> than in pgsql-hackers. This is usually fine. However if people want to
> contribute document *only* patches and send it to pgsql-docs then
> tries to register it into CF, the CF app does not recognize it. I am
> not sure if this is intended behavior or not.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>

Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack

2018-07-30 Thread Bossart, Nathan

On 7/29/18, 7:35 PM, "Michael Paquier"  wrote:
> Yeah, I was testing that yesterday night and bumped on this case when
> trying do a REINDEX SCHEMA pg_class.  The point is that you can simplify
> the check and remove pg_database_ownercheck as there is already an ACL
> check on the database/system/schema at the top of the routine, so you
> are already sure that pg_database_ownercheck() or
> pg_namespace_ownercheck would return true.  This shaves a few cycles as
> well for each relation checked.

Makes sense.

>> I also noticed that this patch causes shared relations to be skipped
>> silently.  Perhaps we should emit a WARNING or DEBUG message when this
>> happens, at least for REINDEXOPT_VERBOSE.
>
> That's intentional.  I thought about that as well, but I am hesitant to
> do so as we don't bother mentioning the other relations skipped.
> REINDEX VERBOSE also shows up what are the tables processed, so it is
> easy to guess what are the tables skipped, still more painful.  And the
> documentation changes added cover the gap.

Okay, that seems reasonable to me, too.

> +1, I have included your suggestions.  The patch attached applies easily
> down to 9.5 where REINDEX SCHEMA was added.  For 9.4 and 9.3, there is
> no schema case, still the new check is similar.  The commit message is
> slightly changed so as there is no mention of REINDEX SCHEMA.  9.3 needs
> a slight change compared to 9.4 as well.

For 9.3 and 9.4, it might be nice to add the "even if the user is the
owner of the specified database" part, too.

> So attached are patches for 9.5~master, 9.4 and 9.3 with commit
> messages.  Does that look fine to folks of this thread?

Looks good to me.  Since REINDEX can be used to block calls to
load_critical_index() from new connections, back-patching seems appropriate.

Nathan

Re: GiST VACUUM

2018-07-30 Thread Heikki Linnakangas


On 29/07/18 14:47, Andrey Borodin wrote:

Fixed both problems. PFA v14.


Thanks, took a really quick look at this.

The text being added to README is outdated for these latest changes.


In second step I still use paloc's memory, but only to store two
bitmaps: bitmap of internal pages and bitmap of empty leafs. Second
physical scan only reads internal pages. I can omit that bitmap, if
I'll scan everything. Also, I can replace emptyLeafs bitmap with
array\list, but I do not really think it will be big.


On a typical GiST index, what's the ratio of leaf vs. internal pages? 
Perhaps an array would indeed be better. If you have a really large 
index, the bitmaps can take a fair amount of memory, on top of the 
memory used for tracking the dead TIDs. I.e. that memory will be in 
addition to maintenance_work_mem. That's not nice, but I think it's OK 
in practice, and not worth spending too much effort to eliminate. For a 
1 TB index with default 8k block size, the two bitmaps will take 32 MB 
of memory in total. If you're dealing with a database of that size, you 
ought to have some memory to spare. But if an array would use less 
memory, that'd be better.


If you go with bitmaps, please use the existing Bitmapset instead of 
rolling your own. Saves some code, and it provides more optimized 
routines for iterating through all the set bits, too 
(bms_next_member()). Another possibility would be to use Tidbitmap, in 
the "lossy" mode, i.e. add the pages with tbm_add_page(). That might 
save some memory, compared to Bitmapset, if the bitmap is very sparse. 
Not sure how it compares with a plain array.


A straightforward little optimization would be to skip scanning the 
internal pages, when the first scan didn't find any empty pages. And 
stop the scan of the internal pages as soon as all the empty pages have 
been recycled.


- Heikki

Re: Explain buffers wrong counter with parallel plans

2018-07-30 Thread Jonathan S. Katz


> On Jul 28, 2018, at 2:14 AM, Amit Kapila  wrote:
> 
> On Fri, Jul 27, 2018 at 11:12 PM, Jonathan S. Katz
>  wrote:
>> 
>>> On Jul 27, 2018, at 8:31 AM, Amit Kapila  wrote:
>>> 
>>> 
>>> Yeah, that would be better.  Today, I have tried the patch on both
>>> Head and PG11 and I am getting same and correct results.
>> 
>> I have applied the the patch to PG11beta2 and tested.
>> 
> 
> I think we should backpatch this till 9.6 where the parallel query was
> introduced.  Note, that for HEAD and 11, the patch is same.  For PG10,
> the patch code is same, but because surrounding code is different, the
> same patch didn't apply.  For 9.6, we don't need to collect stats in
> ExecShutdownNode.   I have tested it in all the back branches and it
> works fine.

The logic on backpatching seems sounds. I confirmed my tests of the respective
patches against 9.6.9 and 10.4. I'll defer to someone else for comments on the
code, but from my read it appears to be a consistent approach for each version.

Jonathan



signature.asc
Description: Message signed with OpenPGP

Re: Making "COPY partitioned_table FROM" faster

2018-07-30 Thread David Rowley

On 30 July 2018 at 20:33, Peter Eisentraut
 wrote:
> Two more thoughts:
>
> - Add some tests.  The if (nBufferedTuples > 0) that flushes the tuples
> when the partition changes is not currently exercised.

That seems like a good idea. In fact, it uncovered a bug around
ConvertPartitionTupleSlot() freeing the previously stored tuple out
the slot which resulted in a crash. I didn't notice before because my
test had previously not required any tuple conversions.

While ensuring my test was working correctly I noticed that I had a
thinko in the logic that decided if another avgTuplesPerPartChange
calculation was due.  The problem was that the (cstate->cur_lineno -
RECHECK_MULTI_INSERT_THRESHOLD) is unsigned and when cur_lineno is
below RECHECK_MULTI_INSERT_THRESHOLD it results in an underflow which
makes the if test always true until cur_lineno gets up to 1000. I
considered just making all those counters signed, but chickened out
when it came to changing "processed". I thought it was quite strange
to have "processed" unsigned and the other counters that do similar
things signed.  Of course, signed would have enough range, but it
would mean changing the return type of CopyFrom() which I wasn't
willing to do.

In the end, I just protected the if test so that it only calculates
the average again if cur_lineno is at least
RECHECK_MULTI_INSERT_THRESHOLD. This slightly changes when
avgTuplesPerPartChange is first set, but it'll still be at least 1000
tuples before we start doing multi-inserts. Another option would be to
cast the expression as int64... I'm a bit undecided what's best here.

> - With proute becoming a function-level variable,
> cstate->partition_tuple_routing is obsolete and could be removed.  (No
> point in saving this in cstate if it's only used in one function anyway.)

Good idea. I've removed that.

I've attached a delta patch based on the v4 patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v5_multiinsert_copy_delta.patch
Description: Binary data

Re: partition tree inspection functions

2018-07-30 Thread Jesper Pedersen


Hi Amit,

On 07/30/2018 05:21 AM, Amit Langote wrote:

As 0 is a valid return value for root nodes I think we should use -1
instead for these cases.


Makes sense, changed to be that way.



Looks good, the documentation for pg_partition_level could be expanded 
to describe the -1 scenario.


New status: Ready for Committer

Best regards,
 Jesper

Re: Usability fail with psql's \dp command

2018-07-30 Thread Fabien COELHO




Hello Kyotaro-san,


Note that 'No privileges' could be somehow interpreted as "default
privileges" (no "special/given" privileges) or as "no permissions at
all", so there is still some ambiguity, at least for me.


FWIW "No privileges" seems to me as "The user cannot access it at
all" with no ambiguity.


Ok. For me '' and 'No privileges' still looks like they mean the same, 
whereas the point of the patch is to solve the ambiguity.



Currently the behavior is documented here. (This needs to be
edited.)

https://www.postgresql.org/docs/10/static/sql-grant.html


Sure, but user friendlyness would suggest that the output should not be 
misleading from the start.



So it changes the existing documented behavior.


Sure. The behavior is misleading, and documentation is of little help in 
such a case.



\dp is a convenient shortcut for users so the output should be
intuitive or easy-to-grasp.


Yes!


[...]

 relacl | Access privileges
+--
 (null) | '(default)'
 {}  | '(no privilege)'


This suggestion is better as it avoids the "empty/no" ambiguity. It breaks 
the documented behavior, but I'm fine with that anyway.


The human I am now has to know what "default" permissions are depending on 
the kind of object, where it could have said it to me directly. Moreover, 
the line is not self-contained because the default permission depends on 
the owner, but "\dp" does not tell who the owner is, which is another 
annoyance.


A benefit of your approach is that its coding is easy because it does not 
have to fetch the owner tuple and reconstruct the default perms depending 
on the kind of object.


A cleaner approach would be to have a NOT NULL column and have the default 
always explicit, instead of having a lazy instantiation of the field 
managed on GRANT/REVOKE but not on initialization. However this is 
probably too big a change for the problem at hand, and it would not solve 
the ambiguity issue for previous versions.


--
Fabien.

Re: Tips on committing

2018-07-30 Thread Peter Eisentraut

On 23/07/2018 05:58, Tom Lane wrote:
> 013f320dc reminds me of something I check for religiously: look for
> alternative output files for any regression test you're updating the
> output of.
> 
> Actually updating said files, once you notice you need to, can be tricky
> in itself.  Most of the time it works to just apply the same patch to the
> other variants as the delta you're observing for the output file that's
> relevant to your own platform.  Or you may be able to change configuration
> so as to replicate the correct output for another alternative file.  But
> sometimes you just have to guess (and then watch the buildfarm to see if
> you guessed right).

There is a recipe for doing this using the "merge" command here:
https://wiki.postgresql.org/wiki/Regression_test_authoring#Updating_an_existing_regression_test

YMMV

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full

2018-07-30 Thread Julian Markwort


On 07/19/2018 03:00 AM, Thomas Munro wrote:

Some more comments:

 if (parsedline->auth_method == uaCert)
 {
-   parsedline->clientcert = true;
+   parsedline->clientcert = clientCertOn;
 }

The "cert" method is technically redundant with this patch, because
it's equivalent to "trust clientcert=verify-full", right?  That gave
me an idea: wouldn't the control flow be a bit nicer if you just
treated uaCert the same as uaTrust in the big switch statement, but
also enforced that clientcert = clientCertFull in the hunk above?
Then you could just have one common code path to reach CheckCertAuth()
for all auth methods after that switch statement, instead of the more
complicated conditional coding you have now with two ways to reach it.
Would that be better?
That would result in a couple less LOC and a bit clearer conditionals, I 
agree.
If there are no objections to make uaCert a quasi-alias of uaTrust with 
clientcert=verify-full, I'll go ahead and change the code accordingly.
I'll make sure that the error messages are still handled based on the 
auth method and not only depending on the clientcert flag.



In the "auth-cert" section there are already a couple of examples
using lower case "cn":

 will be sent to the client.  The cn (Common Name)

 is a check that the cn attribute matches the database

I guess you should do the same, or if there is a good reason to use
upper case "CN" instead then you should change those to match.
I see that there are currently no places that use CN  
right now.
However, we refer to Certification Authority as CA in most cases 
(without the literal tag surrounding it).
The different fields of certificates seem to be abbreviated with capital 
letters in most literature; That was my reasoning for capitalizing it in 
the first place.
I'm open for suggestions, but in absence of objections I might just 
capitalize all occurrences of CN.



There is still a place in the documentation where we refer to "1":

 In a pg_hba.conf record specifying certificate
 authentication, the authentication option clientcert is
 assumed to be 1, and it cannot be turned off
since a client
 certificate is necessary for this method.  What the cert
 method adds to the basic clientcert certificate
validity test
 is a check that the cn attribute matches the database
 user name.

Maybe we should use "verify-ca" here, though as I noted above I think
it should really "verify-full" and we should simplify the flow.

Yes, we should adopt the new style in all places.
I'll rewrite that passage to indicate that cert authentication 
essentially results in the same behaviour as clientcert=verify-full.
The existing text is somewhat ambiguous anyway, in case somebody decides 
to skip over the restriction described in the second sentence.



I think we should also document that "1" and "0" are older spellings
still accepted, where the setting names are introduced.

+typedef enum ClientCertMode
+{
+   clientCertOff,
+   clientCertOn,
+   clientCertFull
+} ClientCertMode;
+

+1

What do you think about using clientCertCA for the enumerator name
instead of clientCertOn?  That would correspond better to the names
"verify-ca" and "verify-full".


+1
I'm not sure if Magnus had any other cases in mind when he named it 
clientCertOn?



Should I mark this entry as "Needs review" in the commitfest? It seems 
some discussion is still needed to get this commited...


kind regards
Julian

Re: ssl_library parameter

2018-07-30 Thread Peter Eisentraut

On 26/06/2018 11:49, Daniel Gustafsson wrote:
>> Extracted from the GnuTLS thread/patch, here is a patch to add a
>> server-side read-only parameter ssl_library, which currently reports
>> either 'OpenSSL' or an empty string, depending on what SSL library was
>> built with.  This is analogous to the libpq function call
>> PQsslAttribute(conn, "library"), but there was no equivalent
>> functionality on the server side.
> 
> Looks good to me, +1

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Make deparsing of column defaults faster

2018-07-30 Thread Jeff Janes

On Mon, Jul 30, 2018 at 7:03 AM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 07/07/2018 20:07, Jeff Janes wrote:
> > One case that your patch doesn't improve (neither does my posted one) is
> > check constraints.  To fix that, pg_get_constraintdef_worker would also
> > need to grow a cache as well.  I don't know how often people put check
> > constraints on most of the columns of a table.  Some people like their
> > NOT NULL constraints to be named, not implicit.
> >
> > But from the bigger picture of making pg_upgrade faster, a major issue
> > is that while pg_dump -s gets faster for the column default case, the
> > restore of that dump is still slow (again, my posted patch also doesn't
> > fix that).  In that case it is deparse_context_for called from
> > StoreAttrDefault which is slow.
>
> Any thoughts on how to proceed here?  It seems there is more work to do
> to cover all the issues with dumping and restoring tables with many
> columns.  Since the original report was in the context of pg_upgrade, we
> should surely address at least the pg_restore slowness.
>
>
I'll working on solving the problem using a hash table at the lowest level
(making column names unique), for a future commit fest.  That should drop
it from N^3 to N^2, which since N can't go above 1600 should be good enough.

So we can set this to rejected, as that will be an entirely different
approach.

Your caching patch might be worthwhile on its own, though.

Cheers,

Jeff

Re: Make deparsing of column defaults faster

2018-07-30 Thread Peter Eisentraut

On 07/07/2018 20:07, Jeff Janes wrote:
> One case that your patch doesn't improve (neither does my posted one) is
> check constraints.  To fix that, pg_get_constraintdef_worker would also
> need to grow a cache as well.  I don't know how often people put check
> constraints on most of the columns of a table.  Some people like their
> NOT NULL constraints to be named, not implicit.
> 
> But from the bigger picture of making pg_upgrade faster, a major issue
> is that while pg_dump -s gets faster for the column default case, the
> restore of that dump is still slow (again, my posted patch also doesn't
> fix that).  In that case it is deparse_context_for called from
> StoreAttrDefault which is slow.

Any thoughts on how to proceed here?  It seems there is more work to do
to cover all the issues with dumping and restoring tables with many
columns.  Since the original report was in the context of pg_upgrade, we
should surely address at least the pg_restore slowness.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Improve geometric types

2018-07-30 Thread Tomas Vondra





On 07/30/2018 11:57 AM, Emre Hasegeli wrote:

OK, thanks for confirming. I'll get it committed and we'll see what the
animals think soon.


Thank you for fixing this.  I wanted to preserve this code but wasn't
sure about the correct place or whether it is still necessary.

There are more places we produce -0.  The regression tests have
alternative results to cover them.  I have the "float-zero" patch for
this.  Although I am not sure if it is a correct fix.  I think we
should find the correct fix, and apply it globally to floating point
operations.  This can be only enabled for platforms which produce -0,
so the others don't have to pay the price.



Hmmm. It'll be difficult to review such patch without access to a 
platform exhibiting such behavior ... IIRC IBM offers free access to 
open-source devs, I wonder if that would be a way.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Fix for documentation of Covering Indexes

2018-07-30 Thread Liudmila Mantrova


On 04/18/2018 12:52 PM, Heikki Linnakangas wrote:

On 11/04/18 04:20, Michael Paquier wrote:

Hi all,

The documentation of covering indexes is incorrect for CREATE and ALTER
TABLE:
- ALTER TABLE's page is missing the call.
- Exclusion constraints can use INCLUDE clauses.

In order to simplify the documentation, please let me suggest the
attached which moves the definition of the INCLUDE clause into the
section index_parameters, which is compatible with what I read from the
parser.


Committed, thanks!

- Heikki

Following this change, I believe we need to modify UNIQUE and PRIMARY 
KEY descriptions in CREATE TABLE as they still mention INCLUDE but not 
the other index_parameters. The attached patch fixes this inconsistency, 
as well as adds a separate paragraph for INCLUDE in CREATE TABLE to 
clarify its purpose and avoid repetition in constraint descriptions. It 
also reorders the paragraphs in CREATE INDEX to follow the syntax.


--
Liudmila Mantrova
Technical writer at Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 3c1223b..c67f187 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -145,52 +145,6 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
 
  
-  INCLUDE
-  
-   
-The optional INCLUDE clause specifies a
-list of columns which will be included in the index
-as non-key columns.  A non-key column cannot
-be used in an index scan search qualification, and it is disregarded
-for purposes of any uniqueness or exclusion constraint enforced by
-the index.  However, an index-only scan can return the contents of
-non-key columns without having to visit the index's table, since
-they are available directly from the index entry.  Thus, addition of
-non-key columns allows index-only scans to be used for queries that
-otherwise could not use them.
-   
-
-   
-It's wise to be conservative about adding non-key columns to an
-index, especially wide columns.  If an index tuple exceeds the
-maximum size allowed for the index type, data insertion will fail.
-In any case, non-key columns duplicate data from the index's table
-and bloat the size of the index, thus potentially slowing searches.
-   
-
-   
-Columns listed in the INCLUDE clause don't need
-appropriate operator classes; the clause can include
-columns whose data types don't have operator classes defined for
-a given access method.
-   
-
-   
-Expressions are not supported as included columns since they cannot be
-used in index-only scans.
-   
-
-   
-Currently, only the B-tree index access method supports this feature.
-In B-tree indexes, the values of columns listed in the
-INCLUDE clause are included in leaf tuples which
-correspond to heap tuples, but are not included in upper-level
-index entries used for tree navigation.
-   
-  
- 
-
- 
   name
   

@@ -317,6 +271,52 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
 
  
+  INCLUDE
+  
+   
+The optional INCLUDE clause specifies a
+list of columns which will be included in the index
+as non-key columns.  A non-key column cannot
+be used in an index scan search qualification, and it is disregarded
+for purposes of any uniqueness or exclusion constraint enforced by
+the index.  However, an index-only scan can return the contents of
+non-key columns without having to visit the index's table, since
+they are available directly from the index entry.  Thus, addition of
+non-key columns allows index-only scans to be used for queries that
+otherwise could not use them.
+   
+
+   
+It's wise to be conservative about adding non-key columns to an
+index, especially wide columns.  If an index tuple exceeds the
+maximum size allowed for the index type, data insertion will fail.
+In any case, non-key columns duplicate data from the index's table
+and bloat the size of the index, thus potentially slowing searches.
+   
+
+   
+Columns listed in the INCLUDE clause don't need
+appropriate operator classes; the clause can include
+columns whose data types don't have operator classes defined for
+a given access method.
+   
+
+   
+Expressions are not supported as included columns since they cannot be
+used in index-only scans.
+   
+
+   
+Currently, only the B-tree index access method supports this feature.
+In B-tree indexes, the values of columns listed in the
+INCLUDE clause are included in leaf t

Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack

2018-07-30 Thread Michael Paquier

On Mon, Jul 30, 2018 at 05:53:54PM +0900, Kyotaro HORIGUCHI wrote:
> I feel that just being a database owner doesn't justify to cause
> this problem innocently. Catalog owner is also doubious but we
> can carefully configure the ownerships to avoid the problem since
> only superuser can change it. So I vote +1 for the revised patch.

Thanks for the review.  Yes that sucks that being just a database or a
schema owner allows such a user to take an exclusive lock on shared
catalogs.  Please note that depending on the order of the relations,
authentication may or may not be blocked depending on what kind of locks
the second session takes.

> | Parameters
> ...
> | SYSTEM
> |   Recreate all indexes on system catalogs within the current
> |   database. *Indexes on shared system catalogs are included*.
> |   Indexes on user tables are not processed. This form
> |   of REINDEX cannot be executed inside a transaction block.

This looks correct to me, shared catalogs are included, and the "notes"
section clealy mentions that being an owner of the shared catalog is
required.

> This apparently changes the documented behavior and the problem
> seems to be a result of a rather malicious/intentional
> combination of operatoins (especially named vacuum on a shared
> catalog). I vote -0.5 to backpatch unless we categorize this as a
> security issue.

Ask that to any vendors doing shared hosting of Postgres :)
A backpatch looks like the correct course of events to me.  Anybody here
is free to express his/her concerns of course.
--
Michael


signature.asc
Description: PGP signature

Re: [PATCH] Include application_name in "connection authorized" log message

2018-07-30 Thread Peter Eisentraut

On 13/07/2018 20:20, Don Seiler wrote:
> See attached for latest revision.

This doesn't compile with SSL enabled because there is a comma missing.

This implementation doesn't run the application_name through
check_application_name(), so it could end up logging application_name
values that are otherwise not acceptable.  I'm not sure of the best way
to fix that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Improve geometric types

2018-07-30 Thread Emre Hasegeli

> OK, thanks for confirming. I'll get it committed and we'll see what the
> animals think soon.

Thank you for fixing this.  I wanted to preserve this code but wasn't
sure about the correct place or whether it is still necessary.

There are more places we produce -0.  The regression tests have
alternative results to cover them.  I have the "float-zero" patch for
this.  Although I am not sure if it is a correct fix.  I think we
should find the correct fix, and apply it globally to floating point
operations.  This can be only enabled for platforms which produce -0,
so the others don't have to pay the price.

Re: patch to allow disable of WAL recycling

2018-07-30 Thread Kyotaro HORIGUCHI

At Mon, 30 Jul 2018 10:43:20 +0200, Peter Eisentraut 
 wrote in 

> On 19/07/2018 05:59, Kyotaro HORIGUCHI wrote:
> > My result is that we cannot disable recycling perfectly just by
> > setting min/max_wal_size.
> 
> Maybe the behavior of min_wal_size should be rethought?  Elsewhere in
> this thread, there was also a complaint that max_wal_size isn't actually
> a max.  It seems like there might be some interest in making these
> settings more accurate.
> 
> I mean, what is the point of the min_wal_size setting if not controlling
> this very thing?

Sorry, I have forgotten to mention it.

The definition of the variable is "We won't reduce segments to no
less than this segments (but in MB) even if we don't need such
many segments until the next checkpoint". I couldn't guess a
proper value for it to indicate the behaior that "I don't want to
keep (recycle) preallocated segments even for expected checkpint
interval.".  In short, since I thought that it's not intuitive at
that time.

Reconsidering the candidate values:

0 seems to keep segments for the next checkpoit interval.

-1 seems that it just disables segment reduction (this is the
same with setting the same value with max_wal_size?)

Maybe we could -1 for this purpose.

guc.c
| {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
| gettext_noop("Sets the minimum size to shrink the WAL to."),
+ gettext_noop("-1 turns off WAL recycling."),

# This seems somewhat.. out-of-the-blue?

wal-configuraiton.html

| The number of WAL segment files in pg_wal directory depends on
| min_wal_size, max_wal_size and the amount of WAL generated in
| previous checkpoint cycles. When old log segment files are no
| longer needed, they are removed or recycled (that is, renamed
| to become future segments in the numbered sequence). If, due to
...
| extent. min_wal_size puts a minimum on the amount of WAL files
| recycled for future usage; that much WAL is always recycled for
| future use, even if the system is idle and the WAL usage
| estimate suggests that little WAL is needed.
+ If you don't need the recycling feature, setting -1 to
+ min_wal_size disables the feature and WAL files are created on
+ demand.

# I'm not sure this makes sense for readers.

Besides the above, I supppose that this also turns off
preallcoation of a whole segment at the first use, which could
cause problems here and there...

If we allowed a string value like 'no-prealloc' for min_wal_size,
it might be comprehensive?

# Sorry for the scattered thoughts

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [PATCH] Improve geometric types

2018-07-30 Thread Emre Hasegeli

> This should fix it I guess, and it's how we deal with unused return
> values elsewhere. I've considered using USE_ASSERT_CHECKING here, but it
> seems rather ugly with that. I'll wait for Emre's opinion ...

Assert() is the wrong thing to do in here.  Drawn-perpendicular lines
may not intersect because of precision loss.  We have to check it and
return NULL.  There a few of those that we crash, or return garbage,
or get NULL and fail in DirectFunctionCall()s.  The next patch
"line-fixes" fixes them.

Re: partition tree inspection functions

2018-07-30 Thread Amit Langote

Hi,

On 2018/07/27 21:21, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 07/26/2018 10:33 PM, Amit Langote wrote:
>> Optional parameter sounds good, so made it get_partition_level(regclass [
>> , regclass ]) in the updated patch.  Although, adding that argument is not
>> without possible surprises its result might evoke.  Like, what happens if
>> you try to find the level of the root table by passing a leaf partition
>> oid for the root table argument, or pass a totally unrelated table for the
>> root table argument.  For now, I've made the function return 0 for such
>> cases.
>>
> 
> As 0 is a valid return value for root nodes I think we should use -1
> instead for these cases.

Makes sense, changed to be that way.

Thanks,
Amit
>From bf12e5a4ee378603dec0216201c8234c96d7a973 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 16 Jan 2018 19:02:13 +0900
Subject: [PATCH v7] Add assorted partition reporting functions

---
 doc/src/sgml/func.sgml   |  86 ++
 src/backend/catalog/partition.c  | 244 ++-
 src/backend/utils/cache/lsyscache.c  |  22 +++
 src/include/catalog/pg_proc.dat  |  48 ++
 src/include/utils/lsyscache.h|   1 +
 src/test/regress/expected/partition_info.out | 204 ++
 src/test/regress/parallel_schedule   |   2 +-
 src/test/regress/serial_schedule |   1 +
 src/test/regress/sql/partition_info.sql  |  91 ++
 9 files changed, 694 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regress/expected/partition_info.out
 create mode 100644 src/test/regress/sql/partition_info.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..e1b7ace898 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19995,6 +19995,92 @@ postgres=# SELECT * FROM 
pg_walfile_name_offset(pg_stop_backup());
 The function returns the number of new collation objects it created.

 
+   
+Partitioning Information Functions
+
+ 
+  Name Return Type 
Description
+ 
+
+  
+  
+   
pg_partition_parent(regclass)
+   regclass
+   get parent if table is a partition, NULL 
otherwise
+  
+  
+   
pg_partition_root_parent(regclass)
+   regclass
+   get topmost parent of a partition within partition tree
+  
+  
+   pg_partition_level(regclass, 
regclass)
+   regclass
+   get level of a partition within partition tree with respect to 
given parent
+  
+  
+   
pg_partition_level(regclass)
+   regclass
+   get level of a partition within partition tree with respect to 
topmost root parent
+  
+  
+   pg_partition_children(regclass, 
bool)
+   setof regclass
+   
+get partitions of a table; only immediate partitions are returned,
+unless all tables in the partition tree, including itself and
+partitions of lower levels, are requested by passing
+true for second argument
+   
+  
+  
+   
pg_partition_children(regclass)
+   setof regclass
+   Shorthand for pg_partition_children(..., 
false)
+  
+  
+   
pg_partition_leaf_children(regclass)
+   setof regclass
+   get all leaf partitions of a given table
+  
+ 
+
+   
+
+   
+If the table passed to pg_partition_root_parent is not
+a partition, the same table is returned as the result.
+   
+
+   
+For example, to check the total size of the data contained in
+measurement table described in
+, use the following
+query:
+   
+
+
+select pg_size_pretty(sum(pg_relation_size(p))) as total_size from 
pg_partition_children('measurement', true) p;
+ total_size 
+
+ 24 kB
+(1 row)
+
+
+   
+One could have used pg_partition_leaf_children in
+this case and got the same result as shown below:
+   
+
+
+select pg_size_pretty(sum(pg_relation_size(p))) as total_size from 
pg_partition_leaf_children('measurement') p;
+ total_size 
+
+ 24 kB
+(1 row)
+
+
+
   
 
   
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 558022647c..d90aaf8bb8 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -23,13 +23,16 @@
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_partitioned_table.h"
+#include "funcapi.h"
 #include "nodes/makefuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/prep.h"
 #include "optimizer/var.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/partcache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -38,16 +41,14 @@
 static Oid get_partition_parent_worker(Relation inhRel, Oid relid);
 static void get_partition_ancestors_worker(Relation inhRel, Oid relid,

Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack

2018-07-30 Thread Kyotaro HORIGUCHI

At Mon, 30 Jul 2018 09:34:22 +0900, Michael Paquier  wrote 
in <20180730003422.ga2...@paquier.xyz>
> On Sun, Jul 29, 2018 at 04:11:38PM +, Bossart, Nathan wrote:
> > On 7/27/18, 7:10 PM, "Michael Paquier"  wrote:
> > This is added to ReindexMultipleTables(), which is used for REINDEX
> > SCHEMA, REINDEX SYSTEM, and REINDEX DATABASE.  Presently, REINDEX
> > SCHEMA requires being the owner of the schema, and REINDEX SYSTEM and
> > REINDEX DATABASE require being the owner of the database.  So, if a
> > role is an owner of a database or the pg_catalog schema, they are able
> > to reindex shared catalogs like pg_authid.
> 
> Yeah, I was testing that yesterday night and bumped on this case when
> trying do a REINDEX SCHEMA pg_class.  The point is that you can simplify
> the check and remove pg_database_ownercheck as there is already an ACL
> check on the database/system/schema at the top of the routine, so you
> are already sure that pg_database_ownercheck() or
> pg_namespace_ownercheck would return true.  This shaves a few cycles as
> well for each relation checked.

There's a case where the database owner differs from catalogs'
owner. ALTER DATABASE OWNER TO can make such configuration.
In this case, the previous code allows REINDEX of a shared
catalog for the database owner who is not the catalog's owner.

Superuser  : alice
Database owner : joe
Catalog owner  : andy
Schema owner   : joe

"REINDEX SCHERMA " by joe allows shared catalog to be
reindexed, even he is *not* the owner of the catalog.

The revised patch behaves differently to this case. "REINDEX
SCHERMA " by joe *skips* shared catalog since he is
not the owner of the catalog.

# Uh? Alice doesn't involved anywhere..

I feel that just being a database owner doesn't justify to cause
this problem innocently. Catalog owner is also doubious but we
can carefully configure the ownerships to avoid the problem since
only superuser can change it. So I vote +1 for the revised patch.

> > I also noticed that this patch causes shared relations to be skipped
> > silently.  Perhaps we should emit a WARNING or DEBUG message when this
> > happens, at least for REINDEXOPT_VERBOSE.
> 
> That's intentional.  I thought about that as well, but I am hesitant to
> do so as we don't bother mentioning the other relations skipped.
> REINDEX VERBOSE also shows up what are the tables processed, so it is
> easy to guess what are the tables skipped, still more painful.  And the
> documentation changes added cover the gap.

Even if it is written in the "Notes" section, doesn't the
following need some fix? It is the same for the DATBASE item.

| Parameters
...
| SYSTEM
|   Recreate all indexes on system catalogs within the current
|   database. *Indexes on shared system catalogs are included*.
|   Indexes on user tables are not processed. This form
|   of REINDEX cannot be executed inside a transaction block.

> > I noticed that there is no mention that the owner of a schema can do
> > REINDEX SCHEMA, which seems important to note.  Also, the proposed
> > wording might seem slightly ambiguous for the REINDEX DATABASE case.
> > It might be clearer to say something like the following:
> > 
> > Reindexing a single index or table requires being the owner of
> > that index of table.  REINDEX DATABASE and REINDEX SYSTEM
> > require being the owner of the database, and REINDEX SCHEMA
> > requires being the owner of the schema (note that the user can
> > therefore rebuild indexes of tables owned by other users).
> > Reindexing a shared catalog requires being the owner of the
> > shared catalog, even if the user is the owner of the specified
> > database or schema.  Of course, superusers can always reindex
> > anything.
> 
> +1, I have included your suggestions.  The patch attached applies easily
> down to 9.5 where REINDEX SCHEMA was added.  For 9.4 and 9.3, there is
> no schema case, still the new check is similar.  The commit message is
> slightly changed so as there is no mention of REINDEX SCHEMA.  9.3 needs
> a slight change compared to 9.4 as well.
> 
> So attached are patches for 9.5~master, 9.4 and 9.3 with commit
> messages.  Does that look fine to folks of this thread?

This apparently changes the documented behavior and the problem
seems to be a result of a rather malicious/intentional
combination of operatoins (especially named vacuum on a shared
catalog). I vote -0.5 to backpatch unless we categorize this as a
security issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: patch to allow disable of WAL recycling

2018-07-30 Thread Peter Eisentraut

On 19/07/2018 05:59, Kyotaro HORIGUCHI wrote:
> My result is that we cannot disable recycling perfectly just by
> setting min/max_wal_size.

Maybe the behavior of min_wal_size should be rethought?  Elsewhere in
this thread, there was also a complaint that max_wal_size isn't actually
a max.  It seems like there might be some interest in making these
settings more accurate.

I mean, what is the point of the min_wal_size setting if not controlling
this very thing?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Making "COPY partitioned_table FROM" faster

2018-07-30 Thread Amit Langote

On 2018/07/30 17:33, Peter Eisentraut wrote:
> - With proute becoming a function-level variable,
> cstate->partition_tuple_routing is obsolete and could be removed.  (No
> point in saving this in cstate if it's only used in one function anyway.)

+1.  Also seems to apply to transition_capture, which afaict, was added in
cstate only because partition_tuple_routing is there.

Thanks,
Amit

Re: Making "COPY partitioned_table FROM" faster

2018-07-30 Thread Peter Eisentraut

Two more thoughts:

- Add some tests.  The if (nBufferedTuples > 0) that flushes the tuples
when the partition changes is not currently exercised.

- With proute becoming a function-level variable,
cstate->partition_tuple_routing is obsolete and could be removed.  (No
point in saving this in cstate if it's only used in one function anyway.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-07-30 Thread Amit Langote

On 2018/07/28 10:54, David Rowley wrote:
> On 27 July 2018 at 19:11, Amit Langote  wrote:
>> I've attached a delta patch to make the above changes.  I'm leaving the
>> hash table rename up to you though.
> 
> Thanks for the delta patch. I took all of it, just rewrote a comment slightly.
> 
> I also renamed the hash table to your suggestion and changed a few more 
> things.
> 
> Attached a delta based on v2 and the full v3 patch.
> 
> This includes another small change to make
> PartitionDispatchData->indexes an array that's allocated in the same
> memory as the PartitionDispatchData. This will save a palloc() call
> and also should be a bit more cache friendly.
> 
> This also required a rebase on master again due to 3e32109049.

Thanks for the updated patch.

I couldn't find much to complain about in the latest v3, except I noticed
a few instances of the word "setup" where I think what's really meant is
"set up".

+ * must be setup, but any sub-partitioned tables can be setup lazily as

+ * A ResultRelInfo has not been setup for this partition yet,


By the way, when going over the updated code, I noticed that the code
around child_parent_tupconv_maps could use some refactoring too.
Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates
child-to-parent map array needed for transition tuple capture even if not
needed by any of the leaf partitions.  I'm attaching here a patch that
applies on top of your v3 to show what I'm thinking we could do.

Thanks,
Amit
From 6ce1654aa929c7f8112c430914af7f464474ed31 Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 30 Jul 2018 14:05:17 +0900
Subject: [PATCH 2/2] Some refactoring around child_parent_tupconv_maps

Just like parent_child_tupconv_maps, we should allocate it only if
needed.  Also, if none of the partitions ended up needing a map, we
should not have allocated the child_parent_tupconv_maps array, only
the child_parent_map_not_required one.  So, get rid of
ExecSetupChildParentMapForLeaf(), which currently does initial,
possibly useless, allocation of both of the above mentioned arrays.
Instead, have TupConvMapForLeaf() allocate the needed array, just
like ExecInitRoutingInfo does, when it needs to store a
parent-to-child map.

Finally, rename the function TupConvMapForLeaf to
LeafToParentTupConvMapForTC for clarity; TC stands for "Transition
Capture".
---
 src/backend/commands/copy.c|  19 +-
 src/backend/executor/execPartition.c   | 102 -
 src/backend/executor/nodeModifyTable.c |   4 +-
 src/include/executor/execPartition.h   |  12 ++--
 4 files changed, 59 insertions(+), 78 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6fc1e2b41c..6d0e9229e0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2503,22 +2503,9 @@ CopyFrom(CopyState cstate)
 * CopyFrom tuple routing.
 */
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-   {
-   PartitionTupleRouting *proute;
-
-   proute = cstate->partition_tuple_routing =
+   cstate->partition_tuple_routing =
ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-   /*
-* If we are capturing transition tuples, they may need to be
-* converted from partition format back to partitioned table 
format
-* (this is only ever necessary if a BEFORE trigger modifies the
-* tuple).
-*/
-   if (cstate->transition_capture != NULL)
-   ExecSetupChildParentMapForLeaf(proute);
-   }
-
/*
 * It's more efficient to prepare a bunch of tuples for insertion, and
 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2666,8 +2653,8 @@ CopyFrom(CopyState cstate)
 */

cstate->transition_capture->tcs_original_insert_tuple = NULL;
cstate->transition_capture->tcs_map =
-   TupConvMapForLeaf(proute, 
saved_resultRelInfo,
-   
  leaf_part_index);
+   
LeafToParentTupConvMapForTC(proute, saved_resultRelInfo,
+   
leaf_part_index);
}
else
{
diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 2a18a30b3e..d183e8b758 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -400,7 +400,7 @@ ExecExpandRoutingArrays(PartitionTupleRouting *proute)
   sizeof(TupleConversi

Re: adding tab completions

2018-07-30 Thread Arthur Zakirov

On Sun, Jul 29, 2018 at 07:42:43PM -0500, Justin Pryzby wrote:
> Your suggestion is good, so attached updated patch.

The patch is in good shape. It compiles without errors. The patch
doesn't need in documentation.

I marked the patch as "Ready for Commiter".

> > > Actually..another thought: since toast tables may be VACUUM-ed, should I
> > > introduce Query_for_list_of_tpmt ?
> 
> I didn't include this one yet though.
> 
> Feel free to bump to next CF.

I think it could be done by a separate patch.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-07-30 Thread Michael Paquier

On Fri, Jul 27, 2018 at 08:27:26AM +, Tsunakawa, Takayuki wrote:
> I don't have a strong opinion, but I wonder which of namespace.c or
> autovacuum.c is suitable, because isTempNamespaceInUse is specific to
> autovacuum.

I think that there is also a point in allowing other backends to use it
as well, so I left it in namespace.c.  I have been doing more testing
with this patch today.  In order to catch code paths where this is
triggered I have for example added an infinite loop in do_autovacuum
when finding a temp table which exits once a given on-disk file is
found.  This lets plenty of time to attach autovacuum to a debugger, and
play with other sessions in parallel, so I have checked the
transactional assumption this patch relied on, and tested down to v10 as
that's where removal of orphaned temp relations has been made more
aggressive. This patch can also be applied cleanly there.

The check on MyDatabaseId does not actually matter much as the same
temporary namespace OID would get reused only after an OID wraparound
with a backend using a different database but I left it anyway as that's
more consistent to me to check for database and namespace, and added a
sanity check to make sure that the routine gets called only by a process
connected to a database.

I am going to be out of town for a couple of days, and the next minor
release planned is close by, so I am not pushing that today, but I'd
like to do so after the next minor release if there are no objections.
Attached is a patch with a proposal of commit message.
--
Michael
From 503868a13ff3c9cf41a3f3295977bd7c41f4ab6f Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 30 Jul 2018 16:54:27 +0900
Subject: [PATCH] Make autovacuum more aggressive to remove orphaned temp
 tables

Commitdafa084, added in 10, made the removal of temporary orphaned
tables more aggressive.  This commit makes an extra step in the
aggressiveness by adding a flag in each backend's MyProc which tracks
down any namespace currently in use.  The flag is set when the namespace
gets created and can be reset if the temporary namespace has been
created in a transaction or sub-transaction which got aborted.  This
flag can be updated in a lock-less fashion, as autovacuum workers would
scan pg_class without seeing the temporary namespace or any temporary
relations on the transaction creating it gets committed.

The base idea of this patch comes from Robert Haas.

Author: Tsunakawa Takayuki
Reviewed-by: Michael Paquier, Kyotaro Horiguchi
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05
Backpatch: 10-
---
 src/backend/catalog/namespace.c | 65 -
 src/backend/postmaster/autovacuum.c | 19 -
 src/backend/storage/lmgr/proc.c |  2 +
 src/include/catalog/namespace.h |  1 +
 src/include/storage/proc.h  |  3 ++
 5 files changed, 77 insertions(+), 13 deletions(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 0f67a122ed..b0fbb0b010 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -47,7 +47,7 @@
 #include "parser/parse_func.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
-#include "storage/sinval.h"
+#include "storage/sinvaladt.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/catcache.h"
@@ -3204,6 +3204,45 @@ isOtherTempNamespace(Oid namespaceId)
 	return isAnyTempNamespace(namespaceId);
 }

+/*
+ * isTempNamespaceInUse - is the given namespace owned and actively used
+ * by a backend?
+ *
+ * Note: this can be used while scanning relations in pg_class to detect
+ * orphaned temporary tables or namespaces with a backend connected to a
+ * given database.
+ */
+bool
+isTempNamespaceInUse(Oid namespaceId)
+{
+	PGPROC	   *proc;
+	int			backendId;
+
+	Assert(OidIsValid(MyDatabaseId));
+
+	backendId = GetTempNamespaceBackendId(namespaceId);
+
+	if (backendId == InvalidBackendId ||
+		backendId == MyBackendId)
+		return false;
+
+	/* Is the backend alive? */
+	proc = BackendIdGetProc(backendId);
+	if (proc == NULL)
+		return false;
+
+	/* Is the backend connected to the same database we are looking at? */
+	if (proc->databaseId != MyDatabaseId)
+		return false;
+
+	/* Does the backend own the temporary namespace? */
+	if (proc->tempNamespaceId != namespaceId)
+		return false;
+
+	/* all good to go */
+	return true;
+}
+
 /*
  * GetTempNamespaceBackendId - if the given namespace is a temporary-table
  * namespace (either my own, or another backend's), return the BackendId
@@ -3893,6 +3932,14 @@ InitTempTableNamespace(void)
 	myTempNamespace = namespaceId;
 	myTempToastNamespace = toastspaceId;

+	/*
+	 * Mark MyProc as owning this namespace which other processes can use to
+	 * decide if a temporary namespace is in use or not. Even if this is an
+	 * atomic operation, this can be safely done lock-less as no temporary
+	 * relations associated to it can be seen yet if scanning pg_class.
+	 */
+	M

63 matches

Mail list logo