Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread Amit Langote
On 2018/04/13 14:38, Amit Langote wrote:
> About the specific relation_open(.., NoLock) under question, I think there
> might be a way to address this by opening the tables with the appropriate
> lock mode in partitioned_rels list in ExecLockNonLeafAppendTables

That may have sounded a bit confusing:

I meant to say: "by opening the tables in partitioned_rels list with the
appropriate lock mode in ExecLockNonLeafAppendTables"

> Attached a PoC patch.
There were a couple of unnecessary hunks, which I removed in the attached.

Thanks,
Amit
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 11139f743d..e3b3911945 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1244,7 +1244,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
  * PartitionPruneInfo.
  */
 PartitionPruneState *
-ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
+ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo,
+							 Relation *partitioned_rels)
 {
 	PartitionPruningData *prunedata;
 	PartitionPruneState *prunestate;
@@ -1303,10 +1304,11 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 		pprune->subpart_map = pinfo->subpart_map;
 
 		/*
-		 * Grab some info from the table's relcache; lock was already obtained
-		 * by ExecLockNonLeafAppendTables.
+		 * ExecLockNonLeafAppendTables already opened the relation for us.
 		 */
-		rel = relation_open(pinfo->reloid, NoLock);
+		Assert(partitioned_rels[i] != NULL);
+		rel = partitioned_rels[i];
+		Assert(RelationGetRelid(rel) == pinfo->reloid);
 
 		partkey = RelationGetPartitionKey(rel);
 		partdesc = RelationGetPartitionDesc(rel);
@@ -1336,8 +1338,6 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 		prunestate->execparams = bms_add_members(prunestate->execparams,
 												 pinfo->execparams);
 
-		relation_close(rel, NoLock);
-
 		i++;
 	}
 
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae730..58a0961eb1 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -858,22 +858,49 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 /*
  * ExecLockNonLeafAppendTables
  *
- * Locks, if necessary, the tables indicated by the RT indexes contained in
- * the partitioned_rels list.  These are the non-leaf tables in the partition
- * tree controlled by a given Append or MergeAppend node.
+ * Opens and/or locks, if necessary, the tables indicated by the RT indexes
+ * contained in the partitioned_rels list.  These are the non-leaf tables in
+ * the partition tree controlled by a given Append or MergeAppend node.
  */
 void
-ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
+ExecLockNonLeafAppendTables(PlanState *planstate,
+							EState *estate,
+							List *partitioned_rels)
 {
 	PlannedStmt *stmt = estate->es_plannedstmt;
 	ListCell   *lc;
+	int			i;
 
+	/*
+	 * If we're going to be performing pruning, allocate space for Relation
+	 * pointers to be used later when setting up partition pruning state in
+	 * ExecSetupPartitionPruneState.
+	 */
+	if (IsA(planstate, AppendState))
+	{
+		AppendState *appendstate = (AppendState *) planstate;
+		Append	   *node = (Append *) planstate->plan;
+
+		if (node->part_prune_infos != NIL)
+		{
+			Assert(list_length(node->part_prune_infos) ==
+				   list_length(partitioned_rels));
+			appendstate->partitioned_rels = (Relation *)
+				palloc(sizeof(Relation) *
+					   list_length(node->part_prune_infos));
+			appendstate->num_partitioned_rels =
+				list_length(node->part_prune_infos);
+		}
+	}
+
+	i = 0;
 	foreach(lc, partitioned_rels)
 	{
 		ListCell   *l;
 		Index		rti = lfirst_int(lc);
 		bool		is_result_rel = false;
 		Oid			relid = getrelid(rti, estate->es_range_table);
+		int			lockmode;
 
 		/* If this is a result relation,

Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread Amit Langote
On 2018/04/13 1:57, Robert Haas wrote:
>> It might be possible to do something better in each module by keeping
>> an array indexed by RTI which have each entry NULL initially then on
>> first relation_open set the element in the array to that pointer.
> 
> I'm not sure that makes a lot of sense in the planner, but in the
> executor it might be a good idea.   See also
> https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
> for related discussion.  I think that a coding pattern where we rely
> on relation_open(..., NoLock) is inherently dangerous -- it's too easy
> to be wrong about whether the lock is sure to have been taken.  It
> would be much better to open the relation once and hold onto the
> pointer, not just for performance reasons, but for robustness.

About the specific relation_open(.., NoLock) under question, I think there
might be a way to address this by opening the tables with the appropriate
lock mode in partitioned_rels list in ExecLockNonLeafAppendTables and
keeping the Relation pointers around until ExecEndNode.  Then instead of
ExecSetupPartitionPruneState doing relation_open/close(.., NoLock), it
just reuses the one that's passed by the caller.

Attached a PoC patch.  David, thoughts?

Thanks,
Amit
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 11139f743d..e3b3911945 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1244,7 +1244,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
  * PartitionPruneInfo.
  */
 PartitionPruneState *
-ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
+ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo,
+							 Relation *partitioned_rels)
 {
 	PartitionPruningData *prunedata;
 	PartitionPruneState *prunestate;
@@ -1303,10 +1304,11 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 		pprune->subpart_map = pinfo->subpart_map;
 
 		/*
-		 * Grab some info from the table's relcache; lock was already obtained
-		 * by ExecLockNonLeafAppendTables.
+		 * ExecLockNonLeafAppendTables already opened the relation for us.
 		 */
-		rel = relation_open(pinfo->reloid, NoLock);
+		Assert(partitioned_rels[i] != NULL);
+		rel = partitioned_rels[i];
+		Assert(RelationGetRelid(rel) == pinfo->reloid);
 
 		partkey = RelationGetPartitionKey(rel);
 		partdesc = RelationGetPartitionDesc(rel);
@@ -1336,8 +1338,6 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 		prunestate->execparams = bms_add_members(prunestate->execparams,
 												 pinfo->execparams);
 
-		relation_close(rel, NoLock);
-
 		i++;
 	}
 
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae730..58a0961eb1 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -858,22 +858,49 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 /*
  * ExecLockNonLeafAppendTables
  *
- * Locks, if necessary, the tables indicated by the RT indexes contained in
- * the partitioned_rels list.  These are the non-leaf tables in the partition
- * tree controlled by a given Append or MergeAppend node.
+ * Opens and/or locks, if necessary, the tables indicated by the RT indexes
+ * contained in the partitioned_rels list.  These are the non-leaf tables in
+ * the partition tree controlled by a given Append or MergeAppend node.
  */
 void
-ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
+ExecLockNonLeafAppendTables(PlanState *planstate,
+							EState *estate,
+							List *partitioned_rels)
 {
 	PlannedStmt *stmt = estate->es_plannedstmt;
 	ListCell   *lc;
+	int			i;
 
+	/*
+	 * If we're going to be performing pruning, allocate space for Relation
+	 * pointers to be used later when setting up partition pruning state in
+	 * ExecSetupPartitionPruneState.
+	 */
+	if (IsA(planstate, AppendState))
+	{
+		AppendState *appendstate = (AppendState *) planstate;
+		Append	   *node = (Append *) planstate->plan;
+
+		if (node->part_prune_infos != NIL)
+		{
+			Assert(list_length(node->part_prune_infos) ==
+				   list_length(partitioned_rels));
+			appendstate->partitioned_rels = (Relation *)

Re: MinIndexTupleSize seems slightly wrong

2018-04-12 Thread Kyotaro HORIGUCHI
At Thu, 12 Apr 2018 17:26:33 -0700, Peter Geoghegan wrote:

> itup.h says of MinIndexTupleSize/MaxIndexTuplesPerPage:
> 
> /*
>  * MaxIndexTuplesPerPage is an upper bound on the number of tuples that can
>  * fit on one index page.  An index tuple must have either data or a null
>  * bitmap, so we can safely assume it's at least 1 byte bigger than a bare
>  * IndexTupleData struct.  We arrive at the divisor because each tuple
>  * must be maxaligned, and it must have an associated item pointer.
>  */
> #define MinIndexTupleSize MAXALIGN(sizeof(IndexTupleData) + 1)
> #define MaxIndexTuplesPerPage   \
> ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
> (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
> 
> However, that at least seems questionable to me. See _bt_pgaddtup()
> for a simple example of this -- "minus infinity" items on internal
> pages are sized sizeof(IndexTupleData).
> 
> The code still seems fine to me, since that only happens at most once
> per page. Is it worth noting the exception?

MinIndexTupleSize is not used at all. It does no harm as long as
MaxIndexTuplesPerPage is not less than the maximum number of tuples
that can actually occur.

Applying the values from my environment:

BLCKSZ = 8192
SizeOfPageHeaderData = 24
sizeof(IndexTupleData) = 8
sizeof(ItemIdData) = 4

MaxIndexTuplesPerPage is 408. If we have 407 normal tuples and one
0-attr index tuple, they leave 16 bytes of spare space, in which no
normal tuple fits. In the case where BLCKSZ is 3 * 8192 = 24576 bytes,
MaxIndexTuplesPerPage is 1227. 1226 normal tuples and one 0-attr tuple
use 24556 bytes (including the page header), leaving 20 bytes spare, in
which one extra normal tuple can reside. This can cause an off-by-one
error.
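
For reference, the arithmetic can be checked directly (a sketch in SQL,
assuming MAXALIGN(8 + 1) = 16, sizeof(ItemIdData) = 4 and
SizeOfPageHeaderData = 24, as above):

-- MaxIndexTuplesPerPage for 8kB and 24kB pages, per the quoted formula
SELECT (8192  - 24) / (16 + 4) AS per_page_8k,    -- 408
       (24576 - 24) / (16 + 4) AS per_page_24k;   -- 1227

-- The 24kB case: 1226 normal tuples (20 bytes each with line pointer)
-- plus one 0-attr tuple (8 + 4 = 12 bytes), counting the 24-byte header
SELECT 24 + 1226 * 20 + 12           AS bytes_used,   -- 24556
       24576 - (24 + 1226 * 20 + 12) AS bytes_spare;  -- 20: one more fits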

In reality, every index type has an opaque area at least 8 bytes long,
which prevents pages from holding extra tuples. But that doesn't seem
stable enough to me.

Regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Overcoming SELECT ... FOR UPDATE permission restrictions

2018-04-12 Thread Alexander Lakhin

Hello hackers,

Can you please explain, is this a bug or intended behaviour?

Running as non-privileged user:

postgres=> SELECT datid, datname FROM pg_stat_database FOR UPDATE;
ERROR: permission denied for view pg_stat_database (SQLState: 42501)

But:

postgres=> CREATE VIEW pgsd AS SELECT * FROM pg_stat_database;
CREATE VIEW
postgres=> SELECT datid, datname FROM pgsd FOR UPDATE;
 datid |  datname
-------+-----------
 13021 | postgres
     1 | template1
 13020 | template0
(3 rows)


(And lock is really held by the second SELECT.)
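
A minimal way to see where the two cases diverge (a sketch, assuming the
standard rule that FOR UPDATE requires UPDATE privilege on the relation
whose rows are being locked):

postgres=> SELECT has_table_privilege(current_user, 'pg_stat_database', 'UPDATE');
 has_table_privilege
---------------------
 f
postgres=> SELECT has_table_privilege(current_user, 'pgsd', 'UPDATE');
 has_table_privilege
---------------------
 t
-- The user owns pgsd, so locking its rows is allowed, while the view's
-- own reference to pg_stat_database is checked as a plain SELECT against
-- the view owner, which is how the restriction gets overcome.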

Best regards,

--
Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Make description of heap records more talkative for flags

2018-04-12 Thread Michael Paquier
On Thu, Apr 12, 2018 at 08:49:03PM -0700, Andres Freund wrote:
> OTOH, that also kinda bloats the output noticeably... I'm somewhat
> inclined to just put the hex value or such there?

That would do as well for me.
--
Michael




Re: Make description of heap records more talkative for flags

2018-04-12 Thread Andres Freund
On 2018-04-13 12:47:34 +0900, Michael Paquier wrote:
> Hi all,
> 
> I was just playing with the WAL consistency issue with rows moved across
> partitions when I noticed that heapdesc.c is not really talkative about
> the different flag records set.
> 
> What about something like the patch attached?  I found that useful for
> debugging.
> 
> (One comment of heapam_xlog.h mentions xl_heap_delete instead of
> xl_heap_truncate, noticed it on the way.)

OTOH, that also kinda bloats the output noticeably... I'm somewhat
inclined to just put the hex value or such there?

Greetings,

Andres Freund



Make description of heap records more talkative for flags

2018-04-12 Thread Michael Paquier
Hi all,

I was just playing with the WAL consistency issue with rows moved across
partitions when I noticed that heapdesc.c is not really talkative about
the different flag records set.

What about something like the patch attached?  I found that useful for
debugging.

(One comment of heapam_xlog.h mentions xl_heap_delete instead of
xl_heap_truncate, noticed it on the way.)

Thanks,
--
Michael
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 318a281d7f..f03dbca55b 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -42,12 +42,32 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_heap_insert *xlrec = (xl_heap_insert *) rec;
 
+		if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+			appendStringInfo(buf, "all_visible_cleared ");
+		if (xlrec->flags & XLH_INSERT_LAST_IN_MULTI)
+			appendStringInfo(buf, "last_in_multi ");
+		if (xlrec->flags & XLH_INSERT_IS_SPECULATIVE)
+			appendStringInfo(buf, "is_speculative ");
+		if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)
+			appendStringInfo(buf, "contains_new_tuple ");
+
 		appendStringInfo(buf, "off %u", xlrec->offnum);
 	}
 	else if (info == XLOG_HEAP_DELETE)
 	{
 		xl_heap_delete *xlrec = (xl_heap_delete *) rec;
 
+		if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+			appendStringInfo(buf, "all_visible_cleared ");
+		if (xlrec->flags & XLH_DELETE_CONTAINS_OLD_TUPLE)
+			appendStringInfo(buf, "contains_old_tuple ");
+		if (xlrec->flags & XLH_DELETE_CONTAINS_OLD_KEY)
+			appendStringInfo(buf, "contains_old_key ");
+		if (xlrec->flags & XLH_DELETE_IS_SUPER)
+			appendStringInfo(buf, "is_super ");
+		if (xlrec->flags & XLH_DELETE_IS_PARTITION_MOVE)
+			appendStringInfo(buf, "is_partition_move ");
+
 		appendStringInfo(buf, "off %u ", xlrec->offnum);
 		out_infobits(buf, xlrec->infobits_set);
 	}
@@ -55,6 +75,21 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_heap_update *xlrec = (xl_heap_update *) rec;
 
+		if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+			appendStringInfo(buf, "old_all_visible_cleared ");
+		if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+			appendStringInfo(buf, "new_all_visible_cleared ");
+		if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD_TUPLE)
+			appendStringInfo(buf, "contains_old_tuple ");
+		if (xlrec->flags & XLH_UPDATE_CONTAINS_OLD_KEY)
+			appendStringInfo(buf, "contains_old_key ");
+		if (xlrec->flags & XLH_UPDATE_CONTAINS_NEW_TUPLE)
+			appendStringInfo(buf, "contains_new_tuple ");
+		if (xlrec->flags & XLH_UPDATE_PREFIX_FROM_OLD)
+			appendStringInfo(buf, "prefix_from_old ");
+		if (xlrec->flags & XLH_UPDATE_SUFFIX_FROM_OLD)
+			appendStringInfo(buf, "suffix_from_old ");
+
 		appendStringInfo(buf, "off %u xmax %u ",
 		 xlrec->old_offnum,
 		 xlrec->old_xmax);
@@ -146,6 +181,15 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
 
+		if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+			appendStringInfo(buf, "all_visible_cleared ");
+		if (xlrec->flags & XLH_INSERT_LAST_IN_MULTI)
+			appendStringInfo(buf, "last_in_multi ");
+		if (xlrec->flags & XLH_INSERT_IS_SPECULATIVE)
+			appendStringInfo(buf, "is_speculative ");
+		if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)
+			appendStringInfo(buf, "contains_new_tuple ");
+
 		appendStringInfo(buf, "%d tuples", xlrec->ntuples);
 	}
 	else if (info == XLOG_HEAP2_LOCK_UPDATED)
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index cf88ff7cb4..cc85f4b3c7 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -111,7 +111,7 @@ typedef struct xl_heap_delete
 #define SizeOfHeapDelete	(offsetof(xl_heap_delete, flags) + sizeof(uint8))
 
 /*
- * xl_heap_delete flag values, 8 bits are available.
+ * xl_heap_truncate flag values, 8 bits are available.
  */
 #define XLH_TRUNCATE_CASCADE	(1<<0)
#define XLH_TRUNCATE_RESTART_SEQS	(1<<1)




Re: wal_consistency_checking reports an inconsistency on master branch

2018-04-12 Thread amul sul
Will look into this, thanks.

Regards,
Amul

Sent from a mobile device. Please excuse brevity and tpyos.

On Fri, Apr 13, 2018, 9:06 AM Andres Freund  wrote:

> On 2018-04-13 12:29:21 +0900, Amit Langote wrote:
> > On 2018/04/13 7:36, Peter Geoghegan wrote:
> > > On Thu, Apr 12, 2018 at 11:47 AM, Peter Geoghegan  wrote:
> > >> In short, it looks like the tests added to update.sql by commit
> > >> 2f178441 ("Allow UPDATE to move rows between partitions") lead to this
> > >> failure, since I always hit a problem when update.sql is reached. I
> > >> haven't gone to the trouble of digging any deeper than that just yet.
> > >
> > > Without having looked at it in much detail, this seems rather more
> > > likely to be the fault of 2f178441. That was recent enough that it's
> > > easy to believe that I'd be the first to notice it, and actually has
> > > on-disk changes, in the form of ItemPointerSetMovedPartitions().
> >
> > I think you meant f16241bef7cc2 (Raise error when affecting tuple moved
> > into different partition), because it's the one that adds
> > ItemPointerSetMovedPartitions.  Commit 2f178441 you quoted added the test
> > that triggered the failure.
>
> Right. Amul, have you looked at this?
>
> Greetings,
>
> Andres Freund
>


Re: wal_consistency_checking reports an inconsistency on master branch

2018-04-12 Thread Andres Freund
On 2018-04-13 12:29:21 +0900, Amit Langote wrote:
> On 2018/04/13 7:36, Peter Geoghegan wrote:
> > On Thu, Apr 12, 2018 at 11:47 AM, Peter Geoghegan  wrote:
> >> In short, it looks like the tests added to update.sql by commit
> >> 2f178441 ("Allow UPDATE to move rows between partitions") lead to this
> >> failure, since I always hit a problem when update.sql is reached. I
> >> haven't gone to the trouble of digging any deeper than that just yet.
> > 
> > Without having looked at it in much detail, this seems rather more
> > likely to be the fault of 2f178441. That was recent enough that it's
> > easy to believe that I'd be the first to notice it, and actually has
> > on-disk changes, in the form of ItemPointerSetMovedPartitions().
> 
> I think you meant f16241bef7cc2 (Raise error when affecting tuple moved
> into different partition), because it's the one that adds
> ItemPointerSetMovedPartitions.  Commit 2f178441 you quoted added the test
> that triggered the failure.

Right. Amul, have you looked at this?

Greetings,

Andres Freund



Re: [HACKERS] path toward faster partition pruning

2018-04-12 Thread Ashutosh Bapat
On Fri, Apr 13, 2018 at 7:45 AM, Amit Langote wrote:
> On 2018/04/13 1:47, Robert Haas wrote:
>> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera wrote:
>>> Here's an idea.  Why don't we move the function/opclass creation lines
>>> to insert.sql, without the DROPs, and use the same functions/opclasses
>>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>>> partition_prune.sql, i.e. not recreate what are essentially the same
>>> objects three times?  This also leaves them around for the pg_upgrade
>>> test, which is not a bad thing.
>>
>> That sounds good, but maybe we should go further and move the
>> partitioning tests out of generically-named things like insert.sql
>> altogether and have test names that actually mention partitioning.
>
> Do you mean to do that for *all* files that have tests exercising some
> partitioning code, even if it's just one test?  I can see that tests in at
> least some of them could be put into their own partition_ file as follows:
>
> partition_insert (including tests in insert_conflict)
> partition_update
> partition_triggers
> partition_indexing (indexing.sql added when partitioned indexes went in)
> partition_ddl (for the tests in create_table and alter_table)
>

We haven't generally created test files specific to a table type, for
example temporary tables or unlogged tables; instead we have created
files by SQL command. But that's not true for indexes: we have separate
files for indexes, and we also have separate files for materialized
views and for various data types. So our test file organization cuts
across SQL commands and object types. But partitioning seems an area
large enough to have files of its own; we already have partition_join
and partition_aggregate.

Do we want to move to a directory-based organization for tests too,
where sql/ and expected/ have directories within them for various types
of objects (partitioned tables, indexes, regular tables, data types,
etc.) and each of those has files organized by SQL command? An
immediate question is where to add files that exercise a mixture of
objects; perhaps in a directory corresponding to the primary object, so
materialized views over partitioned tables would fit in the
materialized view (or just views?) directory.

Whatever organization we want to use, it should be easy to find
testcases for relevant functionality e.g. all tests for partitioned
tables or all alter table command tests.


> What about the tests in inherit.sql that start with:
>
> --
> -- Check that constraint exclusion works correctly with partitions using
> -- implicit constraints generated from the partition bound information.
> --
>
> Maybe, just move all of them to partition_prune.sql, because we no longer
> use constraint exclusion for pruning?

I think we need to keep some test cases somewhere that test constraint
exclusion on partitions and partitioned tables; those do not
necessarily fit in partition_prune.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Instability in partition_prune test?

2018-04-12 Thread David Rowley
On 13 April 2018 at 14:41, David Rowley  wrote:
> I'll just need to go think about how we can make the test stable now.

Thomas and I discussed this a bit off-list.

The attached basically adds:

set max_parallel_workers = 0;

before the Parallel Append tests.

All those tests were intended to do was check that "(never executed)"
appeared for the correct nodes. They still do that.
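
A self-contained sketch of why that stabilizes things (table names here
are illustrative, not from the patch):

create table p1 (a int);
create table p2 (a int);
insert into p1 select generate_series(1,1000);
insert into p2 select generate_series(1,1000);
set max_parallel_workers = 0;  -- workers are still planned, never launched
explain (analyze, costs off, summary off, timing off)
select count(*) from (select * from p1 union all select * from p2) p;
-- With zero workers launched, only the leader executes the Parallel
-- Append, so every subplan runs exactly once and loop counts (and hence
-- "(never executed)" annotations) are deterministic.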

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


make_partition_prune_tests_stable_hopefully.patch
Description: Binary data


Re: wal_consistency_checking reports an inconsistency on master branch

2018-04-12 Thread Amit Langote
On 2018/04/13 7:36, Peter Geoghegan wrote:
> On Thu, Apr 12, 2018 at 11:47 AM, Peter Geoghegan  wrote:
>> In short, it looks like the tests added to update.sql by commit
>> 2f178441 ("Allow UPDATE to move rows between partitions") lead to this
>> failure, since I always hit a problem when update.sql is reached. I
>> haven't gone to the trouble of digging any deeper than that just yet.
> 
> Without having looked at it in much detail, this seems rather more
> likely to be the fault of 2f178441. That was recent enough that it's
> easy to believe that I'd be the first to notice it, and actually has
> on-disk changes, in the form of ItemPointerSetMovedPartitions().

I think you meant f16241bef7cc2 (Raise error when affecting tuple moved
into different partition), because it's the one that adds
ItemPointerSetMovedPartitions.  Commit 2f178441 you quoted added the test
that triggered the failure.

Thanks,
Amit




Re: Instability in the postgres_fdw regression test

2018-04-12 Thread Etsuro Fujita

(2018/04/13 3:49), Robert Haas wrote:

On Thu, Apr 12, 2018 at 12:49 PM, Tom Lane  wrote:

We've been seeing $subject since commit 1bc0100d2, with little clue
as to the cause.  Previous attempts to fix it by preventing autovacuum
from running on the relevant tables didn't seem to help.

I have now managed to reproduce the issue reliably enough to study it.
(It turns out that on longfin's host, running the buildfarm script *under
cron* produces the failure a reasonable percentage of the time.  The
identical test case, launched from an interactive shell, never fails.
I speculate that the kernel scheduler treats cron jobs differently.)
It is in fact a timing issue, and what triggers it is there being an
active autovacuum task in another database.  The fact that the buildfarm
script uses USE_MODULE_DB in the contrib tests greatly increases the
surface area of the problem, since that creates a lot more databases that
autovacuum could be active in.  Once I realized that, I found that it can
trivially be reproduced by hand, in a "make installcheck" test, simply by
having an open transaction in another DB in the cluster.  For instance
"select pg_sleep(60);" and then run make installcheck in postgres_fdw.

So now, drilling down to the specific problem: what we're seeing is that
the plans for some queries change because the "remote" server reports
a different rowcount estimate than usual for

EXPLAIN SELECT "C 1", c2, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 
1">  2000)) FOR UPDATE

Normally the estimate is one, but in the failure cases it's 12 or so.
This estimate is coming out of ineq_histogram_selectivity, of course.
Furthermore, the highest value actually present in the histogram is
1000, because that was the highest value of "C 1" when the test did
ANALYZE up at the top.  That means we'll invoke get_actual_variable_range
to try to identify the actual current maximum.  Most of the time, it
returns 2010, which is correct, and we end up estimating that only
about one row will exceed 2000.

However, in the failure cases, what's getting extracted from the index
is .  That's because we'd inserted and deleted that value further
up in the test, and if there's been an open transaction holding back
RecentGlobalXmin, then SnapshotNonVacuumable is going to consider that
entry still good.  This value is enough larger than 2000 to move the
selectivity estimate appreciably, and then kaboom.

This theory successfully explains the observed fact that some buildfarm
failures show differences in two query plans, while others show a
difference just in the first one.  In the latter cases, the external
transaction ended, so that RecentGlobalXmin could advance, in between.

What I propose to do to fix the instability is to change the test
stanza that uses  as a key-chosen-at-random to use something less
far away from the normal range of "C 1" values, so that whether it's
still visible to get_actual_variable_range has less effect on this
selectivity estimate.  That's a hack, for sure, but I don't see any
other fix that's much less of a hack.
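
The mechanism is easy to sketch by hand (a hypothetical table and
values; the real test uses "S 1"."T 1"):

-- session 1, in any database of the cluster (holds back RecentGlobalXmin):
BEGIN; SELECT pg_sleep(60);

-- session 2:
CREATE TABLE t (c1 int);
CREATE INDEX ON t (c1);
INSERT INTO t SELECT generate_series(1, 1000);
ANALYZE t;                            -- histogram now tops out at 1000
INSERT INTO t VALUES (100000);
DELETE FROM t WHERE c1 = 100000;      -- dead, but not yet vacuumable
INSERT INTO t SELECT generate_series(1001, 2010);
EXPLAIN SELECT * FROM t WHERE c1 > 2000;
-- Under SnapshotNonVacuumable the deleted 100000 still looks good, so
-- get_actual_variable_range() reports a maximum far above 2010 and the
-- estimate for c1 > 2000 inflates accordingly.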


Thanks for the detective work.


Thanks a lot!

Best regards,
Etsuro Fujita



Re: wal_consistency_checking reports an inconsistency on master branch

2018-04-12 Thread Michael Paquier
On Thu, Apr 12, 2018 at 03:36:12PM -0700, Peter Geoghegan wrote:
> Without having looked at it in much detail, this seems rather more
> likely to be the fault of 2f178441. That was recent enough that it's
> easy to believe that I'd be the first to notice it, and actually has
> on-disk changes, in the form of ItemPointerSetMovedPartitions().

Since f16241be the following comment has been added to heap_mask():
/*
 * NB: Not ignoring ctid changes due to the tuple having moved
 * (i.e. HeapTupleHeaderIndicatesMovedPartitions), because that's
 * important information that needs to be in-sync between primary
 * and standby, and thus is WAL logged.
 */

And actually, if you remove this query from update.sql, then the
consistency checks are able to finish:
UPDATE upview set c = 120 WHERE b = 4;

This triggers a CHECK violation in the test suite, but it should not
result in a row being moved: even though c is updated, the row would
remain on the same child partition, so no rows are moved across
partitions here.
Could this be pointing to an older issue?
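
For anyone reproducing this: the check is driven by a GUC, roughly like
so (a sketch; redo must actually run, e.g. on a standby or during crash
recovery, for the comparison to happen):

ALTER SYSTEM SET wal_consistency_checking = 'heap';  -- or 'all'
SELECT pg_reload_conf();
-- Re-run update.sql; during redo each masked page image is compared
-- with the full-page image in WAL, and any mismatch raises
-- "FATAL: inconsistent page found".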
--
Michael




Re: [HACKERS] path toward faster partition pruning

2018-04-12 Thread David Rowley
On 13 April 2018 at 14:15, Amit Langote  wrote:
> On 2018/04/13 1:47, Robert Haas wrote:
>> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera wrote:
>>> Here's an idea.  Why don't we move the function/opclass creation lines
>>> to insert.sql, without the DROPs, and use the same functions/opclasses
>>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>>> partition_prune.sql, i.e. not recreate what are essentially the same
>>> objects three times?  This also leaves them around for the pg_upgrade
>>> test, which is not a bad thing.
>>
>> That sounds good, but maybe we should go further and move the
>> partitioning tests out of generically-named things like insert.sql
>> altogether and have test names that actually mention partitioning.
>
> Do you mean to do that for *all* files that have tests exercising some
> partitioning code, even if it's just one test?  I can see that tests in at
> least some of them could be put into their own partition_ file as follows:

Wouldn't it be best to just move hash partition tests into hash_part?
Leave all the other stuff where it is?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Problem while setting the fpw with SIGHUP

2018-04-12 Thread Amit Kapila
On Fri, Apr 13, 2018 at 6:59 AM, Michael Paquier  wrote:
> On Thu, Apr 12, 2018 at 02:55:53PM -0400, Robert Haas wrote:
>> I think it may actually be confusing.  If you run pg_ctl reload and it
>> reports that the value has changed, you'll expect it to have taken
>> effect.  But really, it will take effect at some later time.
>

+1. I also think it is confusing and it could be difficult for end
users to know when the setting is effective.

> It is true that sometimes people like to temporarily disable
> full_page_writes, particularly when doing a bulk load of data to
> minimize the WAL overhead, and then re-enable it just after loading
> this data.
>
> Still, does it matter when the change is effective?  By disabling
> full_page_writes even temporarily, you accept the fact that this
> instance will not be safe until the next checkpoint completes.  The
> instance even ends up writing less unnecessary WAL data if the change
> is only effective at the next checkpoint.  Well, it is true that this
> increases potential torn-page problems, but the user is already
> accepting that risk: if a crash happens before the next checkpoint,
> the instance is exposed to torn pages anyway, as it chose to disable
> full_page_writes.
>

I think this means that it will be difficult for end users to predict
unless they track the next checkpoint, which isn't too bad, but won't
be convenient either.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Instability in partition_prune test?

2018-04-12 Thread David Rowley
On 13 April 2018 at 14:37, Thomas Munro  wrote:
> I think it might be working as designed!  The participants are allowed
> to run other subplans, because they're parallel sub-plans: it's just a
> question of whether any backend manages to complete its subplan and
> then go looking for another subplan to attack before it is marked as
> complete.  If that's true, then the loop number shown can't
> technically be made stable, can it?  Unless you make it so that the
> scans were not allowed to be run by more than one worker.

Yeah, you're right.

It's pretty easy to show this by giving Parallel Append a small
subnode and a larger subnode to work on:

drop table if exists t1;
drop table if exists t2;
create table t1 (a int);
create table t2 (a int);
insert into t1 select generate_Series(1,1000000);
insert into t2 select generate_Series(1,10000);

explain (analyze, costs off, summary off, timing off) select count(*)
from (select * from t1 union all select * from t2) t1_t2;
                             QUERY PLAN
---------------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=336667 loops=3)
                     ->  Parallel Seq Scan on t1 (actual rows=333333 loops=3)
                     ->  Parallel Seq Scan on t2 (actual rows=10000 loops=1)
(8 rows)

I'll just need to go think about how we can make the test stable now.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Instability in partition_prune test?

2018-04-12 Thread Thomas Munro
On Fri, Apr 13, 2018 at 1:21 PM, David Rowley wrote:
> On 13 April 2018 at 10:29, Thomas Munro  wrote:
>> This is a Parallel Append with three processes working on three
>> subplans.  It looks like one of the subplans got executed twice?
>
> Hi Thomas,
>
> Thanks for the report. If you're able to run the scripts in [1],
> confirm whether you can reproduce the failure; if so, then revert the
> code back to 5c0675215 and see if you still get the additional loop.
>
> You'll need to update the expected results once back in 5c0675215 as
> the 6 subplans will no longer be removed.
>
> I've been unable to reproduce this so keen to get results from a
> machine that can.

I tried for some time with 1, 2 and 3 parallel copies of that shell
script you showed (after tweaking it to stick $$ into the filename
used for test.out so they didn't clobber each other).  No dice,
couldn't make it fail.

So, as the option of last resort, I tried thinking about what's
actually going on here.

I think it might be working as designed!  The participants are allowed
to run other subplans, because they're parallel sub-plans: it's just a
question of whether any backend manages to complete its subplan and
then go looking for another subplan to attack before it is marked as
complete.  If that's true, then the loop number shown can't
technically be made stable, can it?  Unless you make it so that the
scans were not allowed to be run by more than one worker.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

2018-04-12 Thread Amit Langote
On 2018/04/13 1:47, Robert Haas wrote:
> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera wrote:
>> Here's an idea.  Why don't we move the function/opclass creation lines
>> to insert.sql, without the DROPs, and use the same functions/opclasses
>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>> partition_prune.sql, i.e. not recreate what are essentially the same
>> objects three times?  This also leaves them around for the pg_upgrade
>> test, which is not a bad thing.
> 
> That sounds good, but maybe we should go further and move the
> partitioning tests out of generically-named things like insert.sql
> altogether and have test names that actually mention partitioning.

Do you mean to do that for *all* files that have tests exercising some
partitioning code, even if it's just one test?  I can see that tests in at
least some of them could be put into their own partition_ file as follows:

partition_insert (including tests in insert_conflict)
partition_update
partition_triggers
partition_indexing (indexing.sql added when partitioned indexes went in)
partition_ddl (for the tests in create_table and alter_table)

That leaves:

cluster
create_index (one test here could be moved to partition_indexing?)
foreign_data (could be moved to partition_ddl?)
foreign_key  (could be moved to partition_ddl?)
hash_part(leave alone, because already contains 'part' in the name?)
identity
join
plancache
plpgsql
publication
rowsecurity
rules
stats_ext
tablesample
truncate
updatable_views
vacuum


What about the tests in inherit.sql that start with:

--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
--

Maybe, just move all of them to partition_prune.sql, because we no longer
use constraint exclusion for pruning?

Thanks,
Amit




Re: Problem while setting the fpw with SIGHUP

2018-04-12 Thread Michael Paquier
On Thu, Apr 12, 2018 at 02:55:53PM -0400, Robert Haas wrote:
> I think it may actually be confusing.  If you run pg_ctl reload and it
> reports that the value has changed, you'll expect it to have taken
> effect.  But really, it will take effect at some later time.

It is true that sometimes people like to temporarily disable
full_page_writes, particularly when doing a bulk load of data to
minimize the WAL overhead, and then re-enable it just after loading
this data.

Still, does it matter when the change is effective?  By disabling
full_page_writes even temporarily, you accept the fact that this
instance will not be safe until the next checkpoint completes.  The
instance even ends up writing less unnecessary WAL data if the change
is only effective at the next checkpoint.  Well, it is true that this
increases potential torn-page problems, but the user is already
accepting that risk: if a crash happens before the next checkpoint, the
instance is exposed to torn pages anyway, as it chose to disable
full_page_writes.
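
The bulk-load pattern mentioned above looks roughly like this (a
sketch; table and file names are hypothetical):

ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();
COPY big_table FROM '/tmp/data.csv' WITH (FORMAT csv);  -- the bulk load
ALTER SYSTEM SET full_page_writes = on;
SELECT pg_reload_conf();
CHECKPOINT;  -- the point in dispute: when does the new value take effect?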
--
Michael




Re: Instability in partition_prune test?

2018-04-12 Thread David Rowley
On 13 April 2018 at 10:29, Thomas Munro  wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2018-04-12%2018%3A18%3A05
>
>  partition_prune  ... FAILED
>
>    Subplans Removed: 6
>    ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> !  ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>    ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> --- 1608,1614 ----
>    Subplans Removed: 6
>    ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
> !  ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=2)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>    ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
>
> This is a Parallel Append with three processes working on three
> subplans.  It looks like one of the subplans got executed twice?

Hi Thomas,

Thanks for the report. If you're able to run the scripts in [1],
confirm whether you can reproduce the failure; if so, then revert the
code back to 5c0675215 and see if you still get the additional loop.

You'll need to update the expected results once back in 5c0675215 as
the 6 subplans will no longer be removed.

I've been unable to reproduce this so keen to get results from a
machine that can.

[1] 
https://www.postgresql.org/message-id/CAKJS1f8o2Yd%3DrOP%3DEt3A0FWgF%2BgSAOkFSU6eNhnGzTPV7nN8sQ%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



MinIndexTupleSize seems slightly wrong

2018-04-12 Thread Peter Geoghegan
itup.h says of MinIndexTupleSize/MaxIndexTuplesPerPage:

/*
 * MaxIndexTuplesPerPage is an upper bound on the number of tuples that can
 * fit on one index page.  An index tuple must have either data or a null
 * bitmap, so we can safely assume it's at least 1 byte bigger than a bare
 * IndexTupleData struct.  We arrive at the divisor because each tuple
 * must be maxaligned, and it must have an associated item pointer.
 */
#define MinIndexTupleSize MAXALIGN(sizeof(IndexTupleData) + 1)
#define MaxIndexTuplesPerPage   \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))

However, that at least seems questionable to me. See _bt_pgaddtup()
for a simple example of this -- "minus infinity" items on internal
pages are sized sizeof(IndexTupleData).

The code still seems fine to me, since that only happens at most once
per page. Is it worth noting the exception?

-- 
Peter Geoghegan



Re: Native partitioning tablespace inheritance

2018-04-12 Thread Michael Paquier
On Thu, Apr 12, 2018 at 03:39:19PM -0400, Robert Haas wrote:
> Well I don't object to updating the documentation, but just because
> something isn't what the user expects doesn't make it a bug.  Users
> can have arbitrary expectations.

Yes, I agree that this is not a bug, and should not be categorized as
such.  Mentioning in the documentation explicitely that tablespaces are
not inherited may be worth it.

> On a practical level, what I think users *should* expect is that table
> partitioning behaves like table inheritance except in cases where
> we've gotten around to doing something different.  Of course, the
> behavior of table partitioning here is no worse than what table
> inheritance does.  The behavior doesn't cascade from parent to child
> there, either.  If we start classifying as a bug every area where
> table partitioning works like table inheritance but someone can think
> of something better, we've probably got about 50 bugs, some of which
> will require years of work to fix.

+1.

> Unless done rather carefully, it's also going to break
> dump-and-restore and, I think, likely also pg_upgrade.  Suppose that
> in the original cluster TABLESPACE was set on the parent but not on
> the children.  The children are therefore dumped without a TABLESPACE
> clause.  On the old cluster that would have worked as intended, but
> with this change the children will end up in the parent's tablespace
> instead of the database default tablespace.

I have not looked at the problem closely, but pg_dump now relies
heavily on default_tablespace to make sure that a table is using the
correct tablespace, and does not append a TABLESPACE clause to CREATE
TABLE, so that pg_restore can work with --no-tablespaces.  So we'll
surely get some regressions and corner cases if we're not careful.
--
Michael




Re: Instability in partition_prune test?

2018-04-12 Thread Tom Lane
Thomas Munro  writes:
> Apologies if this was already discussed, I didn't see it.  One of my
> animals elver had a one-off failure this morning:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2018-04-12%2018%3A18%3A05

Yeah, this looks very much like my recent complaint:

https://www.postgresql.org/message-id/1876.1523224471%40sss.pgh.pa.us

Yours is the third case so far.  But we've had little luck reproducing it,
so there's no way to be sure if Rowley's proposed fix helps.  If you can
help test that, please do.

regards, tom lane



Re: wal_consistency_checking reports an inconsistency on master branch

2018-04-12 Thread Peter Geoghegan
On Thu, Apr 12, 2018 at 11:47 AM, Peter Geoghegan  wrote:
> In short, it looks like the tests added to update.sql by commit
> 2f178441 ("Allow UPDATE to move rows between partitions") lead to this
> failure, since I always hit a problem when update.sql is reached. I
> haven't gone to the trouble of digging any deeper than that just yet.

Without having looked at it in much detail, this seems rather more
likely to be the fault of 2f178441. That was recent enough that it's
easy to believe that I'd be the first to notice it, and actually has
on-disk changes, in the form of ItemPointerSetMovedPartitions().

-- 
Peter Geoghegan



Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread David Rowley
On 13 April 2018 at 04:57, Robert Haas  wrote:
> BTW, looking at ExecSetupPartitionPruneState:
>
> /*
>  * Create a sub memory context which we'll use when making calls to the
>  * query planner's function to determine which partitions will match.  The
>  * planner is not too careful about freeing memory, so we'll ensure we
>  * call the function in this context to avoid any memory leaking in the
>  * executor's memory context.
>  */
>
> This is a sloppy cut-and-paste job, not only because somebody changed
> one copy of the word "planner" to "executor" and left the others
> untouched, but also because the rationale isn't really correct for the
> executor anyway, which has memory contexts all over the place and
> frees them all the time.  I don't know whether the context is not
> needed at all or whether the context is needed but the rationale is
> different, but I don't buy that explanation.

The comment is written exactly as intended. Unsure which of the
"planner"s you think should be "executor".

The context is needed. I can easily produce an OOM without it.



-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: crash with sql language partition support function

2018-04-12 Thread Alvaro Herrera
I wonder what prompted people to #include "catalog/partition.h" in
executor.h.

Amit Langote wrote:

> Anyway, after reading your replies, I thought of taking a stab at unifying
> the partitioning information that's cached by relcache.c.

After going over your patch, I think you went slightly overboard here.
Or maybe not, but this patch is so large that it's hard to form an
opinion about it.  Some of these cleanups we should probably adopt per
discussion upthread, but I think we're at a point where we have to work
in smaller steps to avoid destabilizing the code.

I'm not sure about the new PartitionInfo that you propose.  I see you
had to add partitioning/partbounds.h to rel.h -- not a fan of that.
I was thinking of a simpler fix there, just remove one of the two
memcxts in RelationData and put both structs in the remaining one.
Maybe I'm not ambitious enough.

Here's what I would propose: create partitioning/partbounds.c to deal
with partition bounds (i.e. mostly PartitionBoundInfoData and
PartitionBoundSpec), and have utils/cache/partcache.c (with
corresponding utils/partcache.h) for RelationGetPartitionDesc and
RelationGetPartitionKey (not much else, it seems).

Maybe also move get_partition_for_tuple to execPartition.c?

Preliminarily, I would say that the following functions would be in
partbounds.c (in roughly this order):

Exported:

get_qual_from_partbound
partition_bounds_equal
partition_bounds_copy
check_new_partition_bound
check_default_allows_bound
get_hash_partition_greatest_modulus
make_one_range_bound

qsort_partition_list_value_cmp
partition_rbound_cmp
partition_rbound_datum_cmp
qsort_partition_hbound_cmp
partition_hbound_cmp

partition_list_bsearch
partition_range_bsearch
partition_range_datum_bsearch
partition_hash_bsearch

static:
get_partition_bound_num_indexes
make_partition_op_expr
get_partition_operator
get_qual_for_hash
get_qual_for_list
get_qual_for_range
get_range_key_properties
get_range_nulltest

Unsure yet about compute_hash_value and satisfies_hash_partition.

The rest would remain in catalog/partition.c, which should hopefully not
be a lot.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: submake-errcodes

2018-04-12 Thread Christoph Berg
Re: Tom Lane 2018-04-12 <10354.1523558...@sss.pgh.pa.us>
> ... or then again, maybe I do.  Is it possible that your build
> recipe involves invoking our makefiles from an outer "make" run?
> If so, maybe you need to explicitly set MAKELEVEL=0 when invoking
> our build, to keep it from thinking it is a sub-make.  Not sure
> about whether it'd be wise to reset MAKEFLAGS as well.

I don't know about Devrim's case, but debian/rules is indeed a
makefile. Will look closer again once I'm back from pgconf.de, thanks
for the hint.

Christoph



Re: Covering GiST indexes

2018-04-12 Thread Komяpa
> Another thing that could be done for PostGIS geometries is just another
> opclass which stores geometries "as is" in leafs.  As I know, geometries
> contain MBRs inside their own structure, so there is no need to store an
> extra MBR.  I think the reason why PostGIS doesn't have such an opclass
> yet is that geometries are frequently large and can easily exceed the
> maximum size of an index tuple.

Geometry datatype layout was designed with TOASTing in mind: most of the
data is stored in the header, including type, SRID, box and some other
flags, so reading just the first several bytes tells you a lot.

PostGIS datasets are often of mixed binary length: among buildings, for
example, it is quite common to have a lot of four-corner houses, and just
one mapped as a circle that digitizing software decided to render as a
720-point polygon. Since reading it from TOAST for an index would require
a seek of some kind, it may be just as efficient to look it up in the
table.

This way a lossy decompress function can help with index only scans: up
to some binary length, try to store the original geometry in the index.
After that, store a shape that has fewer points in it but covers a
slightly larger area, plus a flag that it's not precise. There are a lot
of ways to generate a covering shape with fewer points: the obvious one
is a box; less obvious are a non-axis-aligned box, a collection of boxes
for a multipart shape, an outer ring for an area with lots of holes, a
convex hull, or a smallest enclosing k-gon.

In GIS there is the problem of the border of Russia: the country crosses
the 180th meridian and has a complex border shape. If you take its
bounding box, it spans from -180 to 180. If you query any spot in the US
or in Europe, you'll have it intersecting with your area, requiring a
full recheck, complete detoast and decompression, and then "no, it's not
a thing we need, skip". Allowing anything better than a box would help.
If we're allowing a complex shape, we've got the container for it:
geometry.
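
The problem is easy to see with stock PostGIS (the countries table here
is hypothetical):

-- Russia's bounding box degenerates to nearly the whole world:
SELECT ST_XMin(geom), ST_XMax(geom) FROM countries WHERE name = 'Russia';
--  -180 | 180
-- So a box query over the central US still bbox-matches Russia, forcing
-- detoast, decompression and a full recheck just to reject the row:
SELECT count(*) FROM countries
WHERE geom && ST_MakeEnvelope(-100, 35, -90, 45, 4326);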

If we're storing the geometry in the index and the original is small,
why not allow a complete Index Only Scan on it, and let it skip the
recheck? :)

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-12 Thread Alvaro Herrera
Robert Haas wrote:

> Please revert the part of this commit that changed the lock level.

Done.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Native partitioning tablespace inheritance

2018-04-12 Thread Keith Fiske
On Thu, Apr 12, 2018 at 3:39 PM, Robert Haas  wrote:

> On Thu, Apr 12, 2018 at 3:15 PM, Jonathan S. Katz wrote:
> > Behavior-wise it’s certainly a bug: you add a TABLESPACE on the parent
> > table, and that property is not passed down to the children, which is not
> > what the user expects.  At a minimum, if we don’t back patch it, we
> probably
> > need to update the documentation to let people know.
>
> Well I don't object to updating the documentation, but just because
> something isn't what the user expects doesn't make it a bug.  Users
> can have arbitrary expectations.
>
> On a practical level, what I think users *should* expect is that table
> partitioning behaves like table inheritance except in cases where
> we've gotten around to doing something different.  Of course, the
> behavior of table partitioning here is no worse than what table
> inheritance does.  The behavior doesn't cascade from parent to child
> there, either.  If we start classifying as a bug every area where
> table partitioning works like table inheritance but someone can think
> of something better, we've probably got about 50 bugs, some of which
> will require years of work to fix.
>
> This is clearly new development aimed at delivering a novel behavior.
> It isn't going to fix an error in code that already exists.  It's
> going to write new code to do something that the system has never done
> before.
>
> Unless done rather carefully, it's also going to break
> dump-and-restore and, I think, likely also pg_upgrade.  Suppose that
> in the original cluster TABLESPACE was set on the parent but not on
> the children.  The children are therefore dumped without a TABLESPACE
> clause.  On the old cluster that would have worked as intended, but
> with this change the children will end up in the parent's tablespace
> instead of the database default tablespace.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Your last example is why I proposed taking the TABLESPACE defined on the
parent and applying it to the children. Then all the children have one
defined and nothing breaks as long as all tablespaces are properly defined
as part of the restoration.

I'm also not sure that we should have this mindset of partitioning working
as inheritance does either. Inheritance was only used before because it was
the only mechanism available. And while you do still use it under the hood
for parts of partitioning, I don't see any reason we should let people
assume that partitioning works anything like inheritance. In my opinion
they are two very distinct options. Now we have "real" partitioning, so
let's act like it. That doesn't mean defining old behavior as a "bug", I
agree. It just means we're defining and clarifying how partitioning itself
is supposed to be working.

-- 
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com


Re: Native partitioning tablespace inheritance

2018-04-12 Thread Robert Haas
On Thu, Apr 12, 2018 at 3:15 PM, Jonathan S. Katz wrote:
> Behavior-wise it’s certainly a bug: you add a TABLESPACE on the parent
> table, and that property is not passed down to the children, which is not
> what the user expects.  At a minimum, if we don’t back patch it, we probably
> need to update the documentation to let people know.

Well I don't object to updating the documentation, but just because
something isn't what the user expects doesn't make it a bug.  Users
can have arbitrary expectations.

On a practical level, what I think users *should* expect is that table
partitioning behaves like table inheritance except in cases where
we've gotten around to doing something different.  Of course, the
behavior of table partitioning here is no worse than what table
inheritance does.  The behavior doesn't cascade from parent to child
there, either.  If we start classifying as a bug every area where
table partitioning works like table inheritance but someone can think
of something better, we've probably got about 50 bugs, some of which
will require years of work to fix.

This is clearly new development aimed at delivering a novel behavior.
It isn't going to fix an error in code that already exists.  It's
going to write new code to do something that the system has never done
before.

Unless done rather carefully, it's also going to break
dump-and-restore and, I think, likely also pg_upgrade.  Suppose that
in the original cluster TABLESPACE was set on the parent but not on
the children.  The children are therefore dumped without a TABLESPACE
clause.  On the old cluster that would have worked as intended, but
with this change the children will end up in the parent's tablespace
instead of the database default tablespace.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: submake-errcodes

2018-04-12 Thread Andres Freund
On 2018-04-12 11:22:45 -0700, Jacob Champion wrote:
> On Thu, Apr 12, 2018 at 11:00 AM, Tom Lane  wrote:
> > Andrew Gierth  writes:
> >> Is it worth exploring the idea of changing to a non-recursive style of
> >> makefile?
> >
> > Not post-feature-freeze, for sure.  Whether it's worth the work as a
> > long-term project, I dunno.
> 
> I've been taking a look at this in my limited free time, so I might as
> well publicly register my interest here. Moving to non-recursive Make
> will probably be quite a bit of work/pain, but I also think it's
> probably worth it in the end.

Yea, it'd sure be nice.  Whether it's worth the pain or whether the time
is better spent moving to cmake or such, I'm not sure.

Greetings,

Andres Freund



Re: Native partitioning tablespace inheritance

2018-04-12 Thread Keith Fiske
On Thu, Apr 12, 2018 at 3:15 PM, Jonathan S. Katz <jonathan.k...@excoventures.com> wrote:

>
> > On Apr 12, 2018, at 3:10 PM, Alvaro Herrera wrote:
> >
> > Robert Haas wrote:
> >> On Thu, Apr 12, 2018 at 2:40 PM, Jonathan S. Katz wrote:
> >>> If there are no strong objections I am going to add this to the “Older
> Bugs”
> >>> section of Open Items in a little bit.
> >>
> >> I strongly object.  This is not a bug.  The TABLESPACE clause doing
> >> exactly what it was intended to do, which is determine where all of
> >> the storage associated with the partitioned table itself goes.  It so
> >> happens that there is no storage, so now somebody would like to
> >> repurpose the same option to do something different.  That's fine, but
> >> it doesn't make the current behavior wrong.  And we're certainly not
> >> going to back-patch a behavior change like that.
>
> Behavior-wise it’s certainly a bug: you add a TABLESPACE on the parent
> table, and that property is not passed down to the children, which is not
> what the user expects.  At a minimum, if we don’t back patch it, we
> probably
> need to update the documentation to let people know.
>
> > Keep in mind that we do not offer any promises to fix items listed in
> > the Older Bugs section; as I said elsewhere, it's mostly a dumping
> > ground for things that get ignored later.  I think it's fine to add it
> > there, if Jon wants to keep track of it, on the agreement that it will
> > probably not lead to a backpatched fix.
>
> Per an off-list discussion, it does not make sense to back patch but
> it does make sense to try to get it into 11 as part of making things more
> stable.
>
> Perhaps as a short-term fix, we update the docs to let users know that if
> you put a TABLESPACE on the parent table it does not get passed down
> to the children?
>
> Jonathan
>
>

I also think it's rather confusing that, even though there is technically
no data storage going on with the parent, the parent itself does not get
placed in the tablespace given to the creation command. Completely ignoring
a flag given to a command, with zero feedback, is my biggest complaint here.
If it's going to be ignored, at least emitting some sort of feedback (kind
of like long name truncation does) would be useful, and that should be
considered for back-patching to 10.
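
For illustration, a minimal way to observe the silent ignore described above
(the tablespace name "ts1" is hypothetical; assume it already exists):

create table measurement (logdate date) partition by range (logdate)
  tablespace ts1;
select relname, reltablespace from pg_class where relname = 'measurement';
-- if the clause is being ignored as described, reltablespace comes back 0
-- (the database default), with no notice or warning emitted.
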
-- 
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com


Proposal: Remove "no" from the default english.stop word list

2018-04-12 Thread Peter Marreck
I recently ran into an issue where (after implementing fulltext search on
my site) a user searching real estate listings for "no pets" also got
results for "pets OK"! This was obviously a problem. After investigating,
it turns out the word "no" is considered a stopword by default (it's in the
english.stop word list) and is therefore not indexed. I propose that this
is wrong, for the following reasons:

1) The word "yes", by contrast, is not in this stopword list, a bizarre
omission if the reason "no" was included was its supposed lack of
significance (although I would recommend removing both from the list, on
the grounds that both are significant)
2) The word "no" IS significant as a qualifier (such as, in my case, "no
pets", or more usefully, "no<->pets" if using to_tsquery instead of
plainto_tsquery), especially for boolean-like data that is brought into
fulltext search scope. For example, if some attribute "balcony" is
false/not checked, you could index that as "no balcony", which makes both
the presence AND the absence of it searchable, as the example below shows.
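
For concreteness, a minimal demonstration of the mismatch, using the
default 'english' configuration:

select to_tsvector('english', 'no pets');
-- 'pet':2                ("no" is dropped as a stopword)
select to_tsvector('english', 'pets OK');
-- 'ok':2 'pet':1
select to_tsvector('english', 'pets OK')
       @@ plainto_tsquery('english', 'no pets');
-- true: a listing that allows pets matches a search for "no pets"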

That's basically it. Thoughts?

-Peter


Native partitioning tablespace inheritance

2018-04-12 Thread David G. Johnston
On Thursday, April 12, 2018, Robert Haas  wrote:

> On Thu, Apr 12, 2018 at 2:40 PM, Jonathan S. Katz
>  wrote:
> > If there are no strong objections I am going to add this to the “Older
> Bugs”
> > section of Open Items in a little bit.
>
> I strongly object.  This is not a bug.  The TABLESPACE clause is doing
> exactly what it was intended to do, which is determine where all of
> the storage associated with the partitioned table itself goes.  It so
> happens that there is no storage, so now somebody would like to
> repurpose the same option to do something different.
>

The part about accepting an option that is basically invalid is reasonably
bug-like.  Making the tablespace and partition-by clauses mutually exclusive
would be a fix worth making, though it couldn't be back-patched.
Documentation is good, but outright prevention is better.

If we can't agree on the future behavior, we should at least prevent the
existing situation in v11.  I doubt that redefining the behavior of the
existing option to anything other than an error would be acceptable.

David J.


Re: Native partitioning tablespace inheritance

2018-04-12 Thread Robert Haas
On Thu, Apr 12, 2018 at 3:10 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Thu, Apr 12, 2018 at 2:40 PM, Jonathan S. Katz
>>  wrote:
>> > If there are no strong objections I am going to add this to the “Older 
>> > Bugs”
>> > section of Open Items in a little bit.
>>
>> I strongly object.  This is not a bug.  The TABLESPACE clause is doing
>> exactly what it was intended to do, which is determine where all of
>> the storage associated with the partitioned table itself goes.  It so
>> happens that there is no storage, so now somebody would like to
>> repurpose the same option to do something different.  That's fine, but
>> it doesn't make the current behavior wrong.  And we're certainly not
>> going to back-patch a behavior change like that.
>
> Keep in mind that we do not offer any promises to fix items listed in
> the Older Bugs section; as I said elsewhere, it's mostly a dumping
> ground for things that get ignored later.  I think it's fine to add it
> there, if Jon wants to keep track of it, on the agreement that it will
> probably not lead to a backpatched fix.

*shrug*

If it's not a bug, then it doesn't make sense to add it to a list of
bugs just as a way of keeping track of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Native partitioning tablespace inheritance

2018-04-12 Thread Alvaro Herrera
Robert Haas wrote:
> On Thu, Apr 12, 2018 at 2:40 PM, Jonathan S. Katz
>  wrote:
> > If there are no strong objections I am going to add this to the “Older Bugs”
> > section of Open Items in a little bit.
> 
> I strongly object.  This is not a bug.  The TABLESPACE clause is doing
> exactly what it was intended to do, which is determine where all of
> the storage associated with the partitioned table itself goes.  It so
> happens that there is no storage, so now somebody would like to
> repurpose the same option to do something different.  That's fine, but
> it doesn't make the current behavior wrong.  And we're certainly not
> going to back-patch a behavior change like that.

Keep in mind that we do not offer any promises to fix items listed in
the Older Bugs section; as I said elsewhere, it's mostly a dumping
ground for things that get ignored later.  I think it's fine to add it
there, if Jon wants to keep track of it, on the agreement that it will
probably not lead to a backpatched fix.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Native partitioning tablespace inheritance

2018-04-12 Thread Robert Haas
On Thu, Apr 12, 2018 at 2:40 PM, Jonathan S. Katz
 wrote:
> If there are no strong objections I am going to add this to the “Older Bugs”
> section of Open Items in a little bit.

I strongly object.  This is not a bug.  The TABLESPACE clause is doing
exactly what it was intended to do, which is determine where all of
the storage associated with the partitioned table itself goes.  It so
happens that there is no storage, so now somebody would like to
repurpose the same option to do something different.  That's fine, but
it doesn't make the current behavior wrong.  And we're certainly not
going to back-patch a behavior change like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problem while setting the fpw with SIGHUP

2018-04-12 Thread Robert Haas
On Wed, Apr 11, 2018 at 7:09 AM, Heikki Linnakangas  wrote:
> I think the new behavior where the GUC only takes effect at next checkpoint
> is OK. It seems quite intuitive.

I think it may actually be confusing.  If you run pg_ctl reload and it
reports that the value has changed, you'll expect it to have taken
effect.  But really, it will take effect at some later time.
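
A hypothetical session showing the confusion, assuming the proposed
behavior where the change is applied only at the next checkpoint:

alter system set full_page_writes = off;
select pg_reload_conf();
show full_page_writes;
-- already reports "off", yet full-page images would keep being written
-- until the next checkpoint actually applies the change.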

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Instability in the postgres_fdw regression test

2018-04-12 Thread Robert Haas
On Thu, Apr 12, 2018 at 12:49 PM, Tom Lane  wrote:
> We've been seeing $subject since commit 1bc0100d2, with little clue
> as to the cause.  Previous attempts to fix it by preventing autovacuum
> from running on the relevant tables didn't seem to help.
>
> I have now managed to reproduce the issue reliably enough to study it.
> (It turns out that on longfin's host, running the buildfarm script *under
> cron* produces the failure a reasonable percentage of the time.  The
> identical test case, launched from an interactive shell, never fails.
> I speculate that the kernel scheduler treats cron jobs differently.)
> It is in fact a timing issue, and what triggers it is there being an
> active autovacuum task in another database.  The fact that the buildfarm
> script uses USE_MODULE_DB in the contrib tests greatly increases the
> surface area of the problem, since that creates a lot more databases that
> autovacuum could be active in.  Once I realized that, I found that it can
> trivially be reproduced by hand, in a "make installcheck" test, simply by
> having an open transaction in another DB in the cluster.  For instance
> "select pg_sleep(60);" and then run make installcheck in postgres_fdw.
>
> So now, drilling down to the specific problem: what we're seeing is that
> the plans for some queries change because the "remote" server reports
> a different rowcount estimate than usual for
>
> EXPLAIN SELECT "C 1", c2, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE 
> (("C 1" > 2000)) FOR UPDATE
>
> Normally the estimate is one, but in the failure cases it's 12 or so.
> This estimate is coming out of ineq_histogram_selectivity, of course.
> Furthermore, the highest value actually present in the histogram is
> 1000, because that was the highest value of "C 1" when the test did
> ANALYZE up at the top.  That means we'll invoke get_actual_variable_range
> to try to identify the actual current maximum.  Most of the time, it
> returns 2010, which is correct, and we end up estimating that only
> about one row will exceed 2000.
>
> However, in the failure cases, what's getting extracted from the index
> is .  That's because we'd inserted and deleted that value further
> up in the test, and if there's been an open transaction holding back
> RecentGlobalXmin, then SnapshotNonVacuumable is going to consider that
> entry still good.  This value is enough larger than 2000 to move the
> selectivity estimate appreciably, and then kaboom.
>
> This theory successfully explains the observed fact that some buildfarm
> failures show differences in two query plans, while others show a
> difference just in the first one.  In the latter cases, the external
> transaction ended, so that RecentGlobalXmin could advance, in between.
>
> What I propose to do to fix the instability is to change the test
> stanza that uses  as a key-chosen-at-random to use something less
> far away from the normal range of "C 1" values, so that whether it's
> still visible to get_actual_variable_range has less effect on this
> selectivity estimate.  That's a hack, for sure, but I don't see any
> other fix that's much less of a hack.

Thanks for the detective work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



wal_consistency_checking reports an inconsistency on master branch

2018-04-12 Thread Peter Geoghegan
Running "make installcheck" with wal_consistency_checking='all' on the
master branch shows the following failure on a streaming replica:

19696/2018-04-12 11:35:29 PDT FATAL:  inconsistent page found, rel
1663/50192/66636, forknum 0, blkno 0
19696/2018-04-12 11:35:29 PDT CONTEXT:  WAL redo at 2/6D8411F8 for
Heap/DELETE: off 4 KEYS_UPDATED
19695/2018-04-12 11:35:29 PDT LOG:  startup process (PID 19696) exited
with exit code 1
19695/2018-04-12 11:35:29 PDT LOG:  terminating any other active
server processes
19695/2018-04-12 11:35:29 PDT LOG:  database system is shut down

I can correlate it with this wal_debug output on the primary:

18713/2018-04-12 11:20:40 PDT ERROR:  new row violates check option
for view "upview"
18713/2018-04-12 11:20:40 PDT DETAIL:  Failing row contains (a, 4, 120, 1, 1).
18713/2018-04-12 11:20:40 PDT STATEMENT:  UPDATE upview set c = 120 WHERE b = 4;
18713/2018-04-12 11:20:40 PDT LOG:  INSERT @ 2/6D8411F8:  -
Transaction/ABORT: 2018-04-12 11:20:40.085145-07
18073/2018-04-12 11:20:40 PDT LOG:  xlog bg flush request write
2/6D84; flush: 0/0, current is write 2/6D84; flush 2/6D7B40B0
18713/2018-04-12 11:20:40 PDT LOG:  INSERT @ 2/6D841378:  -
Heap/DELETE: off 4 KEYS_UPDATED
18713/2018-04-12 11:20:40 PDT STATEMENT:  UPDATE upview set a = 'b', b
= 15, c = 120 WHERE b = 4;
18713/2018-04-12 11:20:40 PDT LOG:  INSERT @ 2/6D8415F0:  - Heap/INSERT: off 10
18713/2018-04-12 11:20:40 PDT STATEMENT:  UPDATE upview set a = 'b', b
= 15, c = 120 WHERE b = 4;
18713/2018-04-12 11:20:40 PDT ERROR:  new row violates check option
for view "upview"
18713/2018-04-12 11:20:40 PDT DETAIL:  Failing row contains (b, 15, 120, 1, 1).
18713/2018-04-12 11:20:40 PDT STATEMENT:  UPDATE upview set a = 'b', b
= 15, c = 120 WHERE b = 4;

In short, it looks like the tests added to update.sql by commit
2f178441 ("Allow UPDATE to move rows between partitions") lead to this
failure, since I always hit a problem when update.sql is reached. I
haven't gone to the trouble of digging any deeper than that just yet.

-- 
Peter Geoghegan



Re: Native partitioning tablespace inheritance

2018-04-12 Thread Jonathan S. Katz

> On Apr 12, 2018, at 2:36 PM, Christophe Pettus  wrote:
> 
> 
>> On Apr 12, 2018, at 09:17, Robert Haas  wrote:
>> Hmm, that's interesting.  So you want the children to inherit the
>> parent's tablespace when they are created, but if the parent's
>> tablespace is later changed, the existing children don't move?
> 
> +1 to that behavior.
> 
> While it's always possible to just say "do the right thing" to the 
> application when creating new children (that is, expect that they will always 
> specify a tablespace if it's not the default), this seems like the 
> least-surprising behavior.
> 
> It's true that an unpartitioned table will always be created in the default 
> tablespace unless otherwise specified, but child tables are sufficiently 
> distinct from that case that I don't see it as a painful asymmetry.

If there are no strong objections I am going to add this to the “Older Bugs”
section of Open Items in a little bit.

Jonathan




Re: submake-errcodes

2018-04-12 Thread Andrew Gierth
> "Alvaro" == Alvaro Herrera  writes:

 Alvaro> I'm altogether ignorant of how we might use it for the headers
 Alvaro> problem, mind. I only tried to tackle the main executable.

Solving the headers problem would seem to require making things
non-recursive at the topmost level rather than trying to do only a
partial conversion.

-- 
Andrew (irc:RhodiumToad)



Re: Native partitioning tablespace inheritance

2018-04-12 Thread Christophe Pettus

> On Apr 12, 2018, at 09:17, Robert Haas  wrote:
> Hmm, that's interesting.  So you want the children to inherit the
> parent's tablespace when they are created, but if the parent's
> tablespace is later changed, the existing children don't move?

+1 to that behavior.

While it's always possible to just say "do the right thing" to the application 
when creating new children (that is, expect that they will always specify a 
tablespace if it's not the default), this seems like the least-surprising 
behavior.

It's true that an unpartitioned table will always be created in the default 
tablespace unless otherwise specified, but child tables are sufficiently 
distinct from that case that I don't see it as a painful asymmetry.

--
-- Christophe Pettus
   x...@thebuild.com




Re: submake-errcodes

2018-04-12 Thread Tom Lane
Devrim =?ISO-8859-1?Q?G=FCnd=FCz?=  writes:
> ... still same error. Easy to reproduce on F-27 box. 

I don't have F27 at hand, but I tried F26 and F28, and I can't reproduce
on either one.  I tried various combinations of python2 versus python3,
in-tree build versus VPATH from bare checkout versus VPATH from
distprep'd tree, and they all work for me.

Just to clarify, I'm experimenting with

$ git clean -dfx
$ ./configure ... --with-python [ PYTHON=/usr/bin/python3 ]
$ cd src/pl/plpython/
$ make -j25

and variants of that, and what I get as a result is a make trace
starting with

make -C ../../../src/backend generated-headers

If you're not seeing that, something's very wrong, and I do not
know what.

[ time passes ]

... or then again, maybe I do.  Is it possible that your build
recipe involves invoking our makefiles from an outer "make" run?
If so, maybe you need to explicitly set MAKELEVEL=0 when invoking
our build, to keep it from thinking it is a sub-make.  Not sure
about whether it'd be wise to reset MAKEFLAGS as well.

regards, tom lane



Re: crash with sql language partition support function

2018-04-12 Thread Alvaro Herrera
Robert Haas wrote:
> On Thu, Apr 12, 2018 at 8:55 AM, Alvaro Herrera  
> wrote:
> > Amit Langote wrote:
> >> Anyway, after reading your replies, I thought of taking a stab at unifying
> >> the partitioning information that's cached by relcache.c.
> >
> > Wow.  Now that's one large patch.  I'm going to run with this for HEAD,
> > but I think we should do a minimal fix for PG10.
> 
> Is there really a compelling reason to do more than minimal fixes in
> HEAD?

IMO there is ample reason for restructuring the lot of code that currently
lives in catalog/partition.c.  We're going to have to support this code
for a minimum of six years -- and that's only if we get around to
reorganizing it during pg12.  I don't have a lot of faith that I'll be
motivated to reorganize it in pg12, but I do know that I am motivated to
reorganize it now.  If we do happen to reorganize it in pg12, then any
bugs we find afterwards will cost double in terms of backpatching the
fixes than if we reorganize it now.  I don't want my future self to
curse my current self for not doing it when it was appropriate -- i.e.
when it was fresh in my mind and freshly written.

> We are (or should be) trying to stabilize this branch so we can
> ship it and start the next one,

Yes, that is what I am trying to do -- I want us to have a sane base
upon which to do our work for years to come.

Do we need more than minimal fixes in the memory context, FmgrInfo, and
miscellaneous other fixes that Amit is proposing in this patch?  That I
don't know.  I hope to have an answer for this later, and I think this
is the reason for your final comment:

> and the chances that heavy hacking on this will delay that process
> seem better than average.

Maybe if there are no bugs and it's just ugly coding, it is best left
alone.  But as I said, I don't know the answer yet.

I will make sure to propose any functional code changes separately from
code movement.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: psql leaks memory on query cancellation

2018-04-12 Thread Teodor Sigaev

I imagine that this indicates that control-C processing allocates some
memory it doesn't free, resulting in an "island" up at the end of memory
that prevents glibc from releasing all the free memory it's got.  Whether
that's an actual leak, or just memory we're holding onto in hopes of
reusing it, isn't clear.  (valgrind might be useful.)


malloc can request memory from the kernel in two ways: sbrk() and mmap();
the first one has the problem described, mmap() doesn't. This is covered in
mallopt(3) in the section on M_MMAP_THRESHOLD; to test it, try repeating the
test with the MALLOC_MMAP_THRESHOLD_ environment variable set to 8192.



--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: crash with sql language partition support function

2018-04-12 Thread Alvaro Herrera
Alvaro Herrera wrote:

> I'm dealing with this now -- will push shortly.  The sane thing to do is
> backpatch my previous memcxt fixes, since your patch introduces a
> problem that we discussed with that other patch, namely that you would
> leak the whole memory context if there is a problem while running the
> function.  I also noticed here that copy_partition_key is doing memcpy()
> on the fmgr_info, which is bogus -- it should be using fmgr_info_copy.
> Rather than patching that one piece it seems better to replace it
> wholesale, since I bet this won't be the last time we'll hear about this
> routine, and I would prefer not to deal with differently broken code in
> the older branch.

Pushed.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: submake-errcodes

2018-04-12 Thread Alvaro Herrera
Andrew Gierth wrote:
> > "Tom" == Tom Lane  writes:
> 
>  Tom> I'm beginning to get dissatisfied with this approach of expecting
>  Tom> the topmost Make run to do the generated-headers work
> 
> Is it worth exploring the idea of changing to a non-recursive style of
> makefile?

I looked into this a while ago.  It seems easily doable for the backend
code proper, modulo potential command-line length issues; but anything
involving shared libraries requires Makefile.shlib to be involved, which
seems painful to resolve.  Once I realized the shlib problem, I stopped
trying, but if you find some way around that it may be worthwhile.

I'm altogether ignorant of how we might use it for the headers problem,
mind.  I only tried to tackle the main executable.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: submake-errcodes

2018-04-12 Thread Tom Lane
Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:
>  Tom> I'm beginning to get dissatisfied with this approach of expecting
>  Tom> the topmost Make run to do the generated-headers work

> Is it worth exploring the idea of changing to a non-recursive style of
> makefile?

Not post-feature-freeze, for sure.  Whether it's worth the work as a
long-term project, I dunno.

regards, tom lane



Re: submake-errcodes

2018-04-12 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> I'm beginning to get dissatisfied with this approach of expecting
 Tom> the topmost Make run to do the generated-headers work

Is it worth exploring the idea of changing to a non-recursive style of
makefile?

-- 
Andrew (irc:RhodiumToad)



Re: psql leaks memory on query cancellation

2018-04-12 Thread Tom Lane
Craig Ringer  writes:
> On 12 April 2018 at 18:26, Darafei "Komяpa" Praliaskouski  
> wrote:
>> Is it expected behavior (so I can have a look at something server returned
>> somehow and it's kept there for me), or a plain leak?

> This is totally normal behaviour for any C program.

Well, it depends.  On some platforms, malloc basically never gives back
memory to the OS, but on glibc/Linux it does.  What I see here,
experimenting with a slightly-cut-down version of the example, is
that if psql is allowed to finish the query normally then its memory
usage according to "top" *does* go back down to very little at the
end.  But if one hits control-C partway through, it doesn't.

I imagine that this indicates that control-C processing allocates some
memory it doesn't free, resulting in an "island" up at the end of memory
that prevents glibc from releasing all the free memory it's got.  Whether
that's an actual leak, or just memory we're holding onto in hopes of
reusing it, isn't clear.  (valgrind might be useful.)

In short, there might be something that could be improved here, but
I've not dug into it enough to say for sure.

BTW, I also notice that either psql or libpq seems to take a darn
long time to release a several-GB-sized query result.  That might
be improvable as well.

regards, tom lane



Re: Covering GiST indexes

2018-04-12 Thread Peter Geoghegan
On Thu, Apr 12, 2018 at 4:00 AM, Andrey Borodin  wrote:
> I have two concerns.
> First one is about INDEX_AM_RESERVED_BIT.
> B-tree uses it as a base for prefix truncation (I'm not quite sure why it is 
> usually called suffix truncation, but this is a matter for another thread).

Since you brought it up, and since I pushed that particular
terminology, I should acknowledge that the original 1977 Bayer paper
on suffix truncation calls a B-Tree with suffix truncation a prefix
B-Tree. However, more recent work seems to consistently refer to the
technique as suffix truncation, while also referring to more advanced
techniques for compressing (not truncating) leaf tuples as prefix
compression.

I suggested suffix truncation because it seemed to be the dominant way
of referring to the technique. And, because it seemed more logical:
the suffix is what gets truncated away.

-- 
Peter Geoghegan



Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread Robert Haas
On Thu, Apr 12, 2018 at 12:40 AM, David Rowley
 wrote:
> I guess the problem there would be there's nothing to say that parse
> analysis will shortly be followed by a call to the planner, and a call
> to the planner does not mean the plan is about to be executed. So I
> don't think it would be possible to keep pointers to relcache entries
> between these modules, and it would be hard to determine whose
> responsibility it would be to call relation_close().

Yeah, that's definitely a non-starter.

> It might be possible to do something better in each module by keeping
> an array indexed by RTI which have each entry NULL initially then on
> first relation_open set the element in the array to that pointer.

I'm not sure that makes a lot of sense in the planner, but in the
executor it might be a good idea.   See also
https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
for related discussion.  I think that a coding pattern where we rely
on relation_open(..., NoLock) is inherently dangerous -- it's too easy
to be wrong about whether the lock is sure to have been taken.  It
would be much better to open the relation once and hold onto the
pointer, not just for performance reasons, but for robustness.

BTW, looking at ExecSetupPartitionPruneState:

/*
 * Create a sub memory context which we'll use when making calls to the
 * query planner's function to determine which partitions will
match.  The
 * planner is not too careful about freeing memory, so we'll ensure we
 * call the function in this context to avoid any memory leaking in the
 * executor's memory context.
 */

This is a sloppy cut-and-paste job, not only because somebody changed
one copy of the word "planner" to "executor" and left the others
untouched, but also because the rationale isn't really correct for the
executor anyway, which has memory contexts all over the place and
frees them all the time.  I don't know whether the context is not
needed at all or whether the context is needed but the rationale is
different, but I don't buy that explanation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Instability in the postgres_fdw regression test

2018-04-12 Thread Tom Lane
We've been seeing $subject since commit 1bc0100d2, with little clue
as to the cause.  Previous attempts to fix it by preventing autovacuum
from running on the relevant tables didn't seem to help.

I have now managed to reproduce the issue reliably enough to study it.
(It turns out that on longfin's host, running the buildfarm script *under
cron* produces the failure a reasonable percentage of the time.  The
identical test case, launched from an interactive shell, never fails.
I speculate that the kernel scheduler treats cron jobs differently.)
It is in fact a timing issue, and what triggers it is there being an
active autovacuum task in another database.  The fact that the buildfarm
script uses USE_MODULE_DB in the contrib tests greatly increases the
surface area of the problem, since that creates a lot more databases that
autovacuum could be active in.  Once I realized that, I found that it can
trivially be reproduced by hand, in a "make installcheck" test, simply by
having an open transaction in another DB in the cluster.  For instance
"select pg_sleep(60);" and then run make installcheck in postgres_fdw.

So now, drilling down to the specific problem: what we're seeing is that
the plans for some queries change because the "remote" server reports
a different rowcount estimate than usual for

EXPLAIN SELECT "C 1", c2, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 
1" > 2000)) FOR UPDATE

Normally the estimate is one, but in the failure cases it's 12 or so.
This estimate is coming out of ineq_histogram_selectivity, of course.
Furthermore, the highest value actually present in the histogram is
1000, because that was the highest value of "C 1" when the test did
ANALYZE up at the top.  That means we'll invoke get_actual_variable_range
to try to identify the actual current maximum.  Most of the time, it
returns 2010, which is correct, and we end up estimating that only
about one row will exceed 2000.

However, in the failure cases, what's getting extracted from the index
is .  That's because we'd inserted and deleted that value further
up in the test, and if there's been an open transaction holding back
RecentGlobalXmin, then SnapshotNonVacuumable is going to consider that
entry still good.  This value is enough larger than 2000 to move the
selectivity estimate appreciably, and then kaboom.

This theory successfully explains the observed fact that some buildfarm
failures show differences in two query plans, while others show a
difference just in the first one.  In the latter cases, the external
transaction ended, so that RecentGlobalXmin could advance, in between.

What I propose to do to fix the instability is to change the test
stanza that uses  as a key-chosen-at-random to use something less
far away from the normal range of "C 1" values, so that whether it's
still visible to get_actual_variable_range has less effect on this
selectivity estimate.  That's a hack, for sure, but I don't see any
other fix that's much less of a hack.

regards, tom lane



Re: [HACKERS] path toward faster partition pruning

2018-04-12 Thread Robert Haas
On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera  wrote:
> Here's an idea.  Why don't we move the function/opclass creation lines
> to insert.sql, without the DROPs, and use the same functions/opclasses
> in the three tests insert.sql, alter_table.sql, hash_part.sql and
> partition_prune.sql, i.e. not recreate what are essentially the same
> objects three times?  This also leaves them around for the pg_upgrade
> test, which is not a bad thing.

That sounds good, but maybe we should go further and move the
partitioning tests out of generically-named things like insert.sql
altogether and have test names that actually mention partitioning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

2018-04-12 Thread Teodor Sigaev



Peter Geoghegan wrote:

On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan  wrote:

I did find another problem, though. Looks like the idea to explicitly
represent the number of attributes directly has paid off already:

pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
create unique index cov_idx on covering_bug (f1) include(f2);
insert into covering_bug select i, i * random() * 1000, i * random() *
10 from generate_series(0,10) i;
DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
CREATE TABLE
DEBUG:  building index "cov_idx" on table "covering_bug" serially
CREATE INDEX
ERROR:  tuple has wrong number of attributes in index "cov_idx"


Actually, this was an error on my part (though I'd still maintain that
the check paid off here!). I'll still add defensive assertions inside
_bt_newroot(), and anywhere else that they're needed. There is no
reason to not add defensive assertions in all code that handles page
splits, and needs to fetch a highkey from some other page. We missed a
few of those.
Agreed, I prefer to add more Asserts, maybe even more than actually
needed. Assert-documented code :)




I'll add an item to "Decisions to Recheck Mid-Beta" section of the
open items page for this patch. We should review the decision to make
a call to _bt_check_natts() within _bt_compare(). It might work just
as well as an assertion, and it would be unfortunate if workloads that
don't use covering indexes had to pay a price for the
_bt_check_natts() call, even if it was a small price. I've seen
_bt_compare() appear prominently in profiles quite a few times.



Could you show a patch?

I think we need to move _bt_check_natts() and its call under
USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't
pay for an unused feature.

--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: Covering GiST indexes

2018-04-12 Thread Alexander Korotkov
Hi, Andrey!

On Thu, Apr 12, 2018 at 2:00 PM, Andrey Borodin 
wrote:

> Looks like we finally have covering indexes! And that's cool!
>

Thank you!

> So I decided to create a thread to discuss covering GiST indexes.
> Here's a prototype patch implementing this functionality.
> It is quite small (+80 -30) and lacks tests and docs. But it creates a
> context.
>
> I have two concerns.
> First one is about INDEX_AM_RESERVED_BIT.
> B-tree uses it as a base for prefix truncation (I'm not quite sure why it
> is usually called suffix truncation, but this is a matter for another thread).
> GiST , probably, will not use [pre\su]fix truncation. But I'd like to use
> that 13th bit to implement intra-page indexing - a way to improve search
> within gist page. See [0,1]
>

I also think that GiST isn't going to do suffix truncation of keys, because
GiST composes keys for every attribute and tries to use them in queries even
if GiST didn't use those attributes in any split.  Depending on data
distribution, key representation, the query and so on, those keys may turn
out to be useful or not.  And GiST doesn't have any legal interface to
determine that in advance.

I think that GiST needs another feature: not suffix truncation, but a
different attribute representation in leaf pages and internal pages.  For
example, GiST on points currently stores boxes in leaf pages.  That's a
great waste of space.  So, we might just have different tuple descriptors
for internal and leaf pages of GiST, which could have both different
attribute types and different counts of attributes.  Thankfully GiST pages
don't have high keys and we don't have to mix tuples of different types on
the same page (unlike B-tree).

> Second, currently including indexes do not allow same attributes in both
> keys and include parts.
> # create index on x using gist(c) include (c);
> ERROR:  included columns must not intersect with key columns
>
> But it makes sense for example for geometries like PostGIS. Index keys are
> truncated to small MBRs while having whole complex geometry right in an
> index could be handy.
>

The issue discovered in [1] seems to be related.  With the fix provided
there, when a column is indexed without index-only scan support, it can't
be used in an index-only scan anyway (even if it's indexed another time
with index-only scan support).  So, I think we need a better fix for [1]
first.

Another thing that could be done for PostGIS geometries is just another
opclass which stores geometries "as is" in leaf pages.  As far as I know,
geometries contain MBRs inside them, so there is no need to store an extra
MBR.  I think the reason why PostGIS doesn't have such an opclass yet is
that geometries are frequently large and can easily exceed the maximum size
of an index tuple.  The same problem applies to covering indexes too.  A
possible solution here might be to support externally toasted attributes in
covering indexes.  But I think that still requires quite an amount of work.

1.
https://www.postgresql.org/message-id/1516210494.1798.16.camel%40nlpgo.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Native partitioning tablespace inheritance

2018-04-12 Thread Robert Haas
On Wed, Apr 11, 2018 at 9:55 PM, David Rowley
 wrote:
> I imagine the correct thing to do is properly record the TABLESPACE
> option for the partitioned table then make child tables use that if
> nothing else was specified.
>
> This would allow the parent partition's tablespace to be changed from
> time to time as disk partitions become full to allow the new
> partitions to be created in the tablespace which sits on a disk with
> the most free space.

Hmm, that's interesting.  So you want the children to inherit the
parent's tablespace when they are created, but if the parent's
tablespace is later changed, the existing children don't move?  I
guess that's a defensible behavior, but it's not one I would have
considered.  It's certainly quite different from what the TABLESPACE
option means when applied to an unpartitioned table.
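
Under that reading, the semantics would be something like this sketch
(tablespace names "ts1" and "ts2" are hypothetical, and this is the
proposed behavior, not what the system does today):

create table p (a int) partition by range (a) tablespace ts1;
create table p1 partition of p for values from (1) to (10);
-- p1 would be created in ts1
alter table p set tablespace ts2;
-- changes only the default for future children
create table p2 partition of p for values from (10) to (20);
-- p2 would be created in ts2, while p1 stays in ts1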

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: crash with sql language partition support function

2018-04-12 Thread Robert Haas
On Thu, Apr 12, 2018 at 8:55 AM, Alvaro Herrera  wrote:
> Amit Langote wrote:
>> Anyway, after reading your replies, I thought of taking a stab at unifying
>> the partitioning information that's cached by relcache.c.
>
> Wow.  Now that's one large patch.  I'm going to run with this for HEAD,
> but I think we should do a minimal fix for PG10.

Is there really a compelling reason to do more than minimal fixes in
HEAD?  We are (or should be) trying to stabilize this branch so we can
ship it and start the next one, and the chances that heavy hacking on
this will delay that process seem better than average.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-12 Thread Alvaro Herrera
Robert Haas wrote:

> I don't think it was a good idea to change this without a lot more
> discussion, as part of another commit that really was about something
> else, and after feature freeze.

> Please revert the part of this commit that changed the lock level.

You're right, that was too hasty.  Will revert.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: crash with sql language partition support function

2018-04-12 Thread Alvaro Herrera
Amit Langote wrote:

> Since this bug also exists in the released PG 10 branch, I also created a
> patch for that.  It's slightly different than the one for PG 11dev,
> because there were some changes recently to how the memory context is
> manipulated in RelationBuildPartitionKey.  That's
> v1-PG10-0001-Fix-a-memory-context-bug-in-RelationBuildPartitio.patch.

Here's a repro for pg10, which doesn't have hash partitioning:

create function my_int4_sort(int4,int4) returns int language sql
  as $$ select case when $1 = $2 then 0 when $1 > $2 then 1 else -1 end; $$;
create operator class test_int4_ops for type int4 using btree as
  operator 1 < (int4,int4), operator 2 <= (int4,int4),
  operator 3 = (int4,int4), operator 4 >= (int4,int4),
  operator 5 > (int4,int4), function 1 my_int4_sort(int4,int4);
create table t (a int4) partition by range (a test_int4_ops);
create table t1 partition of t for values from (0) to (1000);
insert into t values (100);
insert into t values (200); -- *boom*

I'm dealing with this now -- will push shortly.  The sane thing to do is
backpatch my previous memcxt fixes, since your patch introduces a
problem that we discussed with that other patch, namely that you would
leak the whole memory context if there is a problem while running the
function.  I also noticed here that copy_partition_key is doing memcpy()
on the fmgr_info, which is bogus -- it should be using fmgr_info_copy.
Rather than patching that one piece it seems better to replace it
wholesale, since I bet this won't be the last time we'll hear about this
routine, and I would prefer not to deal with differently broken code in
the older branch.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WARNING in parallel index creation.

2018-04-12 Thread Robert Haas
On Wed, Apr 11, 2018 at 2:34 PM, Tom Lane  wrote:
> That test is, therefore, wrong.  Otherwise, no non-builtin function
> could ever be marked parallel safe, for fear that the shlib it lives
> in might try to set up a custom variable at load time.

I don't follow that logic.  If the check is more stringent than it
needs to be for safety, then we can relax it.  If the check is less
stringent than it needs to be for safety, we need to tighten it.  If
it's exactly right, then it has to stay the way it is, even if that
prohibits some things we wish we could allow.

> I'm of the opinion that having such a test here at all is crummy design.
> It implies that set_config_option is in charge of knowing about every
> one of its callers and passing judgment on whether they represent
> parallel-safe usages, which is the exact opposite of modularity,
> even if set_config_option had enough information to make that judgment
> which it does not.

The point of SerializeGUCState and RestoreGUCState is to enforce an
invariant that all cooperating processes have the same idea about the
values of every GUC.  If you let the GUCs be changed, then isn't that
invariant is violated regardless of why you let it happen?  For
example:

rhaas=# set pg_trgm.similarity_threshold = '002';
SET
rhaas=# select current_setting('pg_trgm.similarity_threshold');
 current_setting
-----------------
 002
(1 row)

rhaas=# load 'pg_trgm';
WARNING:  2 is outside the valid range for parameter
"pg_trgm.similarity_threshold" (0 .. 1)
LOAD
rhaas=# select current_setting('pg_trgm.similarity_threshold');
 current_setting
-----------------
 0.3
(1 row)

This shows that if you load a library that defines a custom variable
in process A but not process B, it's trivial to construct a query that
no longer returns the same answer in both backends, which isn't too
good, because the point of parallel query is to deliver the same
answer with parallel query that you would have gotten without it, or
at the very least, to make it look like the answer was generated by a
single process with a unified view of the world rather than multiple
uncoordinated processes.

I tried the following test with Jeff's example function and with
plperl.on_init='' and it does not produce a warning:

rhaas=# set force_parallel_mode = on;
SET
rhaas=# select foobar('dsgsdg');
 foobar
--------
 gdsgsd
(1 row)

I haven't tracked it down, but I think what must be going on here is
that, when the function is called via a query, something triggers the
shared library to get loaded before we enter parallel mode.  Then
things work OK, because the worker will load the shared library before
restoring the leader's GUC state, and so the state is actually
guaranteed to match.  Here it will *probably* end up matching because
most likely every process that loads the module will interpret the
existing textual setting of the newly-defined GUC in the same way, but
it's a bit less ironclad, and if they decide to emit warnings (as in
my first example, above) then you'll get multiple copies of them.  So
I don't think just drilling a hole through the prohibition in
set_config_option() is a particularly appealing solution.  It would
amount to suppressing a warning that is basically legitimate.

I wonder if there's an easy way to force the libraries that we're
going to need to be loaded before we reach CreateParallelContext().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: submake-errcodes

2018-04-12 Thread Devrim Gündüz

Hi Tom,

On Wed, 2018-04-11 at 10:16 -0400, Tom Lane wrote:
> You sure you're on 31f1f0bb4fd642643994d35c35ecb5b929711a99 or later?

To make sure, I am using latest git snapshot:

https://download.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2

and still same error. Easy to reproduce on F-27 box. 

> Which gmake version is this?

4.2.1

Regards,
-- 
Devrim Gündüz
EnterpriseDB: https://www.enterprisedb.com
PostgreSQL Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR

signature.asc
Description: This is a digitally signed message part


Re: Creation of wiki page for open items of v11

2018-04-12 Thread Jonathan S. Katz

> On Apr 12, 2018, at 9:24 AM, Ashutosh Bapat  
> wrote:
> 
> On Thu, Apr 12, 2018 at 6:52 PM, Alvaro Herrera  
> wrote:
>> 
>> Now, maybe what you suggest for open items is to create a separate view
>> using much of the commitfest app code, say
>> https://commitfest.postgresql.org/openitems/NNN
>> I think that would probably work well.
> 
> Yes, that's what I had in mind.

+1.  Once the .org refresh goes live I would be happy to hack on
something for that, with the understanding it may not be adopted for
the v11 cycle.

Jonathan




Re: submake-errcodes

2018-04-12 Thread Tom Lane
Michael Paquier  writes:
> But this does not work:
> ./configure blah
> cd src/pl/plpython/
> make -j 4 check

Hm.  That shows yet another parallel-safety hazard, which can be resolved
like this:

diff --git a/src/pl/plpython/Makefile b/src/pl/plpython/Makefile
index 653fe64..c17015b 100644
--- a/src/pl/plpython/Makefile
+++ b/src/pl/plpython/Makefile
@@ -131,7 +131,7 @@ installcheck: submake-pg-regress
 
 
 .PHONY: submake-pg-regress
-submake-pg-regress:
+submake-pg-regress: | submake-generated-headers
$(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X)
 
 clean distclean: clean-lib

to ensure that we finish the generated-headers work before launching
that child Make run.

I'm beginning to get dissatisfied with this approach of expecting the
topmost Make run to do the generated-headers work; it's bleeding into
more places than I'd hoped.  I don't see any really good alternative
though.  If we allow the child runs to try to do it, we're going to
have issues with parallel runs clobbering each others' output.  It's
somewhat surprising that that didn't occur more often before; the
only real difference since the bootstrap data restructuring is that
we have more generated headers than we used to.

I guess the good news is that the supported cases will be a whole lot
more bulletproof against high -j counts than they were before.  I never
used to dare going beyond -j8, because builds tended to fall over.

regards, tom lane



Re: Partitioned tables and covering indexes

2018-04-12 Thread Teodor Sigaev

Pushed. Hope the second try will be successful.

Teodor Sigaev wrote:

Thank you, pushed

Amit Langote wrote:

Hi.

On 2018/04/11 0:36, Teodor Sigaev wrote:

 Does the attached fix look correct?  Haven't checked the fix with
ATTACH
 PARTITION though.


Attached patch seems to fix the problem.  However, I would rather get
rid of modifying stmt->indexParams.  That seems to be more logical
for me.  Also, it would be good to check some covering indexes on
partitioned tables.  See the attached patch.


Seems right way, do not modify incoming object and do not copy rather
large and deep nested structure as suggested by Amit.


Yeah, Alexander's suggested way of using a separate variable for
indexParams is better.


But it will  be better to have a ATTACH PARTITION test too.


I have added tests.  Actually, instead of modifying existing tests, I
think it might be better to have a separate section at the end of
indexing.sql to test covering indexes feature for partitioned tables.

Attached find updated patch.

Thanks,
Amit





--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Reopen logfile on SIGHUP

2018-04-12 Thread Alexander Kuzmenkov

On 11.04.2018 00:00, Tom Lane wrote:

So we need a mechanism that's narrowly targeted
to reopening the logfile, without SIGHUP'ing the entire database.


We can send SIGUSR1 to the syslogger process. To make its pid easier to 
find out, it can be published in "$PGDATA/logging_collector.pid", as 
suggested by Grigory. The attached patch does this. It also adds a brief 
description of how to use this with logrotate.



Point 2: Depending on how you've got the log filenames configured,
setting rotation_requested may result in a change in log filename


If logrotate only needs the file to be reopened, syslogger's rotation 
does just that when using a static log file name. I imagine logrotate 
can be configured to do something useful with changing file names, too. 
It is a matter of keeping the configuration of syslogger and logrotate 
consistent.



BTW, another thing that needs to be considered is the interaction with
rotation_disabled.  Right now we automatically drop that on SIGHUP, but
I'm unclear on whether it should be different for logrotate requests.


The SIGUSR1 path is supposed to be used by automated tools. In a sense, 
it is an automatic rotation, the difference being that it originates 
from an external tool and not from syslogger itself. So, it sounds 
plausible that the rotation request shouldn't touch the 
rotation_disabled flag, and should be disabled by it, just like the 
automatic rotation.


Still, this leads us to a scenario where we can lose logs:
1. postgres is configured to use a static file name. logrotate is 
configured to move the file, send SIGUSR1 to postgres syslogger, gzip 
the file and delete it.
2. logrotate starts the rotation. It moves the file and signals postgres 
to reopen it.
3. postgres fails to reopen the file because there are too many files 
open (ENFILE/EMFILE), which is a normal occurrence on heavily loaded 
systems. Or it doesn't open the new file because the rotation_disabled 
flag is set. It continues logging to the old file.
4. logrotate has no way to detect this failure, so it gzips the file and 
unlinks it.
5. postgres continues writing to the now unlinked file, and we lose an 
arbitrary amount of logs until the next successful rotation.


With dynamic file names, logrotate can be told to skip open files, so 
that it doesn't touch our log file if we haven't switched to the new 
one. With a static file name, the log file is always open, so this 
method doesn't work. I'm not sure how to make this work reliably.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4a68ec3b40..91b96dc588 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -960,6 +960,23 @@ pg_ctl start | rotatelogs /var/log/pgsql_log 86400
   
 
   
+   The logrotate program can also be used to collect log files 
+   produced by the built-in logging collector. In this configuration, the location and
+   names of the log files are determined by logging collector, and 
+   logrotate periodically archives these files. There must be
+   some way for logrotate to inform the application that the
+   log file it is using is going to be archived, and that it must direct further output
+   to a new file. This is commonly done with a postrotate script that
+   sends a SIGHUP signal to the application, which then reopens the log
+   file. In PostgreSQL, this is achieved by sending a 
+   SIGUSR1 signal to the logging collector process. The PID of this
+   process can be found in the $PGDATA/logging_collector.pid file.
+   When the logging collector receives the signal, it switches to the new log files
+   according to its configuration (see ).
+   If it is configured to use a static file name, it will just reopen the file.
+  
+
+  
On many systems, however, syslog is not very reliable,
particularly with large log messages; it might truncate or drop messages
just when you need them the most.  Also, on Linux,
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 58b759f305..99201727ec 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -138,6 +138,8 @@ static void flush_pipe_input(char *logbuffer, int *bytes_in_logbuffer);
 static void open_csvlogfile(void);
 static FILE *logfile_open(const char *filename, const char *mode,
 			 bool allow_errors);
+static void syslogger_on_exit(int code, Datum arg);
+static void syslogger_publish_pid(void);
 
 #ifdef WIN32
 static unsigned int __stdcall pipeThread(void *arg);
@@ -286,6 +288,9 @@ SysLoggerMain(int argc, char *argv[])
 	set_next_rotation_time();
 	update_metainfo_datafile();
 
+	/* publish my pid to pid file */
+	syslogger_publish_pid();
+
 	/* main worker loop */
 	for (;;)
 	{
@@ -1417,6 +1422,64 @@ update_metainfo_datafile(void)
 		LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE)));
 }
 
+/*
+ * The pid 

Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-12 Thread Robert Haas
On Mon, Apr 2, 2018 at 3:11 PM, Alvaro Herrera  wrote:
> Why do we need AccessExclusiveLock on all children of a relation that we
> want to scan to search for rows not satisfying the constraint?  I think
> it should be enough to get ShareLock, which prevents INSERT, no?  I have
> a feeling I'm missing something here, but I don't know what, and all
> tests pass with that change.

I don't think it was a good idea to change this without a lot more
discussion, as part of another commit that really was about something
else, and after feature freeze.

As Kyotaro Horiguchi also mentioned, this introduces a deadlock
hazard.  With current master:

Setup:

create table foo (a int, b text) partition by range (a);
create table foo1 partition of foo for values from (1) to (100);
create table food (a int, b text) partition by range (a);
create table food1 partition of food for values from (1) to (100);

Session 1:
begin;
BEGIN
rhaas=# select * from food1;
 a | b
---+---
(0 rows)

rhaas=# insert into food1 values (1, 'thunk');

Session 2:
rhaas=# alter table foo attach partition food default;

At which point session 1 deadlocks, because the lock has to be
upgraded to AccessExclusiveLock since we're changing the constraints.

Now you might think about relaxing the lock level for the later
acquisition, too, but I'm not sure that's safe.  The issue is that
scanning a relation for rows that don't match the new constraint isn't
by itself sufficient: you also have to be sure that nobody can add one
later.  If they don't have the relation open, they'll certainly
rebuild their relcache entry when they open it, so it'll be fine.  But
if they already have the relation open, I'm not sure we can be certain
it will get rebuilt if, later in the same transaction, they try to
insert data.  This whole area needs more research -- there may very
well be good opportunities to reduce lock levels in this area, but it
needs careful study and analysis.

Please revert the part of this commit that changed the lock level.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-12 Thread Tom Lane
Pavan Deolasee  writes:
> On Thu, Apr 12, 2018 at 1:58 AM, Tom Lane  wrote:
>> So while looking at this, it suddenly occurred to me that probing with
>> SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment
>> either.

> Yeah it occurred to me as well, but when I looked at the code, I couldn't
> find a case that is broken. I even tried a few test cases with DDLs etc.

I think it probably can't happen for catalog MVCC scans, because we
generally take full new snapshots for those.  The situation that would be
hazardous is where we use an existing snapshot (so that it doesn't see
the just-committed Other Transaction) but advance its command counter
(so that it can see the new object we just made).  So the sort of failure
I'd predict is that a user query just after an object creation could see
duplicate OIDs in the catalogs.  To get to that point, you might need a
stable function (using the troublesome snapshot) calling a volatile one
(which contains the actual DDL).
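
For illustration, a sketch of that shape with hypothetical function names
(the volatile function is the one containing the DDL):

create function make_obj() returns void language plpgsql volatile
  as $$ begin execute 'create table new_obj (a int)'; end $$;
create function probe() returns void language plpgsql stable
  as $$ begin perform make_obj(); end $$;
select probe();
-- probe() runs with an existing snapshot whose command counter advances,
-- which is the combination described above that could, in principle,
-- expose duplicate OIDs in the catalogs.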

> But I think what you did is fine and more bullet proof. So +1 to that.

Yeah, even if this isn't actually a reachable case, it's hard to argue
that we should complicate the code and take API-break risks in the back
branches to make a tiny optimization on the assumption that it can never
happen.

regards, tom lane



Re: submake-errcodes

2018-04-12 Thread Tom Lane
Christoph Berg  writes:
> Most of these work for me as well. The actual incantation via
> debian/rules fails, but I couldn't really narrow down what the
> difference is. Manually invoking the debian/rules targets:

For a moment I thought that this might be the critical difference:
> preparing build tree... done
ie that you're doing a VPATH build.  But no, that case works as
expected too, for me.

The only conclusion I can come to is that you're using sources that
predate my recent makefile fixes, particularly 3b8f6e75f.  Please
double-check that.

regards, tom lane



Re: Partitioned tables and covering indexes

2018-04-12 Thread Teodor Sigaev
Thanks to everyone, this part is pushed. I will wait a bit before pushing the 
topic-starter patch to have time to look at the buildfarm.


Alvaro, could you add a comment to CompareIndexInfo() to clarify why it doesn't 
compare indoptions (like DESC/ASC etc)? It doesn't matter for the uniqueness of 
the index, but it changes the ordering, and it's not clear, at least to me, why 
we don't pay attention to that.



In the attached patch I did:
1) fix a bug in CompareIndexInfo(), which checked the opfamily for included columns
2) use the correct size for all vectors
3) add assertions in various places to be sure that we don't try to use an
included column as a key column
4) per Peter Geoghegan's idea, raise an error if somebody adds options to an
included column, instead of silently ignoring them (see the sketch after this list)
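
For example (a hypothetical session; the exact error wording is whatever
the patch chooses):

CREATE TABLE t (a int, b text);
CREATE UNIQUE INDEX ON t (a) INCLUDE (b DESC);
-- before the patch: the DESC option on the included column was
-- silently ignored; with the patch: an error is raised instead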





--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Creation of wiki page for open items of v11

2018-04-12 Thread Ashutosh Bapat
On Thu, Apr 12, 2018 at 6:52 PM, Alvaro Herrera  wrote:
>
> Now, maybe what you suggest for open items is to create a separate view
> using much of the commitfest app code, say
> https://commitfest.postgresql.org/openitems/NNN
> I think that would probably work well.

Yes, that's what I had in mind.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Creation of wiki page for open items of v11

2018-04-12 Thread Alvaro Herrera
Ashutosh Bapat wrote:
> On Thu, Apr 12, 2018 at 6:27 PM, Alvaro Herrera wrote:
> > Ashutosh Bapat wrote:
> >
> >> We use the commitfest app for tracking the patches submitted. It has done
> >> well. Can we repurpose it for tracking open items?
> >
> > I think the rules are too different to cram both in the same place.
> 
> What do you mean by same place? Can you please explain this a bit?

"Both in the same place" I meant both the CF items and the open items,
in a single place which is the commitfest app.

The commitfest app is designed with the commitfest workflow in mind.
There are rules about when items can be added, and there are states that
correspond to patches being developed and discussed.  Open items don't
follow the same rules.  What I fear is that if we cram open items into the
commitfest app, we will loosen the rules and break the workflow in some way
that will cause us pain when we're back in normal commitfest mode.

The other point is that we want additional annotations for open items:
which commit created an open item (so we know which committer to blame),
when it was created, when we last notified the committer, and which commit
closed it.  We don't have those in the commitfest app, and I bet there
would be an urge to add them eventually.

I was suggesting a different app because that app would implement
solutions for these somewhat different needs.

Now, maybe what you suggest for open items is to create a separate view
using much of the commitfest app code, say
https://commitfest.postgresql.org/openitems/NNN
I think that would probably work well.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-12 Thread Pavan Deolasee
On Thu, Apr 12, 2018 at 5:21 AM, Tom Lane  wrote:

> Michael Paquier  writes:
> > I have not really checked this thread in details, but one thing that
> > strikes me is that it would be rather easy to add a TAP test based on
> > the initial script that Pavan has sent.  Would that be worth testing
> > cycles or not?
>
> I doubt it --- that test is so specialized that it'd be unlikely to
> catch any other bug.
>
>
I agree; it's not worth adding a TAP test, except maybe as a demonstration of
how to write future tests for catching such issues. This was a very specialised
test case, written only after getting a full grasp of the bug, and it just
tests the thing that I knew was broken based on code reading. Also, with the
OID duplication issue fixed, hitting more bugs in this area is going to be even
more difficult.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Creation of wiki page for open items of v11

2018-04-12 Thread Ashutosh Bapat
On Thu, Apr 12, 2018 at 6:27 PM, Alvaro Herrera  wrote:
> Ashutosh Bapat wrote:
>
>> We use the commitfest app for tracking the patches submitted. It has done
>> well. Can we repurpose it for tracking open items?
>
> I think the rules are too different to cram both in the same place.

What do you mean by same place? Can you please explain this a bit?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-12 Thread Pavan Deolasee
On Thu, Apr 12, 2018 at 1:58 AM, Tom Lane  wrote:

> So while looking at this, it suddenly occurred to me that probing with
> SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment
> either.
>

Yeah, it occurred to me as well, but when I looked at the code, I couldn't
find a case that is broken; I even tried a few test cases with DDL etc.
But I think what you did is fine and more bulletproof, so +1 to that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Creation of wiki page for open items of v11

2018-04-12 Thread Alvaro Herrera
Ashutosh Bapat wrote:

> We use the commitfest app for tracking the patches submitted. It has done
> well. Can we repurpose it for tracking open items?

I think the rules are too different to cram both in the same place.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: crash with sql language partition support function

2018-04-12 Thread Alvaro Herrera
Amit Langote wrote:

> Anyway, after reading your replies, I thought of taking a stab at unifying
> the partitioning information that's cached by relcache.c.

Wow.  Now that's one large patch.  I'm going to run with this for HEAD,
but I think we should do a minimal fix for PG10.  Did you detect any
further bugs while doing all this rework, apart from the one that
started this thread?  If not, I would prefer to commit the minimal
fix from the start of the thread to both branches, then apply the larger
restructuring patch to HEAD only.

For the record, I don't like the amount of code that this is putting in
relcache.c.  I am thinking that most of that code will go to
src/backend/partitioning/partbounds.c instead.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Boolean partitions syntax

2018-04-12 Thread Jonathan S. Katz
Hi,

> On Apr 12, 2018, at 12:12 AM, Kyotaro HORIGUCHI wrote:
> 
> Hello.
> 
> At Wed, 11 Apr 2018 09:43:37 -0400, "Jonathan S. Katz" wrote:
>>case EXPR_KIND_PARTITION_BOUNDS:
>> ^~
> ..
>> 2 errors generated.
>> 
>> The attached patch fixes the error.
> 
> Sorry for the silly mistake.
> 
>> I ran the following cases:
>> 
>> Case #1: My Original Test Case
>> 
>>  CREATE TABLE records (
>>  id int GENERATED BY DEFAULT AS IDENTITY NOT NULL,
>>  record_date date NOT NULL,
>>  record_text text,
>>  archived bool NOT NULL DEFAULT FALSE
>>  ) PARTITION BY LIST(archived);
>> 
>>  CREATE TABLE records_archive
>>  PARTITION OF records
>>  FOR VALUES IN (TRUE);
>> 
>>  CREATE TABLE records_active
>>  PARTITION OF records
>>  FOR VALUES IN (FALSE);
>> 
>> Everything created like a charm.
>> 
>> Case #2: random()
>> 
>>  CREATE TABLE oddity (
>>  id int GENERATED BY DEFAULT AS IDENTITY NOT NULL,
>>  random_filter int
>>  ) PARTITION BY LIST(random_filter);
>> 
>>  CREATE TABLE oddity_random
>>  PARTITION OF oddity
>>  FOR VALUES IN ((random() * 100)::int);
>> 
>> I did a \d+ on oddity and:
>> 
>>  partitions=# \d+ oddity
>>  (truncated)
>>  Partition key: LIST (random_filter)
>>  Partitions: oddity_random FOR VALUES IN (19)
>> 
>> So this appears to behave as described above.
>> 
>> Attached is the patch with the fix for the build.  This is the first time
>> I’m attaching a patch for the core server, so apologies if I messed up the
>> formatting.
> 
> Thank you for the verification and the revised patch. The format is
> fine and the fix is correct, but I noticed that I forgot to remove
> the plural S's from the error messages. The attached is the version
> with that fix.

I applied the new version of the patch and ran through the above scenarios
again.  Everything behaved as expected from a user standpoint.

Thanks,

Jonathan



Re: psql leaks memory on query cancellation

2018-04-12 Thread Komяpa
>
>
> > Is it expected behavior (so that I can have a look at something the server
> > returned and is kept there for me), or a plain leak?
>
> This is totally normal behaviour for any C program.
>

Thanks Konstantin and Craig for the help.

To mitigate the issue I've changed the allocator on my laptop to jemalloc.
For a single psql run on my Ubuntu system:

sudo apt install libjemalloc1
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 psql

A global replacement, by putting /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
into /etc/ld.so.preload, also had a positive effect, fixing this behavior in
all applications and reducing the after-boot memory footprint from 7 GB to 3 GB.

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: Covering GiST indexes

2018-04-12 Thread Aleksander Alekseev
Hello Andrey,

> So I decided to create a thread to discuss covering GiST indexes.
> Here's a prototype patch implementing this functionality.  It is quite
> small (+80 -30) and lacks tests and docs, but it provides context for the
> discussion.

I'm glad you got interested in this area. It would be great to have
covering index support for GiST as well. Please don't forget to add
the thread to the nearest commitfest entry. I'm looking forward to the
next versions of your patch!

-- 
Best regards,
Aleksander Alekseev




Re: Covering GiST indexes

2018-04-12 Thread Teodor Sigaev
Interesting work. I don't have time now to study your patch in depth, so
please add it to the next commitfest. First of all, I'd like to see more tests
in the patch, not only CREATE INDEX.



Andrey Borodin wrote:

Hi, hackers!

Looks like we finally have covering indexes! And that's cool!

So I decided to create a thread to discuss covering GiST indexes.
Here's a prototype patch implementing this functionality.
It is quite small (+80 -30) and lacks tests and docs, but it provides context
for the discussion.

I have two concerns.
First one is about INDEX_AM_RESERVED_BIT.
B-tree uses it as a base for prefix truncation (I'm not quite sure why it is
usually called suffix truncation, but that is a matter for another thread).
GiST probably will not use [pre/suf]fix truncation, but I'd like to use that
13th bit to implement intra-page indexing, a way to improve search within a
GiST page. See [0,1].

Second, covering indexes currently do not allow the same attribute in both the
key and the INCLUDE parts.
# create index on x using gist(c) include (c);
ERROR:  included columns must not intersect with key columns

But that would make sense, for example, for geometries as in PostGIS: index
keys are truncated to small MBRs, while having the whole complex geometry
right in the index could be handy.

Any feedback will be appreciated.

Best regards, Andrey Borodin.


[0] 
https://www.postgresql.org/message-id/7780A07B-4D04-41E2-B228-166B41D07EEE%40yandex-team.ru
[1] 
https://www.postgresql.org/message-id/CAJEAwVE0rrr+OBT-P0gDCtXbVDkBBG_WcXwCBK=gho4fewu...@mail.gmail.com
  



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: psql leaks memory on query cancellation

2018-04-12 Thread Craig Ringer
On 12 April 2018 at 18:26, Darafei "Komяpa" Praliaskouski wrote:
> Hi,
>
> psql (PostgreSQL) 10.3
>
> Here are the steps to reproduce a leak:
>
> 1. connect to 10.3 server, perform the query similar to:
>
> select 'message' || generate_series(1,10);
>
> 2. monitoring psql memory usage in htop or similar tool, press ctrl+c at
> some point where you can clearly distinguish a psql with a big allocated
> buffer from psql without it.
>
> 3. see the query cancelled, but psql memory usage stays the same.
>
> This is especially painful when the query you're debugging has a runaway join
> condition, and you understand that only after it doesn't return within
> seconds as you expected.
>
> Is it expected behavior (so that I can have a look at something the server
> returned and is kept there for me), or a plain leak?

This is totally normal behaviour for any C program.

It's part of why systems should have swap space. But it's generally
fairly harmless, as it's uncommon for apps to make huge allocations
only once and then stay running with low memory use thereafter.

It would potentially be a leak if doing the same thing repeatedly led
to repeated growth.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: psql leaks memory on query cancellation

2018-04-12 Thread Konstantin Knizhnik



On 12.04.2018 13:26, Darafei "Komяpa" Praliaskouski wrote:

Hi,

psql (PostgreSQL) 10.3

Here are the steps to reproduce a leak:

1. connect to 10.3 server, perform the query similar to:

select 'message' || generate_series(1,10);

2. monitoring psql memory usage in htop or similar tool, press ctrl+c 
at some point where you can clearly distinguish a psql with a big 
allocated buffer from psql without it.


3. see the query cancelled, but psql memory usage stays the same.

This is especially painful when the query you're debugging has a runaway
join condition, and you understand that only after it doesn't return
within seconds as you expected.


Is it expected behavior (so that I can have a look at something the server
returned and is kept there for me), or a plain leak?


Darafei Praliaskouski,
GIS Engineer / Juno Minsk


It seems to be an effect of glibc malloc, which doesn't return deallocated
memory to the OS.

I attached gdb to psql and printed the memory usage before/after query interruption:

Arena 0:
system bytes =  989835264
in use bytes =  989811328
Total (incl. mmap):
system bytes = 1258274816
in use bytes = 1258250880
max mmap regions =  1
max mmap bytes   =  268439552

^CCancel request sent
ERROR:  canceling statement due to user request
postgres=# Arena 0:
system bytes = 1284907008
in use bytes = 278032
Total (incl. mmap):
system bytes = 1284907008
in use bytes = 278032
max mmap regions =  1
max mmap bytes   =  536875008

As you can see, there are 1.2 GB of mapped memory, but only about 270 kB of it
is in use.
Unfortunately, this is more or less the expected behavior of malloc: the
assumption is that if an application consumes a lot of memory at some moment
and then releases it, it will most likely reuse that memory in the future, so
there is not much sense in returning it to the OS. It is also not easy to do
so because of fragmentation: you cannot easily unmap those 1.2 GB of mapped
space while live objects, perhaps just a few bytes long, are allocated
somewhere in that area.
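
A minimal standalone sketch of that behavior (glibc assumed; malloc_stats()
and malloc_trim() are glibc extensions declared in <malloc.h>):

#include <stdlib.h>
#include <malloc.h>

int main(void)
{
    enum { N = 1000000 };
    static char *blocks[N];

    for (int i = 0; i < N; i++)
        blocks[i] = malloc(1024);   /* grow the arena by roughly 1 GB */
    malloc_stats();                 /* "system" and "in use" both large */

    for (int i = 0; i < N; i++)
        free(blocks[i]);
    malloc_stats();                 /* "in use" drops; "system" may not */

    malloc_trim(0);                 /* explicitly ask glibc to return
                                       free arena memory to the OS */
    malloc_stats();
    return 0;
}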


The only solution is to use memory contexts as the Postgres backends do, but
that would require rewriting libpq completely. I am not sure that anybody
will want to do that.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Covering GiST indexes

2018-04-12 Thread Andrey Borodin
Hi, hackers!

Looks like we finally have covering indexes! And that's cool!

So I decided to create a thread to discuss covering GiST indexes.
Here's a prototype patch implementing this functionality.
It is quite small (+80 -30) and lacks tests and docs, but it provides context
for the discussion.

I have two concerns.
First one is about INDEX_AM_RESERVED_BIT.
B-tree uses it as a base for prefix truncation (I'm not quite sure why it is
usually called suffix truncation, but that is a matter for another thread).
GiST probably will not use [pre/suf]fix truncation, but I'd like to use that
13th bit to implement intra-page indexing, a way to improve search within a
GiST page. See [0,1].

Second, covering indexes currently do not allow the same attribute in both the
key and the INCLUDE parts.
# create index on x using gist(c) include (c);
ERROR:  included columns must not intersect with key columns

But that would make sense, for example, for geometries as in PostGIS: index
keys are truncated to small MBRs, while having the whole complex geometry
right in the index could be handy.
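
A hypothetical sketch of that use case (PostGIS-style names; as shown above,
the same attribute in both the key and INCLUDE parts is currently rejected):

-- roads.way is a complex geometry; the GiST key would store only its MBR
CREATE TABLE roads (id int, way geometry);
CREATE INDEX ON roads USING gist (way) INCLUDE (way);
-- an index-only scan could then return the full geometry
-- without visiting the heap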

Any feedback will be appreciated.

Best regards, Andrey Borodin.


[0] 
https://www.postgresql.org/message-id/7780A07B-4D04-41E2-B228-166B41D07EEE%40yandex-team.ru
[1] 
https://www.postgresql.org/message-id/CAJEAwVE0rrr+OBT-P0gDCtXbVDkBBG_WcXwCBK=gho4fewu...@mail.gmail.com
 

0001-Covering-Gist.patch
Description: Binary data


psql leaks memory on query cancellation

2018-04-12 Thread Komяpa
Hi,

psql (PostgreSQL) 10.3

Here are the steps to reproduce a leak:

1. connect to 10.3 server, perform the query similar to:

select 'message' || generate_series(1,10);

2. monitoring psql memory usage in htop or similar tool, press ctrl+c at
some point where you can clearly distinguish a psql with a big allocated
buffer from psql without it.

3. see the query cancelled, but psql memory usage stays the same.

This is especially painful when the query you're debugging has a runaway join
condition, and you understand that only after it doesn't return within seconds
as you expected.

Is it expected behavior (so that I can have a look at something the server
returned and is kept there for me), or a plain leak?

Darafei Praliaskouski,
GIS Engineer / Juno Minsk


Re: Partitioned tables and covering indexes

2018-04-12 Thread Alexander Korotkov
On Thu, Apr 12, 2018 at 1:14 AM, Teodor Sigaev  wrote:

> Peter Geoghegan wrote:
>
>> On Wed, Apr 11, 2018 at 2:29 PM, Peter Eisentraut
>>  wrote:
>>
>>> But in this case it doesn't even do equality comparison, it just returns
>>> the value.
>>>
>>
>> That's the idea that I tried to express. The point is that we need to
>> tell the user that there is no need to worry about it, rather than
>> that they're wrong to ask about it. Though we should probably actually
>> just throw an error.
>>
>
> Any operation over included columns is a matter for a filter, not the index,
> so the collation will not be used anywhere inside the index.
>
> To Alexander: ComputeIndexAttrs() may use full-sized vectors (for included
> columns too) just to simplify the coding; elsewhere it seems better to have
> exactly sized vectors. Using full-sized vectors could mask an error, for
> example setting opfamily to InvalidOid for a key column. But if you insist,
> I'll modify the attached patch to use full-sized vectors everywhere.
>
> In the attached patch I did:
> 1) fix a bug in CompareIndexInfo(), which checked the opfamily for included
> columns
> 2) use the correct size for all vectors
> 3) add assertions in various places to be sure that we don't try to use an
> included column as a key column
> 4) per Peter Geoghegan's idea, raise an error if somebody adds options to an
> included column, instead of silently ignoring them


I don't insist on using full-sized vectors everywhere.  The attached patch
looks OK to me, though I haven't tested it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Partitioned tables and covering indexes

2018-04-12 Thread Alexander Korotkov
On Thu, Apr 12, 2018 at 1:14 AM, Teodor Sigaev  wrote:

>>> That's the idea that I tried to express. The point is that we need to
>>> tell the user that there is no need to worry about it, rather than
>>> that they're wrong to ask about it. Though we should probably actually
>>> just throw an error.
>>>
>>
>> Or maybe it should be the collation of the underlying table columns.
>> Otherwise the collation returned by an index-only scan would be
>> different from a table scan, no?
>>
> +1, dangerous


I'm OK with the collation of included columns being the same as the collation
of the underlying table columns.  But I still think we should throw an error
when the user tries to specify their own collation for included columns, as in
the sketch below.
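
For instance (hypothetical; the exact error wording would be up to the patch):

CREATE TABLE t (a int, b text);
CREATE UNIQUE INDEX ON t (a) INCLUDE (b COLLATE "C");
-- should be rejected rather than silently accepting a collation
-- that the index will never actually use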

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Problem while setting the fpw with SIGHUP

2018-04-12 Thread Michael Paquier
On Thu, Apr 12, 2018 at 04:59:10PM +0900, Kyotaro HORIGUCHI wrote:
> At Thu, 12 Apr 2018 14:07:53 +0900, Michael Paquier  
> wrote in <20180412050753.ga19...@paquier.xyz>
>> I have been able to spend a couple of hours on your patch, wrapping my
>> mind on your stuff.  So what I had in mind was something like this type
>> of scenario:
> 
> Thank for the precise explanation.

Just to be clear and to avoid incorrect conclusions: this is the type of
scenario I imagined when I read your previous email, and I concluded that
such scenarios cannot apply, given the strong assumption about
SharedRecoveryInProgress that your patch heavily relies on.  In short, I have
no objections.

>> (The latest patch is a mix of two patches.)
> 
> Sorry, I couldn't get this.

The patch called v2-0001-Change-FPW-handling.patch posted at
https://www.postgresql.org/message-id/20180412.103430.133595350.horiguchi.kyotaro%40lab.ntt.co.jp,
which is the latest version available, is a mix of the patch you are
creating for this thread and of a patch aimed at fixing an issue with
partition range handling.
--
Michael




Re: Problem while setting the fpw with SIGHUP

2018-04-12 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 12 Apr 2018 14:07:53 +0900, Michael Paquier  wrote 
in <20180412050753.ga19...@paquier.xyz>
> On Thu, Apr 12, 2018 at 10:34:30AM +0900, Kyotaro HORIGUCHI wrote:
> > Checkpointer never calls CreateCheckPoint while
> > RecoveryInProgress() == true. In other words, checkpointer is not
> > an updator of shared FPW at the time StartupXLOG calls
> > CreateCheckPoint for fallback_promote.
> 
> I have been able to spend a couple of hours on your patch, wrapping my
> mind on your stuff.  So what I had in mind was something like this type
> of scenario:

Thanks for the precise explanation.

A scenario in which CreateCheckPoint is called simultaneously in
different processes seems broken by itself in the first place, but
I'll put that aside for now.

> 1) The startup process requires a restart point.
> 2) The checkpointer receives the request, and blocks before reading
> RecoveryInProgress().

RecoveryInProgress doesn't take a lock. But I assume here that the
checkpointer takes a long time after entering
RecoveryInProgress and hasn't actually read
SharedRecoveryInProgress yet.

> 3) A fallback_promote is triggered, making the startup process call
> CreateCheckpoint().

I'm confused here. It seems to me that StartupXLOG calls
CreateCheckPoint only in the bootstrap or standalone cases, and no
concurrent process is running in those cases.

Even if CreateCheckPoint were called there in the IsUnderPostmaster
case, the checkpointer never calls CreateCheckPoint during the
checkpoint, so this is safe with regard to shared FPW (but the
checkpoint is blocked in this scenario).

Assuming that RequestCheckpoint() is called here, the request is
merged with the previous request above, and the function returns
after the checkpoint ends. (That is, the checkpointer must continue
to run in this case.)

> 4) Startup process finishes checkpoint, updates Insert->fullPageWrites.

According to this scenario, the checkpointer is still stalled at
this point, so it is not a concurrent update.

> 5) Checkpoint reads RecoveryInProgress to false, moves on with its
> checkpoint.

If the checkpointer sees SharedRecoveryInProgress as false, then
Create- (or Request-) CheckPoint called in (3) must have finished, and
StartupXLOG() no longer updates shared FPW. There's no concurrent
update.

> > The comment may be somewhat confusing that it is written
> > there. The point is that checkpointer and StartupXLOG are
> > mutually excluded on updating shared FPW by
> > SharedRecoveryInProgress flag.
> 
> Indeed.  I can see that it is the main key point of the patch.
> 
> > | * If full_page_writes has been turned off, issue XLOG_FPW_CHANGE before
> > | * the flag actually takes effect. Checkpointer never calls this function
> > | * before StartupXLOG() turns off SharedRecoveryInProgress so there's no
> > | * window where checkpointer and startup processes - the only updators of
> > | * the flag - can update shared FPW simultaneously. Thus no lock is
> > | * required here. Both shared and local fullPageWrites do not change
> > | * before the next reading below.
> 
> Yeah, this reduces the confusion.

Thanks^^;

> (The latest patch is a mix of two patches.)

Sorry, I couldn't get this.

> +The default is on. The change of the parmeter 
> takes
> +effect at the next checkpoint time.
> s/parmeter/parameter/
> 
> By the way, I would vote for keeping track in WAL of full_page_writes
> switched from off to on.  This is not used in the backend, but that's
> still useful for debugging end-user issues.

Agreed, and I tried that. The problem is that some records can be
written after the REDO point before XLOG_FPW_CHANGE(true) is
written. Although this is no problem for the FPW-related machinery
working properly (since no one looks at that record), the FPW record
suggests that the current checkpoint loses FPIs in its first several
records. This has a far larger impact with this patch because
shared FPW is always turned on just at the REDO point.

So I chose not to write XLOG_FPW_CHANGE(false) rather than
write bogus records.

> Actually, I was wondering why a spin lock is not taken in
> RecoveryInProgress when reading SharedRecoveryInProgress, but that's
> from 1a3d1044, which added a memory barrier instead as a guarantee...

Maybe it doesn't need a barrier, since the flag is initialized to
true and becomes false just once, and a delay in reading it by other
processes does no harm. I also think that a plain bool does not suffer
from atomicity problems. Even if all of that is true, some explanation
is needed; a standalone sketch of the pattern follows.
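
The sketch below uses C11 atomics rather than PostgreSQL's own barrier
macros, so it is an analogy to the xlog.c pattern, not the actual code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Initialized true; flipped to false exactly once by the startup
 * process. Readers may harmlessly see a stale "true". */
static atomic_bool shared_recovery_in_progress = true;

static bool local_recovery_in_progress = true;

bool recovery_in_progress(void)
{
    if (!local_recovery_in_progress)
        return false;           /* cached; can never revert to true */

    /* acquire ordering plays the role of the memory barrier added by
     * commit 1a3d1044: whatever the writer published before clearing
     * the flag is visible once we read false */
    local_recovery_in_progress =
        atomic_load_explicit(&shared_recovery_in_progress,
                             memory_order_acquire);
    return local_recovery_in_progress;
}

int main(void)
{
    atomic_store_explicit(&shared_recovery_in_progress, false,
                          memory_order_release);
    printf("recovery in progress: %d\n", recovery_in_progress());
    return 0;
}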

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: submake-errcodes

2018-04-12 Thread Christoph Berg
Re: Michael Paquier 2018-04-12 <20180411235843.gg32...@paquier.xyz>
> > You sure you're on 31f1f0bb4fd642643994d35c35ecb5b929711a99 or later?
> > Which gmake version is this?

GNU Make 4.2.1

> For what it's worth, all those combinations work for me when on
> d1e90792:
> ./configure blah
> cd src/pl/plpython/
> make -j 4 [ all | install ]
> 
> But this does not work:
> ./configure blah
> cd src/pl/plpython/
> make -j 4 check
> 
> make[2]: *** [Makefile:89: keywords_srv.o] Error 1
> make[2]: *** Deleting file 'keywords_srv.o'
> make[2]: Leaving directory '/home/ioltas/git/postgres/src/common'
> make[1]: *** [../../../src/Makefile.global:568: submake-libpgport] Error 2

Most of these work for me as well. The actual incantation via
debian/rules fails, but I couldn't really narrow down what the
difference is. Manually invoking the debian/rules targets:

$ debian/rules stamp/configure-build-py3
mkdir -p stamp build-py3
cd build-py3 && ../configure \
   --with-python \
   PYTHON=/usr/bin/python3 \
   --mandir=/usr/share/postgresql/11/man 
--docdir=/usr/share/doc/postgresql-doc-11 --sysconfdir=/etc/postgresql-common 
--datarootdir=/usr/share/ --datadir=/usr/share/postgresql/11 
--bindir=/usr/lib/postgresql/11/bin --libdir=/usr/lib/x86_64-linux-gnu/ 
--libexecdir=/usr/lib/postgresql/ --includedir=/usr/include/postgresql/ 
--with-extra-version=" (Debian 11~~devel-1)" --enable-nls 
--enable-integer-datetimes --enable-thread-safety --enable-tap-tests 
--enable-debug  --disable-rpath --with-llvm --with-uuid=e2fs --with-gnu-ld 
--with-pgport=5432 --with-system-tzdata=/usr/share/zoneinfo --with-systemd  
CFLAGS='-g -O2 -fdebug-prefix-map=/srv/projects/postgresql/pg/master=. 
-fstack-protector-strong -Wformat -Werror=format-security 
-fno-omit-frame-pointer' LDFLAGS='-Wl,-z,relro -Wl,-z,now'
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
...
checking for PYTHON... /usr/bin/python3
configure: using python 3.6.5 (default, Apr  1 2018, 05:46:30)
checking for Python distutils module... yes
checking Python configuration directory... 
/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu
checking Python include directories... -I/usr/include/python3.6m
checking how to link an embedded Python application... 
-L/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu -lpython3.6m -lpthread -ldl  
-lutil -lm
...
preparing build tree... done
configure: creating ./config.status
config.status: creating GNUmakefile
config.status: creating src/Makefile.global
config.status: creating src/include/pg_config.h
config.status: creating src/include/pg_config_ext.h
config.status: creating src/interfaces/ecpg/include/ecpg_config.h
config.status: linking ../src/backend/port/tas/dummy.s to src/backend/port/tas.s
config.status: linking ../src/backend/port/dynloader/linux.c to 
src/backend/port/dynloader.c
config.status: linking ../src/backend/port/posix_sema.c to 
src/backend/port/pg_sema.c
config.status: linking ../src/backend/port/sysv_shmem.c to 
src/backend/port/pg_shmem.c
config.status: linking ../src/backend/port/dynloader/linux.h to 
src/include/dynloader.h
config.status: linking ../src/include/port/linux.h to src/include/pg_config_os.h
config.status: linking ../src/makefiles/Makefile.linux to src/Makefile.port
touch "stamp/configure-build-py3"

$ debian/rules stamp/build-py3
/usr/bin/make -C build-py3/src/pl/plpython
make[1]: Entering directory '/srv/projects/postgresql/pg/master/build-py3/src/pl/plpython'
/usr/bin/msgfmt -c -o po/cs.mo 
/srv/projects/postgresql/pg/master/build-py3/../src/pl/plpython/po/cs.po
... more msgfmt
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Wendif-labels -Wmissing-format-attribute -Wformat-security 
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O2 
-fdebug-prefix-map=/srv/projects/postgresql/pg/master=. 
-fstack-protector-strong -Wformat -Werror=format-security 
-fno-omit-frame-pointer -fPIC -I. 
-I/srv/projects/postgresql/pg/master/build-py3/../src/pl/plpython 
-I/usr/include/python3.6m -I../../../src/include 
-I/srv/projects/postgresql/pg/master/build-py3/../src/include  -Wdate-time 
-D_FORTIFY_SOURCE=2 -D_GNU_SOURCE   -c -o plpy_cursorobject.o 
/srv/projects/postgresql/pg/master/build-py3/../src/pl/plpython/plpy_cursorobject.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Wendif-labels -Wmissing-format-attribute -Wformat-security 
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O2 
-fdebug-prefix-map=/srv/projects/postgresql/pg/master=. 
-fstack-protector-strong -Wformat -Werror=format-security 
-fno-omit-frame-pointer -fPIC -I. 
-I/srv/projects/postgresql/pg/master/build-py3/../src/pl/plpython 
-I/usr/include/python3.6m -I../../../src/include 
-I/srv/projects/postgresql/pg/master/build-py3/../src/include  -Wdate-time 
-D_FORTIFY_SOURCE=2 -D_GNU_SOURCE   -c -o plpy_elog.o 

Re: Creation of wiki page for open items of v11

2018-04-12 Thread Michael Paquier
On Thu, Apr 12, 2018 at 11:46:33AM +0530, Ashutosh Bapat wrote:
> Usually there are threads associated with the open items, then there
> are patches to solve them and owners responsible for resolving them.
> We have all the infrastructure in the commitfest app to track those
> things; maybe we could just relabel things. I haven't seen the
> commitfest app code (and don't plan to look at it myself), so maybe this
> is just a wild idea.

Moving open items from the wiki to the CF app is definitely possible,
even for this release.  It is not mandatory to have a patch to create an
entry.  One just needs an ML thread.

Agreed as well that it is not necessary to list fixed items before the
first beta is released.
--
Michael




Re: Creation of wiki page for open items of v11

2018-04-12 Thread Ashutosh Bapat
On Wed, Apr 11, 2018 at 10:33 PM, Magnus Hagander  wrote:
> On Wed, Apr 11, 2018 at 6:57 PM, Andres Freund  wrote:
>>
>> On 2018-04-11 13:54:34 -0300, Alvaro Herrera wrote:
>> > The other proposal was that we could have a simple web app to track open
>> > items.  After all, we now know what we need from it.  A wiki page seems
>> > more laborious.  (The commitfest app also sprung from a wiki page.)
>>
>> Growing a number of non-issue issue trackers...
>>
>
> Indeed.
>
> (And of course, if we want to go in *any* direction away from the wiki, it's
> not going to happen in time for *this* release..)
>

We use the commitfest app for tracking the patches submitted, and it has done
well. Can we repurpose it for tracking open items?

Usually there are threads associated with the open items, then there
are patches to solve them and owners responsible for resolving them.
We have all the infrastructure in the commitfest app to track those
things; maybe we could just relabel things. I haven't seen the
commitfest app code (and don't plan to look at it myself), so maybe this
is just a wild idea.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Creation of wiki page for open items of v11

2018-04-12 Thread Ashutosh Bapat
On Thu, Apr 12, 2018 at 2:31 AM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> On 4/11/18 10:53, Tom Lane wrote:
>>> It's not that much work to move the items rather than remove them,
>
>> Well, toward the end of the cycle, when the list of closed items is
>> quite long, then it does become a bit of a burden to carefully cut and
>> paste the item in the little browser window without making a hash of the
>> wiki page.
>
>> It's not a very large deal, but I doubt the result is very useful.  The
>> current "resolved before 11beta1" section is pretty much useless.
>
> Hm.  There's definitely an argument to be made that it's not worth
> tracking resolved items till after beta1.  Once betas exist, the list
> becomes useful to beta testers who may not be tracking events as closely
> as hackers do.
>

+1

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company