Re: ON COMMIT actions and inheritance

2018-11-07 Thread Amit Langote
Hi,

Thank you for updating the patch and adding notes to the documentation
about the points I raised.

On 2018/11/07 9:53, Michael Paquier wrote:
> On Tue, Nov 06, 2018 at 07:04:17PM +0900, Amit Langote wrote:
>> Agree with keeping it simple.  Maybe we could (should?) document that the
>> only ON COMMIT action that works when specified with a partitioned parent
>> table is DROP (other actions are essentially ignored)?
> 
> I have been thinking about this one, and here are some ideas:
> - for ON COMMIT DROP:
> When used on a partitioned table or a table with inheritance children,
> this drops the depending partitions and children.
> - for ON DELETE ROWS:
> When used on a partitioned table, this is not cascaded to its
> partitions.

I looked at the documentation patch regarding this:

@@ -1225,7 +1226,9 @@ WITH ( MODULUS numeric_literal, REM
 
  
   The temporary table will be dropped at the end of the current
-  transaction block.
+  transaction block.  When used on a partitioned table or a
+  table with inheritance children, this drops the depending
+  partitions and children.

How about:

When used on tables with inheritance children (including partitioned
tables), this also drops the children (partitions).
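
To make the suggested wording concrete, here is a minimal sketch of the
behavior (table names are illustrative; semantics per the patch under
discussion):

    BEGIN;
    CREATE TEMP TABLE tp (a int) PARTITION BY LIST (a) ON COMMIT DROP;
    CREATE TEMP TABLE tp1 PARTITION OF tp FOR VALUES IN (1);
    COMMIT;    -- drops tp, and its partition tp1 along with it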

Thanks,
Amit




Re: file cloning in pg_upgrade and CREATE DATABASE

2018-11-07 Thread Peter Eisentraut
On 07/11/2018 19:03, Tom Lane wrote:
> Peter Eisentraut  writes:
>> Committed, thanks.
> 
> It seems unfortunate that there is no test coverage in the committed
> patch.  I realize that it may be hard to test given the filesystem
> dependency, but how will we know if it works at all?

You can use something like

PG_UPGRADE_OPTS=--clone make -C src/bin/pg_upgrade check

(--link is similarly untested.)

If we do get the TAP tests for pg_upgrade set up, we can create smaller
and faster test cases for this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-07 Thread Amit Langote
On 2018/11/07 20:46, David Rowley wrote:
> On 5 November 2018 at 20:17, Amit Langote  
> wrote:
>> On 2018/11/04 19:07, David Rowley wrote:
>>> Perhaps a better design would be to instead of having random special
>>> partitioned-table-only fields in ResultRelInfo, just have an extra
>>> struct there that contains the extra partition information which would
>>> include the translation maps and then just return the ResultRelInfo
> and allow callers to look up any extra details they require.
>>
IIUC, you're saying that we could introduce a new struct that contains
auxiliary information needed by partition tuple routing (maps, slot, etc.
for tuple conversion) and add a new ResultRelInfo member of that struct
type.  That way, there is no need to return them separately or return an
index to access them from their arrays.  I guess we won't even need the
arrays we have now.  I think that might be a good idea and it simplifies
things significantly.
> 
> I've attached a patch which does this.

Thank you for updating the patch this way.

> It adds a new struct named
> PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays
> out of PartitionTupleRouting. There are fields for each of what these
> arrays used to store inside the PartitionRoutingInfo struct.
> 
> While doing this I kept glancing back over at ModifyTableState and at
> the mt_per_subplan_tupconv_maps array. I wondered if it would be
> better to make the PartitionRoutingInfo a bit more generic, perhaps
> call it TupleConversionInfo and have fields like ti_ToGeneric and
> ti_FromGeneric, with the idea that "generic" would be the root
> partition or the first subplan for inheritance updates. This would
> allow us to get rid of a good chunk of code inside nodeModifyTable.c.
> However, I ended up not doing this and left PartitionRoutingInfo to be
> specifically for Root to Partition conversion.

I think it's good that you left mt_per_subplan_tupconv_maps out of this.
UPDATE tuple routing can be said to have two steps: the first, a tiny one,
converts the tuple that needs to be routed from the source partition's
rowtype to the root's rowtype so that tuple routing proper can begin; the
second is the actual tuple routing carried out using
PartitionTupleRouting.  The first step is handled by nodeModifyTable.c, so
any data structures related to it should live in ModifyTableState.

Actually, as I also proposed upthread, we should move root_tuple_slot from
PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
it's part of the first step described above that has nothing to do with
partition tuple routing proper.  We can make PartitionTupleRouting private
to execPartition.c if we do that.
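
For readers following along, a minimal example of the operation being
described (names are illustrative): an UPDATE that moves a row between
partitions performs both steps.

    CREATE TABLE t (a int) PARTITION BY RANGE (a);
    CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (0) TO (10);
    CREATE TABLE t2 PARTITION OF t FOR VALUES FROM (10) TO (20);
    INSERT INTO t VALUES (5);
    -- Step one converts the tuple from t1's rowtype to t's rowtype;
    -- step two routes it into t2 via PartitionTupleRouting.
    UPDATE t SET a = 15;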

> Also, on the topic about what to name the conversion maps from a few
> days ago; After looking at this a bit more I decided that having them
> named ParentToChild and ChildToParent is misleading.  If the child is
> the child of some sub-partitioned table then the parent that the map
> is talking about is not the partition's parent, but the root
> partitioned table. So really RootToPartition and PartitionToRoot seem
> like much more accurate names for the maps.

Makes sense. :)

> I made a few other changes along the way; I thought that
> ExecFindPartition() would be a good candidate to take on the
> responsibility of validating the partition is valid for INSERTs when
> it uses a partition out of the subplan_resultrel_hash. I thought it
> was better to check this once when we're in the code path of grabbing
> the ResultRelInfo out that hash table rather than in a code path that
> must check if it's been done already each time we route a tuple into a
> partition. It also allowed me to get rid of
> ri_PartitionReadyForRouting.

Ah, I too had tried to remove ri_PartitionReadyForRouting, but had to give
up on that idea because I didn't think of moving the steps that must be
performed before setting it to true into that block of code in
ExecFindPartition.

> I also moved the call to
> ExecInitRoutingInfo() into ExecFindPartition() which allowed me to
> make that function static, which could result in the generation of
> slightly more optimal compiled code.
> 
> Please find attached the v14 patch.
> 
> Rather nicely git --stat reports a net negative additional code (with
> the 48 lines of new tests included)
> 
>  11 files changed, 632 insertions(+), 660 deletions(-)

That's nice!

I didn't find anything quite significant to complain about, except just
one line:

+ * Initially we must only setup 1 PartitionDispatch object; the one for

setup -> set up

Thanks,
Amit




Re: PostgreSQL Limits and lack of documentation about them.

2018-11-07 Thread John Naylor
On 11/8/18, David Rowley  wrote:
> On 8 November 2018 at 10:02, Robert Haas  wrote:
>> It might be useful for some users to explain that certain things will
>> should work for values < X, may work for values between X and Y, and
>> will definitely not work above Y.  Or maybe we can provide a narrative
>> explanation rather than just a table of numbers.  Or both.  But I
>> think trying to provide a table of exact cutoffs is sort of like
>> tilting at windmills.
>
> I added something along those lines in a note below the table. Likely
> there are better ways to format all this, but I'm trying to nail down
> what the content should be first.

The language seems fine to me.

-John Naylor



Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2018-11-07 Thread Michael Paquier
On Thu, Nov 08, 2018 at 06:00:29AM +, Andrew Gierth wrote:
> I think it would be nice to have a test that didn't rely on that, too.

Yes, I don't disagree with you, and I thought about it.  Fetching the
value from the control file is easy, and comparing two LSNs is also simple
if done directly with pg_lsn in the database (I don't want to add LSN math
to the TAP API).  I am less sure how portable it would be to read the 8
bytes of the page header for the last page of a relation across many
architectures in Perl.
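
For instance, such a comparison could look like this (a sketch; the
literal LSN stands in for a value read from a page header):

    SELECT '0/3000028'::pg_lsn <= min_recovery_end_lsn
      FROM pg_control_recovery();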

And I am not sure that we actually need this addition, as the standby
spawned in the TAP test will not have any clients scanning the pages and
generating WAL, except autovacuum, which would be better switched off in
the test.
--
Michael


signature.asc
Description: PGP signature


Re: move PartitionBoundInfo creation code

2018-11-07 Thread Amit Langote
On 2018/11/08 12:59, Amit Langote wrote:
> It might be okay to split the big switch in partition_bounds_create() into
> different functions for different partitioning methods for clarity as you
> say, but let's avoid creating new files for range, list, and hash.
> 
> I will post an updated patch with that break down.

And here is the new version.  The break down into smaller local functions
for different partitioning methods is in patch 0002.

Thanks,
Amit
>From f8a0caaa850675f7fd44d9b92bb1c61739ea55c5 Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 7 Nov 2018 13:52:43 +0900
Subject: [PATCH v3 1/3] Code re-arrangement in RelationBuildPartitionDesc

Rearranged such that the code that creates PartitionBoundInfo is
separated from the rest.

Also, simplified the portion of that code that handles mapping of
canonical partition indexes to the original indexes.
---
 src/backend/utils/cache/partcache.c | 941 +++-
 1 file changed, 486 insertions(+), 455 deletions(-)

diff --git a/src/backend/utils/cache/partcache.c 
b/src/backend/utils/cache/partcache.c
index 5757301d05..46653c7222 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -260,36 +260,25 @@ RelationBuildPartitionKey(Relation relation)
 void
 RelationBuildPartitionDesc(Relation rel)
 {
-   List   *inhoids,
-  *partoids;
-   Oid*oids = NULL;
+   PartitionDesc   partdesc;
+   PartitionBoundInfo  boundinfo;
+   List   *inhoids;
List   *boundspecs = NIL;
ListCell   *cell;
int i,
nparts;
PartitionKey key = RelationGetPartitionKey(rel);
-   PartitionDesc result;
MemoryContext oldcxt;
-
int ndatums = 0;
+   int canon_index = 0;
int default_index = -1;
-
-   /* Hash partitioning specific */
-   PartitionHashBound **hbounds = NULL;
-
-   /* List partitioning specific */
-   PartitionListValue **all_values = NULL;
-   int null_index = -1;
-
-   /* Range partitioning specific */
-   PartitionRangeBound **rbounds = NULL;
+   Oid*oids_orig;
+   int*mapping;
 
/* Get partition oids from pg_inherits */
inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
 
/* Collect bound spec nodes in a list */
-   i = 0;
-   partoids = NIL;
foreach(cell, inhoids)
{
Oid inhrelid = lfirst_oid(cell);
@@ -325,245 +314,10 @@ RelationBuildPartitionDesc(Relation rel)
}
 
boundspecs = lappend(boundspecs, boundspec);
-   partoids = lappend_oid(partoids, inhrelid);
ReleaseSysCache(tuple);
}
 
-   nparts = list_length(partoids);
-
-   if (nparts > 0)
-   {
-   oids = (Oid *) palloc(nparts * sizeof(Oid));
-   i = 0;
-   foreach(cell, partoids)
-   oids[i++] = lfirst_oid(cell);
-
-   /* Convert from node to the internal representation */
-   if (key->strategy == PARTITION_STRATEGY_HASH)
-   {
-   ndatums = nparts;
-   hbounds = (PartitionHashBound **)
-   palloc(nparts * sizeof(PartitionHashBound *));
-
-   i = 0;
-   foreach(cell, boundspecs)
-   {
-   PartitionBoundSpec *spec = 
castNode(PartitionBoundSpec,
-   
lfirst(cell));
-
-   if (spec->strategy != PARTITION_STRATEGY_HASH)
-   elog(ERROR, "invalid strategy in 
partition bound spec");
-
-   hbounds[i] = (PartitionHashBound *)
-   palloc(sizeof(PartitionHashBound));
-
-   hbounds[i]->modulus = spec->modulus;
-   hbounds[i]->remainder = spec->remainder;
-   hbounds[i]->index = i;
-   i++;
-   }
-
-   /* Sort all the bounds in ascending order */
-   qsort(hbounds, nparts, sizeof(PartitionHashBound *),
- qsort_partition_hbound_cmp);
-   }
-   else if (key->strategy == PARTITION_STRATEGY_LIST)
-   {
-   List   *non_null_values = NIL;
-
-   /*
-* Create a unified list of non-null values across all 
partitions.
-*/
-   i = 0;
-  

Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2018-11-07 Thread Andrew Gierth
> "Michael" == Michael Paquier  writes:

 >> How? It's OK (and normal) for in-core pages to have newer LSNs than
 >> minRecoveryPoint, it's only on-disk pages that must not be more
 >> recent than that. And pageinspect will show the in-core page...

 Michael> If the standby is stopped cleanly once, what's in shared
 Michael> buffers gets flushed to disk by the checkpointer once at
 Michael> shutdown, forcing minRecoveryPoint to be updated by the
 Michael> checkpointer, and that update was not happening before the fix.
 Michael> And once the standby is started again, what's on disk gets
 Michael> reloaded, showing the inconsistency.

Ah, I missed that you were shutting down.

I think it would be nice to have a test that didn't rely on that, too.

-- 
Andrew (irc:RhodiumToad)



Re: Removing unneeded self joins

2018-11-07 Thread David Rowley
On 19 October 2018 at 01:47, Alexander Kuzmenkov
 wrote:
> Here is a version that compiles.

I had a quick read through this and I think it's missing a roughly
one-page comment section detailing when we can and cannot remove these
self joins, and what measures we must take when we do remove them.

Apart from that, I noted the following during my read:

1. I don't think this is the right way to do this. There are other
places where we alter the varnoold. For example:
search_indexed_tlist_for_var(). So you should likely be doing that too
rather than working around it.

@@ -166,10 +166,13 @@ _equalVar(const Var *a, const Var *b)
  COMPARE_SCALAR_FIELD(vartypmod);
  COMPARE_SCALAR_FIELD(varcollid);
  COMPARE_SCALAR_FIELD(varlevelsup);
- COMPARE_SCALAR_FIELD(varnoold);
- COMPARE_SCALAR_FIELD(varoattno);
  COMPARE_LOCATION_FIELD(location);

+ /*
+ * varnoold/varoattno are used only for debugging and may differ even
+ * when the variables are logically the same.
+ */
+

2. Surely the following loop is incorrect:

for (i = toKeep->min_attr; i <= toKeep->max_attr; i++)
{
int attno = i - toKeep->min_attr;
toKeep->attr_needed[attno] = bms_add_members(toKeep->attr_needed[attno],
toRemove->attr_needed[attno]);
}

What if toRemove has a lower min_attr or higher max_attr?

3. "wind" -> "find"

+ * When we wind such a join, we mark one of the participating relation as

4. I think the following shouldn't be happening:

+
  Result
One-Time Filter: false
-(2 rows)
+   ->  Index Scan using parent_pkey on parent x
+ Index Cond: (k = 1)
+(4 rows)

5. I'd have thought the opposite. Surely there are more chances of
this being useful with more joins?

+ /* Limit the number of joins we process to control the quadratic behavior. */
+ if (n > join_collapse_limit)
+ break;

6. In remove_self_joins_one_level() I think you should collect the
removed relations in a Relids rather than a list.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-07 Thread Amit Kapila
On Sun, Nov 4, 2018 at 6:37 AM Haribabu Kommi  wrote:
>
> On Sun, Nov 4, 2018 at 11:17 AM Michael Paquier  wrote:
>>
>> On Sat, Nov 03, 2018 at 03:56:14PM +0530, Amit Kapila wrote:
>> > Before trying out any solution or deciding which is better, I think we
>> > want to understand why the variability in results occurred only after
>> > your patch?  Without the patch, it works just fine.
>>
>> Good point.  We surely want to have a stable feature, which gets tested
>> without triggering random failures in the builfarm.
>
>
> Thanks for the review.
>
> This patch has changed the pg_stat_statements_reset() function from
> returning void to returning the number of statements that it reset.
>

What is the motivation for that change?  It seems to be a
backward-incompatible change.  I am not saying we can't do it, but do we
have strong justification?  One reason I could see is to allow the user to
get the exact number of statements that got reset, and maybe that is more
helpful as we now have additional parameters in the API, but I am not sure
if that is a good justification.

> The regression test contains pg_stat_statements_reset() as its first
> query, to reset any query stats that are already tracked so that the
> test produces the proper results.  But with this feature, if we run this
> regression test on an already-running server, the first query's result
> varies and leads to a test failure.
>
> So to fix this problem, I added a wrapper function that masks the result
> of the pg_stat_statements_reset() function and just returns void.  With
> this wrapper function used as the first statement, the test is stable,
> as the function takes care of resetting statements already existing on
> the running server.
>

Or you could change the query to something like (Select NULL from
pg_stat_statements_reset())
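
As a concrete sketch, the first statement of the test could become:

    SELECT NULL AS dummy FROM pg_stat_statements_reset();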

> With the above change, the regression test is stable. Comments?
>

In the commit message, you wrote: "Backward Compatible change, Now
pg_stat_statements_reset() function returns the number of statement
entries that are reset."

Do you want to say incompatible instead of compatible?

+  If no parameter is specified or specify everything as
NULL(invalid),
+  it will discard all statistics.
+  By default, this function can only be executed by superusers. Access may
+  be granted to others using GRANT.

I think it will look better if you can continue the "By default, ..."
line from where the previous line ends rather than starting a new
line.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2018-11-07 Thread Michael Paquier
On Thu, Nov 08, 2018 at 05:04:35AM +, Andrew Gierth wrote:
> How? It's OK (and normal) for in-core pages to have newer LSNs than
> minRecoveryPoint, it's only on-disk pages that must not be more recent
> than that. And pageinspect will show the in-core page...

If the standby is stopped cleanly once, what's in shared buffers gets
flushed to disk by the checkpointer once at shutdown, forcing
minRecoveryPoint to be updated by the checkpointer, and that update was
not happening beforethe fix.  And once the standby is started again,
what's on disk gets reloaded, showing the inconsistency.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] [PATCH] Generic type subscripting

2018-11-07 Thread Pavel Stehule
On 7. 11. 2018 at 19:35, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Wed, 7 Nov 2018 at 17:09, Pavel Stehule 
> wrote:
> >
> > I don't agree. If we use the same syntax for some object types, we
> > should enforce some consistency.
>
> Just to make it clear, consistency between what?
>
> > I don't think you should introduce nulls for JSONs. In this case,
> > the most correct solution is raising an exception.
>
> Now it's my turn to disagree. As an argument I have this thread [1],
> where a similar discussion happened about the flexibility of jsonb and
> throwing errors (in this particular case, whether or not to throw an
> error when a non-existing path was given to jsonb_set).
>

That doesn't mean it is designed well.

>
> I can imagine a significant number of use cases where adding a value to
> jsonb like that is the desirable outcome, and I'm not sure I can come up
> with an example where strictness is the best result. Maybe if you have
> something in mind, you can describe what would be the case for that?
> Also, as I've mentioned before, consistency between jsonb_set and the
> jsonb subscripting operator will help us avoid tons of questions about
> why I can do this and this using one option, but not another.
>

I have only one argument: with this behavior, nobody knows whether the
value was appended or updated.
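
That is, with the proposed subscripting assignment (syntax per the patch
under discussion; table and column names are illustrative), the same
statement covers both cases silently:

    -- If key 'a' already exists, this updates it; if not, it creates it.
    -- The caller cannot tell which of the two happened.
    UPDATE test_jsonb SET j['a'] = '1';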




>
> [1]:
> https://www.postgresql.org/message-id/CAM3SWZT3uZ7aFktx-nNEWGbapN1oy2t2gt10pnOzygZys_Ak1Q%40mail.gmail.com
>


Re: jsonpath

2018-11-07 Thread Tom Lane
Tomas Vondra  writes:
> BTW is the v19 really just a rebase of the preceding version?
> I'm asking because v18 was adding two types into pg_type.dat, namely
> jsonpath (6050) and _jsonpath (6051), while v19 only adds jsonpath
> (6050).

I haven't looked at this patch, but manual generation of array type entries
is obsolete since 3dc820c43e427371b66d217f2bd5481fc9ef2e2d.  There should
still be a mention of the OID to use for the array type though ...
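
For what it's worth, one way to check that the array type was generated
for the new type (a sketch):

    SELECT typname, typarray::regtype AS array_type
      FROM pg_type
     WHERE typname = 'jsonpath';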

regards, tom lane



Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2018-11-07 Thread Andrew Gierth
> "Michael" == Michael Paquier  writes:

 Michael> I have also been trying to shape that into a TAP test which
 Michael> can be added into the in-core recovery test suite, and it
 Michael> happens that the part which scans if all the pages of a
 Michael> relation are not newer than what minRecoveryPoint is set to in
 Michael> the control file can be easily calculated by using pageinspect
 Michael> and pg_control_recovery() with a simple SQL query.

How? It's OK (and normal) for in-core pages to have newer LSNs than
minRecoveryPoint, it's only on-disk pages that must not be more recent
than that. And pageinspect will show the in-core page...

-- 
Andrew (irc:RhodiumToad)



Re: csv format for psql

2018-11-07 Thread Michael Paquier
On Wed, Nov 07, 2018 at 05:36:54PM +0100, Daniel Verite wrote:
> I guess it could go into a #define in psql/settings.h, along with
> these:
> #define DEFAULT_FIELD_SEP "|"
> #define DEFAULT_RECORD_SEP "\n"

Yes.

> Yes, since we support column names with embedded newlines (even
> though it's hard to think of a legit use case for that) and CSV fields
> support embedded newlines too.

Hm.  We would likely want at least a comment about that..

Anyway, I am still going through the patch, so no need to send a new
version for now.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-11-07 Thread Michael Paquier
On Wed, Nov 07, 2018 at 06:42:00PM +0100, Peter Eisentraut wrote:
> -n is often used for something like "dry run", so it didn't go for that
> here.  I suspect the cloning will remain a marginal option for some
> time, so having only a long option is acceptable.

Good point.  I didn't consider this, and it is true that --dry-run is
used, pg_rewind being one example.
--
Michael


signature.asc
Description: PGP signature


Adding a TAP test checking data consistency on standby with minRecoveryPoint

2018-11-07 Thread Michael Paquier
Hi all,

While working on the fix committed as c186ba13, which fixes the way
minRecoveryPoint is updated by processes other than the startup process, I
was struggling to turn the problem into a reproducible test case.

I have been thinking about what Andrew Gierth mentioned yesterday, and
roughly designed a test case, described here, which is able to show the
problem:
https://www.postgresql.org/message-id/20181107044915.gf1...@paquier.xyz

I have also been trying to shape that into a TAP test which can be added
to the in-core recovery test suite, and it happens that the part which
checks that no page of a relation is newer than what minRecoveryPoint is
set to in the control file can easily be expressed with pageinspect and
pg_control_recovery() in a simple SQL query.  So, digging into it, I have
been able to get a reproducible TAP test case, which is in the attached
patch.
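
The check boils down to something like this sketch (assuming pageinspect
is installed; the relation name and block number are illustrative):

    SELECT lsn <= min_recovery_end_lsn AS consistent
      FROM page_header(get_raw_page('test1', 0)),
           pg_control_recovery();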

On HEAD, if you revert c186ba13 and then run the test, the inconsistency
shows up immediately.  Keeping the fix makes the test pass.

This test will make sure that we don't again break how minRecoveryPoint is
handled across multiple processes, so I think it would be a good addition
for HEAD going forward.

Thoughts?
--
Michael
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index daf79a0b1f..0364a927c3 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pageinspect
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/016_min_consistency.pl b/src/test/recovery/t/016_min_consistency.pl
new file mode 100644
index 00..b6d06b2ad3
--- /dev/null
+++ b/src/test/recovery/t/016_min_consistency.pl
@@ -0,0 +1,92 @@
+# Test for checking consistency of on-disk pages for a cluster with
+# the minimum recovery LSN, ensuring that the updates happen across
+# all processes.  In this test, the startup process and the checkpoint,
+# triggering the non-startup code paths, are both checked.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $primary = get_new_node('primary');
+$primary->init(allows_streaming => 1);
+
+# Set shared_buffers to a very low value to enforce discard and flush
+# of PostgreSQL buffers on standby, enforcing other processes than the
+# startup process to update the minimum recovery LSN in the control
+# file.
+$primary->append_conf("postgresql.conf", "shared_buffers = 128kB");
+$primary->start;
+
+# Take a base backup of the primary.
+$primary->backup('bkp');
+my $standby = get_new_node('standby');
+$standby->init_from_backup($primary, 'bkp', has_streaming => 1);
+$standby->start;
+
+# Dummy table and extension creation for the upcoming tests.
+$primary->safe_psql('postgres',
+	'create extension pageinspect');
+$primary->safe_psql('postgres',
+	'create table test1 (a int) with (fillfactor = 10)');
+$primary->safe_psql('postgres', 'insert into test1 select generate_series(1, 1)');
+
+# Take a checkpoint and enforce post-checkpoint full page writes
+# which makes the startup process replay those pages, updating
+# minRecoveryPoint.
+$primary->safe_psql('postgres', 'checkpoint');
+$primary->safe_psql('postgres', 'update test1 set a = a + 1');
+
+# Fill in the standby's shared buffers with the data filled in
+# previously.
+$standby->safe_psql('postgres', 'select count(*) from test1');
+
+# Update the table again, this does not generate full page writes so
+# the standby will replay records associated with it, but the startup
+# process will not flush those pages.
+$primary->safe_psql('postgres', 'update test1 set a = a + 1');
+
+# Extract from the relation the last block created, this will be
+# used at the end of the test for sanity checks.
+my $last_block;
+$primary->psql('postgres',
+	"select pg_relation_size('test1')::int / setting::int - 1 from pg_settings where name = 'block_size';",
+	stdout => \$last_block);
+
+# Wait for last record to have been replayed on the standby.
+$primary->wait_for_catchup($standby, 'replay',
+		   $primary->lsn('insert'));
+
+# Issue a restart point on the standby now, which makes the checkpointer
+# update minRecoveryPoint.
+$standby->safe_psql('postgres', 'checkpoint');
+
+# Now shut down the primary violently so as the standby does not
+# receive the shutdown checkpoint, making sure that the startup
+# process does not flush any pages on its side.  The standby is
+# cleanly stopped, which makes the checkpointer update minRecoveryPoint
+# with the restart point created at shutdown.
+$primary->stop('immediate');
+$standby->stop('fast');
+
+# Now restart the standby and check the state of the instance. All
+# the pages of the relation previously created should not have a
+# LSN newer than what minRecoveryPoint has.
+$standby->start;
+
+# Check that the last page of the table, which is the 

Re: move PartitionBoundInfo creation code

2018-11-07 Thread Amit Langote
On 2018/11/08 0:04, Alvaro Herrera wrote:
>> On Wed, Nov 07, 2018 at 03:34:38PM +0900, Amit Langote wrote:
>>> I think we can design the interface of partition_bounds_create such that
>>> it returns information needed to re-arrange OIDs to be in the canonical
>>> partition bound order, so the actual re-ordering of OIDs can be done by
>>> the caller itself.  (It may not always be OIDs, could be something else
>>> like a struct to represent fake partitions.)
> 
> This is interesting.  I don't think the current interface is so bad that
> we can't make a few tweaks later; IOW let's focus on the current patch
> for now, and we can improve later.  You (and David, and maybe someone
> else) already have half a dozen patches touching this code, and I bet
> some of these will have to be rebased over this one.

The patch on the ATTACH/DETACH CONCURRENTLY thread perhaps touches code
close to here, but I'm not sure whether any of it needs to be concerned
with how we generate the PartitionBoundInfo, which the patch here is
trying to move into its own function.

Thanks,
Amit




Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-07 Thread Amit Langote
On 2018/11/08 11:01, Robert Haas wrote:
> On Wed, Nov 7, 2018 at 7:06 PM David Rowley
>  wrote:
>> While the find_all_inheritors() call is something I'd like to see
>> gone, I assume it was done that way since an UPDATE might route a
>> tuple to a partition that there is no subplan for and due to INSERT
>> with VALUES not having any RangeTblEntry for any of the partitions.
>> Simply, any partition which is a descendant of the target partition
>> table could receive the tuple regardless of what might have been
>> pruned.
> 
> Thanks.  I had figured out since my email of earlier today that it was
> needed in the INSERT case, but I had not thought of/discovered the
> case of an UPDATE that routes a tuple to a pruned partition.  I think
> that latter case may not be tested in our regression tests, which is
> perhaps something we ought to change.
> 
> Honestly, I *think* that the reason that find_all_inheritors() call is
> there is because I had the idea that it was important to try to lock
> partition hierarchies in the same order in all cases so as to avoid
> spurious deadlocks.  However, I don't think we're really achieving
> that goal despite this code.  If we arrive at this point having
> already locked some relations, and then lock some more, based on
> whatever got pruned, we're clearly not using a deterministic locking
> order.  So I think we could probably rip out the find_all_inheritors()
> call here and change the NoLock in get_partition_dispatch_recurse() to
> just take a lock.  That's probably a worthwhile simplification and a
> slight optimization regardless of anything else.

A patch that David and I have been working on over at:

https://commitfest.postgresql.org/20/1690/

does that.  With that patch, partitions (leaf or not) are locked and
opened only if a tuple is routed to them.  In edd44738bc (Be lazier about
partition tuple routing), we postponed the opening of leaf partitions, but
we still left the RelationGetPartitionDispatchInfo machinery, which
recursively creates PartitionDispatch structs for all partitioned tables
in a tree.  The patch mentioned above postpones even the initialization of
a partitioned partition until a tuple is routed to it.

The patch doesn't yet eliminate the find_all_inheritors call from
ExecSetupPartitionTupleRouting, but that's mostly out of fear that if we
become lazier about locking individual partitions too, we'll get a
non-deterministic locking order on partitions, which we might want to
avoid because of deadlock concerns.  Maybe we don't need to be fearful,
though.

> But I really think it would be better if we could also jigger this to
> avoid reopening relations which the executor has already opened and
> locked elsewhere.  Unfortunately, I don't see a really simple way to
> accomplish that.  We get the OIDs of the descendants and want to know
> whether there is a range table entry for that OID; but there's no data
> structure which answers that question at present, I believe, and
> introducing one just for this purpose seems like an awful lot of new
> machinery.  Perhaps that new machinery would still have less
> far-reaching consequences than the machinery Alvaro is proposing, but,
> still, it's not very appealing.

The newly added ExecGetRangeTableRelation opens (if not already done) and
returns a Relation pointer for tables that are present in the range table,
so it requires being passed a valid RT index.  That works for tables that
the planner touched.  UPDATE tuple routing benefits from that in cases
where the routing target is already in the range table.

For the insert itself, the planner adds only the target partitioned table
to the range table.  Partitions that the inserted tuples will route to may
be present in the range table via some other plan node, but the insert's
execution state won't know about them, so it cannot use
ExecGetRangeTableRelation.

> Perhaps one idea is only open and lock partitions on demand - i.e. if
> a tuple actually gets routed to them.  There are good reasons to do
> that independently of reducing lock levels, and we certainly couldn't
> do it without having some efficient way to check whether it had
> already been done.  So then the mechanism wouldn't feel like so much
> like a special-purpose hack just for concurrent ATTACH/DETACH.  (Was
> Amit Langote already working on this, or was that some other kind of
> on-demand locking?)

I think the patch mentioned above gets us closer to that goal.

Thanks,
Amit




Re: move PartitionBoundInfo creation code

2018-11-07 Thread Amit Langote
On 2018/11/08 11:48, Michael Paquier wrote:
>>> Thinking crazy, we could also create a subfolder partitioning/bounds/
>>> which includes three files with this refactoring:
>>> - range.c
>>> - values.c
>>> - hash.c
>>> Then we keep partbounds.c which is the entry point used by partcache.c,
>>> and partbounds.c relies on the stuff in each other sub-file when
>>> building a strategy.  This move could be interesting in the long term as
>>> there are comparison routines and structures for each strategy if more
>>> strategies are added.
>>
>> I'm not sure this is worthwhile.  You'll create a bunch of independent
>> translation units in exchange for ...?
> 
> That's the part about going crazy and wild on this refactoring.  Not all
> crazy ideas are worth doing, but I still like mentioning all
> possibilities so that we miss nothing in the review process.

It might be okay to split the big switch in partition_bounds_create() into
different functions for different partitioning methods for clarity as you
say, but let's avoid creating new files for range, list, and hash.

I will post an updated patch with that break down.

Thanks,
Amit




Re: PostgreSQL Limits and lack of documentation about them.

2018-11-07 Thread David Rowley
On 8 November 2018 at 10:02, Robert Haas  wrote:
> IMHO, documenting that you can get up to 1600 integer columns but only
> 1002 bigint columns doesn't really help anybody, because nobody has a
> table with only one type of column, and people usually want to have
> some latitude to run ALTER TABLE commands later.
>
> It might be useful for some users to explain that certain things
> should work for values < X, may work for values between X and Y, and
> will definitely not work above Y.  Or maybe we can provide a narrative
> explanation rather than just a table of numbers.  Or both.  But I
> think trying to provide a table of exact cutoffs is sort of like
> tilting at windmills.

I added something along those lines in a note below the table.  Likely
there are better ways to format all this, but I'm trying to nail down what
the content should be first.

Hopefully I've addressed the other things mentioned too.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Appendix B. Database Limitations

 The following table describes the limits of PostgreSQL.

 Table B.1. PostgreSQL limitations

   Item                                         Limit          Comment
   Maximum Database Size                        Unlimited
   Maximum Number of Databases                  Unlimited
   Maximum Relation Size                        32 TB          Limited to 2^32 - 1 pages per relation
   Maximum Columns per Table                    1600           Further limited by tuple size fitting on a
                                                               single page; see note below
   Maximum Field Size                           1 GB
   Maximum Identifier Length                    63 characters  Can be increased by recompiling PostgreSQL
   Maximum Rows per Table                       Unlimited
   Maximum Indexes per Table                    Unlimited
   Maximum Indexed Columns                      32             Can be increased by recompiling PostgreSQL;
                                                               limit includes any INCLUDE columns
   Maximum Partition Keys                       32             Can be increased by recompiling PostgreSQL
   Maximum Relations per Database               Unlimited
   Maximum Partitions per Partitioned Relation  268,435,456    May be increased by using sub-partitioning

 Note
 The maximum number of columns for a table is further reduced as the tuple
 being stored must fit on a single heap page.  Variable-length fields such
 as TEXT, VARCHAR and CHAR can have their values stored out of line in the
 table's TOAST table when the values are large enough to require it; only
 an 18-byte pointer must then remain inside the tuple in the table's heap.
 For shorter variable-length fields, either a 4-byte or a 1-byte field
 header is used and the value is stored inside the heap tuple.  Often this
 means the actual maximum number of columns that you can store inside a
 table is further reduced, as the tuple can become too large to fit inside
 a single 8192-byte heap page.  For example, excluding the tuple header, a
 tuple made up of 1600 INT columns would consume 6400 bytes and could be
 stored in a heap page, but a tuple of 1600 BIGINT columns would consume
 12800 bytes and would therefore not fit inside a heap page.
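
 For instance, this can be demonstrated with a generated table (a sketch
 using psql's \gexec; names are illustrative): the CREATE itself succeeds,
 since 1600 columns is within the limit, but a fully non-null row does not
 fit on a page.

    SELECT format('CREATE TABLE wide_tbl (%s)',
                  string_agg('c' || i || ' bigint', ', '))
      FROM generate_series(1, 1600) AS i \gexec

    -- An all-null row fits (only the null bitmap is stored), but filling
    -- every column fails with an error like "row is too big".
    INSERT INTO wide_tbl DEFAULT VALUES;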

 Columns which have been dropped from the table also contribute to the
 maximum column limit.  Although the dropped column values for newly
 created tuples are internally marked as NULL in the tuple's null bitmap,
 that bitmap does occupy space.

v3-0001-Add-documentation-section-appendix-detailing-some.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2018-11-07 Thread Thomas Munro
On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh  wrote:
> On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
>  wrote:
> > It passes make check on Unix and Windows, though currently it's
> > failing some of the TAP tests for reasons I'm looking into (possibly
> > due to bugs in the lower level patches, not sure).
> >
> I looked into the regression failures when the tap-tests are enabled.
> It seems that we're not estimating and allocating the shared memory
> for rollback-hash tables correctly. I've added a patch to fix the
> same.

Thanks Kuntal.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: move PartitionBoundInfo creation code

2018-11-07 Thread Michael Paquier
On Wed, Nov 07, 2018 at 12:04:27PM -0300, Alvaro Herrera wrote:
> On 2018-Nov-07, Michael Paquier wrote:
> This is interesting.  I don't think the current interface is so bad that
> we can't make a few tweaks later; IOW let's focus on the current patch
> for now, and we can improve later.  You (and David, and maybe someone
> else) already have half a dozen patches touching this code, and I bet
> some of these will have to be rebased over this one.

Yeah, I am actually quite a fan of having the mapping returned by
partition_bounds_create and letting the caller handle the OID assignment.
That's neater.  I think that the patch could gain a tad more clarity by
splitting the sub-processing for each strategy into its own routine,
though.  The presented switch/case is a bit hard to follow in the last
patch set, and harder to review in detail.

>> Thinking crazy, we could also create a subfolder partitioning/bounds/
>> which includes three files with this refactoring:
>> - range.c
>> - values.c
>> - hash.c
>> Then we keep partbounds.c which is the entry point used by partcache.c,
>> and partbounds.c relies on the stuff in each other sub-file when
>> building a strategy.  This move could be interesting in the long term as
>> there are comparison routines and structures for each strategy if more
>> strategies are added.
> 
> I'm not sure this is worthwhile.  You'll create a bunch of independent
> translation units in exchange for ...?

That's the part about going crazy and wild on this refactoring.  Not all
crazy ideas are worth doing, but I still like mentioning all possibilities
so that we miss nothing in the review process.
--
Michael


signature.asc
Description: PGP signature


Re: Should new partitions inherit their tablespace from their parent?

2018-11-07 Thread Michael Paquier
On Thu, Nov 08, 2018 at 12:50:40PM +1300, David Rowley wrote:
> How about we record the tablespace option for the partitioned table in
> reltablespace instead of saving it as 0.  Newly created partitions
> which don't have a TABLESPACE mentioned in the CREATE TABLE command
> should be created in their direct parent partitioned tables
> tablespace.

I have seen enough complaints on the mailing lists regarding the way
tablespaces are handled for partitioned tables and their partitions that
it looks like a very good idea to make the tablespace inherited
automatically, by setting reltablespace to a non-zero value even if a
partitioned table has no physical presence.  Of course not on v11 or older
releases, just on HEAD.  It is no good to have partitioned indexes and
partitioned tables handled inconsistently in such things.
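
To make that concrete, a sketch of the proposed semantics (tablespace and
table names are illustrative; this is the proposal, not what v11 does):

    CREATE TABLE parent (a int) PARTITION BY RANGE (a) TABLESPACE ts1;
    -- Under the proposal, reltablespace for "parent" records ts1, and a
    -- partition created without an explicit TABLESPACE inherits it:
    CREATE TABLE part1 PARTITION OF parent FOR VALUES FROM (0) TO (10);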
--
Michael


signature.asc
Description: PGP signature


Re: Postgres, fsync, and OSs (specifically linux)

2018-11-07 Thread Thomas Munro
On Fri, Oct 19, 2018 at 6:42 PM Craig Ringer  wrote:
> On Fri, 19 Oct 2018 at 07:27, Thomas Munro  
> wrote:
>> 2.  I am +1 on back-patching Craig's PANIC-on-failure logic.  Doing
>> nothing is not an option I like.  I have some feedback and changes to
>> propose though; see attached.
>
> Thanks very much for the work on reviewing and revising this.

My plan is to do a round of testing and review of this stuff next week,
once the dust has settled on the current minor releases (including fixing
a few typos I just spotted and some word-smithing).  All going well, I
will then push the resulting patches to master and all supported stable
branches, unless other reviews or objections appear.
At some point not too far down the track I hope to be ready to
consider committing that other patch that will completely change all
of this code in the master branch, but in any case Craig's patch will
get almost a full minor release cycle to sit in the stable branches
before release.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-07 Thread David Rowley
On 8 November 2018 at 15:01, Robert Haas  wrote:
> Honestly, I *think* that the reason that find_all_inheritors() call is
> there is because I had the idea that it was important to try to lock
> partition hierarchies in the same order in all cases so as to avoid
> spurious deadlocks.  However, I don't think we're really achieving
> that goal despite this code.  If we arrive at this point having
> already locked some relations, and then lock some more, based on
> whatever got pruned, we're clearly not using a deterministic locking
> order.  So I think we could probably rip out the find_all_inheritors()
> call here and change the NoLock in get_partition_dispatch_recurse() to
> just take a lock.  That's probably a worthwhile simplification and a
> slight optimization regardless of anything else.

I'd not thought of the locks taken elsewhere case. I guess it just
perhaps reduces the chances of a deadlock then.

A "slight optimization" is one way to categorise it. There are some
benchmarks you might find interesting in [1] and [2]. Patch 0002 does
just what you mention.

[1] 
https://www.postgresql.org/message-id/06524959-fda8-cff9-6151-728901897b79%40redhat.com
[2] 
https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz%3DGVBwvGh4a6xA%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 7:06 PM David Rowley
 wrote:
> While the find_all_inheritors() call is something I'd like to see
> gone, I assume it was done that way since an UPDATE might route a
> tuple to a partition that there is no subplan for and due to INSERT
> with VALUES not having any RangeTblEntry for any of the partitions.
> Simply, any partition which is a descendant of the target partition
> table could receive the tuple regardless of what might have been
> pruned.

Thanks.  I had figured out since my email of earlier today that it was
needed in the INSERT case, but I had not thought of/discovered the
case of an UPDATE that routes a tuple to a pruned partition.  I think
that latter case may not be tested in our regression tests, which is
perhaps something we ought to change.

Honestly, I *think* that the reason that find_all_inheritors() call is
there is because I had the idea that it was important to try to lock
partition hierarchies in the same order in all cases so as to avoid
spurious deadlocks.  However, I don't think we're really achieving
that goal despite this code.  If we arrive at this point having
already locked some relations, and then lock some more, based on
whatever got pruned, we're clearly not using a deterministic locking
order.  So I think we could probably rip out the find_all_inheritors()
call here and change the NoLock in get_partition_dispatch_recurse() to
just take a lock.  That's probably a worthwhile simplification and a
slight optimization regardless of anything else.

But I really think it would be better if we could also jigger this to
avoid reopening relations which the executor has already opened and
locked elsewhere.  Unfortunately, I don't see a really simple way to
accomplish that.  We get the OIDs of the descendants and want to know
whether there is a range table entry for that OID; but there's no data
structure which answers that question at present, I believe, and
introducing one just for this purpose seems like an awful lot of new
machinery.  Perhaps that new machinery would still have less
far-reaching consequences than the machinery Alvaro is proposing, but,
still, it's not very appealing.

Perhaps one idea is only open and lock partitions on demand - i.e. if
a tuple actually gets routed to them.  There are good reasons to do
that independently of reducing lock levels, and we certainly couldn't
do it without having some efficient way to check whether it had
already been done.  So then the mechanism wouldn't feel like so much
like a special-purpose hack just for concurrent ATTACH/DETACH.  (Was
Amit Langote already working on this, or was that some other kind of
on-demand locking?)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: jsonpath

2018-11-07 Thread Tomas Vondra
BTW is the v19 really just a rebase of the preceding version?

I'm asking because v18 was adding two types into pg_type.dat, namely
jsonpath (6050) and _jsonpath (6051), while v19 only adds jsonpath
(6050). I noticed because v18 was missing a comma between the two
entries, which I'd say is a bug - interestingly enough, both types were
created correctly, so it seems to be resilient to such typos.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: jsonpath

2018-11-07 Thread Tomas Vondra


On 11/6/18 4:48 PM, Tomas Vondra wrote:
> On 11/6/18 3:31 PM, Nikita Glukhov wrote:
>> On 29.10.2018 2:20, Tomas Vondra wrote:>
>>
>> ...
>>>
>>> 9) It's generally a good idea to make the individual pieces committable
>>> separately, but that means e.g. the regression tests have to pass after
>>> each patch. At the moment that does not seem to be the case for 0002,
>>> see the attached file. I'm running with -DRANDOMIZE_ALLOCATED_MEMORY,
>>> not sure if that's related.
>>
>> This should definitely be a bug in json support, but I can't reproduce
>> it simply by defining -DRANDOMIZE_ALLOCATED_MEMORY.  Could you provide
>> a stack trace at least?
>>
> I'll try.
> 

Not sure why you can't reproduce the failures, it's perfectly
reproducible for me. For the record, I'm doing this:

./configure --prefix=/home/user/pg-jsonpath --enable-debug
--enable-cassert CFLAGS="-O0 -DRANDOMIZE_ALLOCATED_MEMORY" && make -s
clean && make -s -j4 && make check

After sticking Assert(false) to JsonEncodeJsonbValue (to the default
case), I get a failure like this:

  select json '{}' @* 'lax $[0]';
! WARNING:  unknown jsonb value type: 20938064
! server closed the connection unexpectedly
!   This probably means the server terminated abnormally
!   before or while processing the request.
! connection to server was lost

The backtrace is attached.  My guess is that JsonValueListGetList in
jsonb_jsonpath_query only does a shallow copy instead of copying the
pieces into funcctx->multi_call_memory_ctx, so the result gets corrupted
on subsequent calls.

I also attach a valgrind report, but I suppose the reported issues are a
consequence of the same bug.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
(gdb) bt
#0  0x7aa8bd7ddeab in raise () from /lib64/libc.so.6
#1  0x7aa8bd7c85b9 in abort () from /lib64/libc.so.6
#2  0x00a34154 in ExceptionalCondition (conditionName=0xc6d888 "!(0)", 
errorType=0xc6ce18 "FailedAssertion", fileName=0xc6caa0 "json.c", 
lineNumber=3047) at assert.c:54
#3  0x009415e7 in JsonEncodeJsonbValue (buf=0x7ffca408b870, 
jbv=0x7ffca408b830) at json.c:3047
#4  0x009416a4 in JsonbValueToJson (jbv=0x7ffca408b830) at json.c:3077
#5  0x0096707c in jsonb_jsonpath_query (fcinfo=0x13f4cf0) at 
jsonpath_exec.c:2735
#6  0x009670bb in json_jsonpath_query2 (fcinfo=0x13f4cf0) at 
jsonpath_exec.c:2741
#7  0x006e1c18 in ExecMakeFunctionResultSet (fcache=0x13f4c80, 
econtext=0x13f41b8, argContext=0x13f5d40, isNull=0x13f4be0, isDone=0x13f4c60) 
at execSRF.c:604
#8  0x0070d858 in ExecProjectSRF (node=0x13f40a0, continuing=false) at 
nodeProjectSet.c:175
#9  0x0070d715 in ExecProjectSet (pstate=0x13f40a0) at 
nodeProjectSet.c:105
#10 0x006dee78 in ExecProcNodeFirst (node=0x13f40a0) at 
execProcnode.c:445
#11 0x006d3e56 in ExecProcNode (node=0x13f40a0) at 
../../../src/include/executor/executor.h:237
#12 0x006d66fa in ExecutePlan (estate=0x13f3e48, planstate=0x13f40a0, 
use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, 
direction=ForwardScanDirection, dest=0x13fa1b0, execute_once=true) at 
execMain.c:1707
#13 0x006d447d in standard_ExecutorRun (queryDesc=0x1353768, 
direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#14 0x006d42ab in ExecutorRun (queryDesc=0x1353768, 
direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:307
#15 0x008be5ff in PortalRunSelect (portal=0x1397348, forward=true, 
count=0, dest=0x13fa1b0) at pquery.c:932
#16 0x008be2a4 in PortalRun (portal=0x1397348, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x13fa1b0, 
altdest=0x13fa1b0, completionTag=0x7ffca408be70 "") at pquery.c:773
#17 0x008b8271 in exec_simple_query (query_string=0x1330738 "select 
json '{}' @* 'lax $[0]';") at postgres.c:1215
#18 0x008bc580 in PostgresMain (argc=1, argv=0x135bab0, 
dbname=0x135b820 "regression", username=0x132d028 "user") at postgres.c:4243
#19 0x00819f7a in BackendRun (port=0x1354940) at postmaster.c:4377
#20 0x00819748 in BackendStartup (port=0x1354940) at postmaster.c:4068
#21 0x00815c60 in ServerLoop () at postmaster.c:1700
#22 0x0081550c in PostmasterMain (argc=8, argv=0x132aeb0) at 
postmaster.c:1373
#23 0x0073ebba in main (argc=8, argv=0x132aeb0) at main.c:228
(gdb) up
#4  0x009416a4 in JsonbValueToJson (jbv=0x7ffca408b830) at json.c:3077
3077JsonEncodeJsonbValue(, jbv);
(gdb) print *jbv
$1 = {type = 20938064, val = {numeric = 0x7ffca408b870, boolean = 112, string = 
{len = -1542932368, val = 0x0}, array = {nElems = -1542932368, elems = 0x0, 
rawScalar = 16}, object = {nPairs = -1542932368, uniquified = 252, pairs = 
0x0}, binary = {len = -1542932368, data = 0x0}, datetime = {
  value = 140723060521072, typid = 0, typmod = 0, tz = 4684816}}}

Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

2018-11-07 Thread Thomas Munro
On Wed, Nov 7, 2018 at 11:03 PM Etsuro Fujita
 wrote:
> (2018/11/07 9:22), Kyotaro HORIGUCHI wrote:
> > At Tue, 06 Nov 2018 21:07:37 +0900, Etsuro 
> > Fujita  wrote 
> > in<5be18409.2070...@lab.ntt.co.jp>
> >> (2018/11/06 12:53), Kyotaro HORIGUCHI wrote:
> >>> even if it comes from another pipe, we can assume that the
> >>> SIGPIPE immediately stopped the program before returning any
> >>> garbage to us.
> >>
> >> Sorry, I don't understand this.
> >
> > Mmm. It looks confusing, sorry. In other words:
> >
> > We don't care about the reason for the program's ending if we received
> > enough data from the program and we actively (= before receiving
> > EOF) stop reading. On the other hand, if SIGPIPE (on another pipe)
> > or something fatal happened to the program before we finished reading,
> > we receive EOF just after that, and we are interested in the cause
> > of the program's death.
> >
> > # Does the above make sense?
>
> Yeah, thanks for the explanation!

+1

I take back what I said earlier about false positives from other
pipes.  I think it's only traditional Unix programs designed for use
in pipelines and naive programs that let SIGPIPE terminate the
process.   The accepted answer here gives a good way to think about
it:

https://stackoverflow.com/questions/8369506/why-does-sigpipe-exist

A program sophisticated enough to be writing to other pipes is no
longer in that category and should be setting up signal dispositions
itself, so I agree that we should enable the default disposition and
ignore WTERMSIG(exit_code) == SIGPIPE, as proposed.  That is pretty
close to the intended purpose of that signal AFAICS.

> > In the sense of "we don't care about the reason", negligible reasons
> > are not necessarily restricted to SIGPIPE; even SIGSEGV could
> > theoretically be ignored safely. "theoretically" here means it is
> > another issue whether we rely on the output from a program which
> > causes SEGV (or any reason other than SIGPIPE, which we caused).
>
> For the SIGSEGV case, I think it would be better that we don't rely on
> the output data, IMO, because I think there might be a possibility that
> the program have generated that data incorrectly/unexpectedly.

+1

I don't think we should ignore termination by signals other than
SIGPIPE: that could hide serious problems from users.  I want to know
if my program is crashing with SIGBUS, SIGTERM, SIGFPE etc, even if it
happens after we read enough data; there is a major problem that a
human needs to investigate!

> >>> Finally in the attached PoC, I set cstate->eof_reached on failure
> >>> of NextCopyFromRawFields then if eof_reached we don't ingore
> >>> SIGPIPE. For a program like
> >>> "puts("test1");fflush(stdout);kill(getpid(), SIGPIPE);", I had
> >>> the following behavior. (I got an error without the fflush() but
> >>> it is inevitable).
> >>>
> >>> =# select * from ft2 limit 1;
> >>>  a
> >>> ---
> >>>test1
> >>>
> >>> =# select * from ft2 limit 2;
> >>> ERROR:  program "/home/horiguti/work/exprog" failed
> >>> DETAIL:  child process was terminated by signal 13: Broken pipe
> >>>
> >>> For the original case:
> >>>
> >>> =# select * from ft1 limit 1;
> >>> a
> >>> --
> >>>test
> >>> =# select * from ft1 limit 2;
> >>> a
> >>> --
> >>>test
> >>> (1 row)
> >>>
> >>>
> >>> I didn't confirmed whether it is complete.
> >>
> >> Sorry, I don't understand this fully, but the reason to add the
> >> eof_reached stuff is to avoid ignoring the SIGPIPE failure in normal
> >> COPY FROM PROGRAM cases?
> >
> > Yes, if we have received EOF, the program ended before we
> > finished reading from the pipe, thus any error should be
> > reported.  Otherwise we don't care, at least about SIGPIPE. In the
> > attached patch all kinds of errors are ignored when the pipe is closed
> > from the reading end on backends.
>
> OK, I understand your idea.  Thanks for the patch!

+1

> > As the result it doesn't report an error for SELECT * FROM ft2
> > LIMIT 1 on "main(void){puts("test1"); return 1;}".
> >
> > =#  select * from ft limit 1;
> > a
> > ---
> >   test1
> > (1 row)
> >
> > limit 2 reports the error.
> >
> > =# select * from ft limit 2;
> > ERROR:  program "/home/horiguti/work/exprog" failed
> > DETAIL:  child process exited with exit code 1
>
> I think this would be contrary to users' expectations: if the SELECT
> command works for limit 1, they would expect that the command would work
> for limit 2 as well.  So, I think it would be better to error out that
> command for limit 1 as well, as-is.

I think it's correct that LIMIT 1 gives no error but LIMIT 2 gives an
error.  For LIMIT 1, we got all the rows we wanted, and then we closed
the pipe.  If we got a non-zero non-signal exit code, or a signal exit
code and it was SIGPIPE (not any other signal!), then we should
consider that to be expected.
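
For illustration, a minimal sketch of that policy using the standard
POSIX wait-status macros (the helper name and its placement are
invented here, not part of any posted patch):

#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/*
 * After we have deliberately stopped reading and closed the pipe,
 * decide whether the child's exit status can be ignored.  Only SIGPIPE
 * is treated as benign; any other signal is still an error that a
 * human should investigate.
 */
static bool
exit_status_ignorable_after_early_close(int exitstatus)
{
	if (WIFSIGNALED(exitstatus))
		return WTERMSIG(exitstatus) == SIGPIPE;
	/* any plain exit code is expected once we closed the read end */
	return true;
}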

I tried to think of a scenario where the already-received output is
truly invalidated by a later error that happens after we close the
pipe.  It could be something involving 

Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-07 Thread Peter Geoghegan
On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan  wrote:
> I think that you have to be doing a multi-level delete for a deadlock
> to take place, which isn't particularly likely to coincide with a
> concurrent insertion in general. That may explain why it's taken a
> year to get a high-quality bug report.

BTW, it's notable that Chen's query uses ON CONFLICT DO UPDATE.
Speculative insertion might cause just the wrong kind of churn,
involving continual recycling of heap TIDs.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-07 Thread Peter Geoghegan
On Mon, Oct 29, 2018 at 12:04 PM chenhj  wrote:
> ## stack of autovacuum:Acquire lock 0x2aaab670dfe4 and hold 0x2aaab4009564
> --
> (gdb) bt
> #0  0x7fe11552379b in do_futex_wait.constprop.1 () from 
> /lib64/libpthread.so.0
> #1  0x7fe11552382f in __new_sem_wait_slow.constprop.0 () from 
> /lib64/libpthread.so.0
> #2  0x7fe1155238cb in sem_wait@@GLIBC_2.2.5 () from 
> /lib64/libpthread.so.0
> #3  0x0069d362 in PGSemaphoreLock (sema=0x2ac07eb8) at 
> pg_sema.c:310
> #4  0x007095ac in LWLockAcquire (lock=0x2aaab670dfe4, 
> mode=LW_EXCLUSIVE) at lwlock.c:1233
> #5  0x004947ef in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, 
> blkno=701, isRoot=isRoot@entry=0 '\000', parent=parent@entry=0x7ffd81e4c790, 
> myoff=myoff@entry=37) at ginvacuum.c:262
> #6  0x00494874 in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, 
> blkno=blkno@entry=9954, isRoot=isRoot@entry=1 '\001', 
> parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=0)
> at ginvacuum.c:277
> #7  0x00494ed1 in ginVacuumPostingTreeLeaves 
> (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=9954, isRoot=isRoot@entry=0 '\000') at 
> ginvacuum.c:404
> #8  0x00494e21 in ginVacuumPostingTreeLeaves 
> (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=644, isRoot=isRoot@entry=1 '\001') at 
> ginvacuum.c:372
> #9  0x00495540 in ginVacuumPostingTree (rootBlkno=<optimized out>, 
> gvs=0x7ffd81e4d0f0) at ginvacuum.c:426
> #10 ginbulkdelete (info=0x7ffd81e4f720, stats=<optimized out>, 
> callback=<optimized out>, callback_state=<optimized out>) at ginvacuum.c:649
> #11 0x005e1194 in lazy_vacuum_index (indrel=0x3146e28, 
> stats=stats@entry=0x28ec200, vacrelstats=vacrelstats@entry=0x28ebc28) at 
> vacuumlazy.c:1621
> #12 0x005e214d in lazy_scan_heap (aggressive=<optimized out>, 
> nindexes=<optimized out>, Irel=<optimized out>, vacrelstats=<optimized out>, 
> options=16, onerel=0x28ec1f8)
> at vacuumlazy.c:1311
> #13 lazy_vacuum_rel (onerel=onerel@entry=0x3144fa8, 
> options=options@entry=99, params=params@entry=0x289f270, 
> bstrategy=<optimized out>) at vacuumlazy.c:258

Actually, the bigger problem is on this side of the deadlock, within
VACUUM. ginInsertCleanup() (the first/other side of the deadlock) may
have problems, but this seems worse. Commit 218f51584d5 appears to be
at fault here.

First things first: ginScanToDelete() *maintains* buffer locks on
multiple levels of a posting tree, meaning that there may be cases
where quite a few exclusive buffer locks may be held all at once (one
per level). MAX_SIMUL_LWLOCKS is 200 these days, and in practice a
B-Tree can never get that tall, but having the number of buffer locks
acquired be determined by the height of the tree is not okay, on
general principle. The nbtree code's page deletion goes to a lot of
effort to keep the number of buffer locks fixed, but nothing like that
is attempted for GIN posting trees.

Chen's original analysis of the problem seems to be more or less
accurate: the order that ginScanToDelete() acquires buffer locks as it
descends the tree (following commit 218f51584d5) is not compatible
with the order within ginFinishSplit(). The faulty code within
ginScanToDelete() crabs/couples buffer locks while descending the
tree, whereas the code within ginFinishSplit() crabs them as it
ascends the same tree.
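
To spell out the incompatible orderings (an illustration only, not code
from the tree):

/*
 * ginScanToDelete (VACUUM), descending:  lock(parent) then lock(child)
 * ginFinishSplit (inserter), ascending:  lock(child) then lock(parent)
 *
 * Two backends taking these paths over the same pair of pages can each
 * end up holding one lock while waiting for the other -- a classic
 * ABBA deadlock.
 */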

Teodor: Do you think that the issue is fixable? It looks like there
are serious issues with the design of 218f51584d5 to me. I don't think
the general "there can't be any inserters at this subtree" thing works
given that we have to couple buffer locks when moving right for other
reasons. We call ginStepRight() within ginFinishSplit(), for reasons
that date back to bug fix commit ac4ab97e from 2013 -- that detail is
probably important, because it seems to be what breaks the subtree
design (we don't delete in two phases or anything in GIN).

I think that you have to be doing a multi-level delete for a deadlock
to take place, which isn't particularly likely to coincide with a
concurrent insertion in general. That may explain why it's taken a
year to get a high-quality bug report.

-- 
Peter Geoghegan



RE: Small performance tweak to run-time partition pruning

2018-11-07 Thread Imai, Yoshikazu
On Mon, Oct 22, 2018 at 8:31 PM, David Rowley wrote:
> On 18 October 2018 at 16:13, Imai, Yoshikazu
>  wrote:
> > The patch improves the performance about 1.3% which is less than
> > David's result, but it seems still improves the performance.
> 
> Thanks for doing these benchmarks.
> 
> The speedup is small, but it becomes much more significant once other
> bottlenecks are removed. More partitions may show a larger increase, but
> more partitions also means that a larger range table array gets built
> during ExecInitRangeTable(), which is also slow.

Since I did enough tests and ISTM the patch is sophisticated,
I changed the state of this patch to "Ready for Committer".

--
Yoshikazu Imai



Inadequate executor locking of indexes

2018-11-07 Thread Tom Lane
I discovered that it's possible to trigger relation_open's new assertion
about having a lock on the relation by the simple expedient of running
the core regression tests with plan_cache_mode = force_generic_plan.
(This doesn't give me a terribly warm feeling about how thoroughly we've
exercised the relevant code, but I don't know what to do about that.)

The reason for the crash is that nodeIndexscan.c and friends all open
their index-to-scan with code like

/*
 * Open the index relation.
 *
 * If the parent table is one of the target relations of the query, then
 * InitPlan already opened and write-locked the index, so we can avoid
 * taking another lock here.  Otherwise we need a normal reader's lock.
 */
relistarget = ExecRelationIsTargetRelation(estate, node->scan.scanrelid);
indexstate->iss_RelationDesc = index_open(node->indexid,
  relistarget ? NoLock : 
AccessShareLock);

Now, at the time this code was written, InitPlan did indeed ensure that
indexes of a plan's result relation were already opened and locked,
because it calls InitResultRelInfo which used to call ExecOpenIndices.
Nowadays, the ExecOpenIndices part is postponed to ExecInitModifyTable,
which figures it can optimize matters:

/*
 * If there are indices on the result relation, open them and save
 * descriptors in the result relation info, so that we can add new
 * index entries for the tuples we add/update.  We need not do this
 * for a DELETE, however, since deletion doesn't affect indexes. Also,
 * inside an EvalPlanQual operation, the indexes might be open
 * already, since we share the resultrel state with the original
 * query.
 */
if (resultRelInfo->ri_RelationDesc->rd_rel->relhasindex &&
operation != CMD_DELETE &&
resultRelInfo->ri_IndexRelationDescs == NULL)
ExecOpenIndices(resultRelInfo,
node->onConflictAction != ONCONFLICT_NONE);

Therefore, in a plan consisting of a DELETE ModifyTable atop an indexscan
of the target table, we are opening the index without the executor
acquiring any lock on the index.  The only thing that saves us from
triggering that Assert is that in most cases the planner has already
taken a lock on any index it considered (and not released that lock).
But using a prepared plan breaks that.

This problem is ancient; it's unrelated to the recent changes to reduce
executor locking, because that only concerned table locks not index locks.
I'm not certain how much real-world impact it has, since holding a lock
on the index's parent table is probably enough to prevent dire things
from happening in most cases.  Still, it seems like a bug.

There are several things we might do to fix this:

1. Drop the "operation != CMD_DELETE" condition from the above-quoted bit
in ExecInitModifyTable.  We might be forced into that someday anyway if
we want to support non-heap-style tables, since most other peoples'
versions of indexes do want to know about deletions.

2. Drop the target-table optimization from nodeIndexscan.c and friends,
and just always open the scan index with AccessShareLock.  (BTW, it's
a bit inconsistent that these nodes don't release their index locks
at the end; ExecCloseIndices does.)

3. Teach nodeIndexscan.c and friends about the operation != CMD_DELETE
exception, so that they get the lock for themselves in that case.  This
seems pretty ugly/fragile, but it's about the only option that won't end
in adding index lock-acquisition overhead in some cases.  (But given the
planner's behavior, it's not clear to me how often that really matters.)

BTW, another thing that is bothering me about this is that, both with
the current code and options #1 and #3, there's an assumption that any
indexscan on a query's target table must be underneath the relevant
ModifyTable plan node.  Given some other plan tree structure it might
be possible to initialize the indexscan node before the ModifyTable,
breaking the assumption that the latter already got index locks.  I'm
not sure if that's actually possible at present, but it doesn't seem
like I'd want to bet on it always being so in the future.  This might
be an argument for going with option #2, which eliminates all such
assumptions.
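
For concreteness, option #2 would amount to something like the
following in ExecInitIndexScan() and friends (a sketch, not a tested
patch):

	/* Always take our own lock; don't assume ModifyTable locked it. */
	indexstate->iss_RelationDesc = index_open(node->indexid,
											  AccessShareLock);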

Another idea is to make things work more like we just made them work for
tables, which is to ensure that all index locks are taken before reaching
the executor, or at least as part of some other processing than the plan
tree walk.  That would be a good deal more work than any of the options
listed above, though, and it would definitely not be back-patchable if we
decide we need a back-patched fix.

Thoughts?

regards, tom lane



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-07 Thread David Rowley
On 8 November 2018 at 05:05, Robert Haas  wrote:
> That seems OK at present, because no new partitions can have appeared
> since ExecSetupPartitionTupleRouting() acquired locks.  But if we
> allow new partitions to be added with only ShareUpdateExclusiveLock,
> then I think there would be a problem.  If a new partition OID creeps
> into the partition descriptor after find_all_inheritors() and before
> we fetch its partition descriptor, then we wouldn't have previously
> taken a lock on it and would still be attempting to open it without a
> lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).
>
> Admittedly, it might be a bit hard to provoke a failure here because
> I'm not exactly sure how you could trigger a relcache reload in the
> critical window, but I don't think we should rely on that.
>
> More generally, it seems a bit strange that we take the approach of
> locking the entire partitioning hierarchy here regardless of which
> relations the query actually knows about.  If some relations have been
> pruned, presumably we don't need to lock them; if/when we permit
> concurrent partition, we don't need to lock any new ones that have
> materialized.  We're just going to end up ignoring them anyway because
> there's nothing to do with the information that they are or are not
> excluded from the query when they don't appear in the query plan in
> the first place.

While the find_all_inheritors() call is something I'd like to see
gone, I assume it was done that way since an UPDATE might route a
tuple to a partition that there is no subplan for and due to INSERT
with VALUES not having any RangeTblEntry for any of the partitions.
Simply put, any partition which is a descendant of the target
partitioned table could receive the tuple regardless of what might have
been pruned.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Should new partitions inherit their tablespace from their parent?

2018-11-07 Thread David Rowley
Dear Hackers,

I've just read the thread in [1] where there was a discussion on a bug
fix regarding index partitions not being created in the same
tablespace as the partitioned index was created.

One paragraph that stood out when reading the thread was:

On 8 November 2018 at 09:00, Robert Haas  wrote:
> With regard to this patch, I think the new behavior is fine in and of
> itself, but I *do not* think it should have been back-patched and I
> *do not* think it should work one way for tables and another for
> indexes.

I don't agree with that 100%, as there's no option to specify where an
index partition gets automatically created when a new partition is
ATTACHed, whereas with new partitioned tables you can at least specify
the TABLESPACE option.


Anyway, Robert mentioned that he does not think it should work one way
for partitioned tables and another for partitioned indexes.

Here's the current situation with partitioned tables:

create table listp (a int) partition by list(a) tablespace foo;
create table listp1 partition of listp for values in(1);
select relname,reltablespace from pg_class where relname like 'listp%';
 relname | reltablespace
-+---
 listp   | 0
 listp1  | 0
(2 rows)

It does not seem very good that, when I created the partitioned table
and asked for its tablespace to be "foo", postgres ignored that.
However, a partitioned table has no storage, so you could entertain
claims that this is not a bug.

I do have experience with using partitioned tables in a production
environment. During that time a series of events took place that I
don't think are unique to this one case.  I think we could make life
easier here. Here's the scenario:

1. Start a business
2. Record some time series data.
3. Your business grows faster than you thought.
4. You painfully partition your time-series table so you can easily
get rid of older data.
5. Your business grows even more and the single disk partition you use
for data on the database server is filling up quickly.
6. Panic.
7. Add more disks and be thankful you partitioned your table back in #4.

Now, we can, of course, manage all this today, but new partitions
defaulting to the default tablespace might seem a little surprising.
Imagine the situation of the DBA removing the old partitions: it's
pretty convenient if, once they've done that, they can then glance at
free disk space and perhaps decide to change the default partition
tablespace to the disk partition with the most free space, so that the
nightly batch job which creates new partitions puts them in the best
place.  The DBA could set the default_tablespace GUC, but the
tablespaces for the partitioned table might be on slower storage, due
to how large the tables become, and they might not want all new tables
to go in there.

I think we can do better:

How about we record the tablespace option for the partitioned table in
reltablespace instead of saving it as 0?  Newly created partitions
which don't have a TABLESPACE mentioned in the CREATE TABLE command
would then be created in their direct parent partitioned table's
tablespace.
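
A sketch of what that rule might look like during table creation (the
variable names here are invented for illustration; this is not a
patch):

/*
 * In DefineRelation(), if no TABLESPACE clause was given and we are
 * creating a partition, inherit the direct parent's reltablespace
 * before falling back to default_tablespace.
 */
if (!OidIsValid(tablespaceId) && parentrel != NULL)
	tablespaceId = parentrel->rd_rel->reltablespace;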

The above idea was mentioned in [2] and many people seemed to like it.
It was never implemented, and that likely led to [1], which is not a
great situation.

I've no patch, but I'm willing to go and write one if the general
consensus is that we want it to work this way.

[1] 
https://www.postgresql.org/message-id/flat/20181102003138.uxpaca6qfxzskepi%40alvherre.pgsql
[2] 
https://www.postgresql.org/message-id/flat/CAKJS1f9PXYcT%2Bj%3DoyL-Lquz%3DScNwpRtmD7u9svLASUygBdbN8w%40mail.gmail.com#089ce41fe9a33c340f7433e5f0018912

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-07 Thread Peter Geoghegan
On Mon, Oct 29, 2018 at 12:04 PM chenhj  wrote:
> #4  0x007095ac in LWLockAcquire (lock=0x2aaab4009564, 
> mode=LW_EXCLUSIVE) at lwlock.c:1233
> #5  0x00490a32 in ginStepRight (buffer=6713826, index=<optimized 
> out>, lockmode=lockmode@entry=2) at ginbtree.c:174
> #6  0x00490c1c in ginFinishSplit 
> (btree=btree@entry=0x7ffd81e4f950, stack=0x28eebf8, 
> freestack=freestack@entry=1 '\001', buildStats=buildStats@entry=0x0) at 
> ginbtree.c:701
> #7  0x00491396 in ginInsertValue 
> (btree=btree@entry=0x7ffd81e4f950, stack=<optimized out>, 
> insertdata=insertdata@entry=0x7ffd81e4f940, buildStats=buildStats@entry=0x0)
> at ginbtree.c:773
> #8  0x0048fb42 in ginInsertItemPointers (index=<optimized out>, 
> rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, 
> nitem=nitem@entry=353, buildStats=buildStats@entry=0x0)
> at gindatapage.c:1907
> #9  0x0048a9ea in ginEntryInsert 
> (ginstate=ginstate@entry=0x28e7ef8, attnum=<optimized out>, key=42920440, 
> category=<optimized out>, items=0x2916598, nitem=353,
> buildStats=buildStats@entry=0x0) at gininsert.c:214
> #10 0x00496d94 in ginInsertCleanup 
> (ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', 
> fill_fsm=fill_fsm@entry=1 '\001',
> forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at 
> ginfast.c:883
> #11 0x00497727 in ginHeapTupleFastInsert 
> (ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) 
> at ginfast.c:448
> #12 0x0048b209 in gininsert (index=<optimized out>, 
> values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, 
> heapRel=<optimized out>, checkUnique=<optimized out>,
> indexInfo=0x28b5aa8) at gininsert.c:522

I want to point something out about the above inserter represented by
this stack trace. It deadlocks against VACUUM from within the
following ginEntryInsert() call:

/*
 * While we left the page unlocked, more stuff might have gotten
 * added to it.  If so, process those entries immediately.  There
 * shouldn't be very many, so we don't worry about the fact that
 * we're doing this with exclusive lock. Insertion algorithm
 * guarantees that inserted row(s) will not continue on next page.
 * NOTE: intentionally no vacuum_delay_point in this loop.
 */
if (PageGetMaxOffsetNumber(page) != maxoff)
{
	ginInitBA(&accum);
	processPendingPage(&accum, &datums, page, maxoff + 1);

	ginBeginBAScan(&accum);
	while ((list = ginGetBAEntry(&accum,
								 &attnum, &key, &category,
								 &nlist)) != NULL)
		ginEntryInsert(ginstate, attnum, key, category,   /* <--- deadlocks */
					   list, nlist, NULL);
}

Clearly this particular call to ginEntryInsert() from within
ginInsertCleanup() is generally supposed to be a rare occurrence. It's
not surprising that it's hard for you to hit this issue.

BTW, I strongly doubt that disabling hugepages has fixed anything,
though it may have accidentally masked the problem. This is an issue
around the basic correctness of the locking protocols used by GIN.

-- 
Peter Geoghegan



Re: pread() and pwrite()

2018-11-07 Thread Thomas Munro
On Thu, Nov 8, 2018 at 4:27 AM Andrew Dunstan
 wrote:
> On 11/7/18 10:05 AM, Jesper Pedersen wrote:
> > On 11/7/18 9:30 AM, Tom Lane wrote:
> >> I'm confused by this.  Surely the pwrite-based code is writing
> >> exactly the
> >> same data as before.  Do we have to conclude that valgrind is
> >> complaining
> >> about passing uninitialized data to pwrite() when it did not complain
> >> about exactly the same thing for write()?
> >>
> >> [ looks ... ]  No, what we have to conclude is that the write-related
> >> suppressions in src/tools/valgrind.supp need to be replaced or augmented
> >> with pwrite-related ones.
> >
> > The attached patch fixes this for me.
> >
> > Unfortunately pwrite* doesn't work for the pwrite64(buf) line.
>
> Works for me. If there's no objection I will commit this.

Thanks for adjusting that.  I suppose I would have known about this if
cfbot checked every patch with valgrind, which I might look into.

I'm a little confused about how an uninitialised value originating in
an OID list finishes up in an xlog buffer, considering that OIDs don't
have padding.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Tid scan improvements

2018-11-07 Thread Edmund Horner
On Tue, 6 Nov 2018 at 16:52, Alvaro Herrera  wrote:
> On 2018-Nov-06, David Rowley wrote:
> > 14. we pass 'false' to what?
> >
> > + * save the tuple and the buffer returned to us by the access methods in
> > + * our scan tuple slot and return the slot.  Note: we pass 'false' because
> > + * tuples returned by heap_getnext() are pointers onto disk pages and were
> > + * not created with palloc() and so should not be pfree()'d.  Note also
> > + * that ExecStoreHeapTuple will increment the refcount of the buffer; the
> > + * refcount will not be dropped until the tuple table slot is cleared.
> >   */
>
> Seems a mistake stemming from 29c94e03c7d0 ...

Yep -- I copied that bit from nodeSeqscan.c.  Some of the notes were
removed in that change, but nodeSeqscan.c and nodeIndexscan.c still
have them.

I made a little patch to remove them.


remove-obsolete-ExecStoreTuple-notes.patch
Description: Binary data


Re: Calculate total_table_pages after set_base_rel_sizes()

2018-11-07 Thread David Rowley
On 8 November 2018 at 06:17, Tom Lane  wrote:
> Pushed with cosmetic adjustments.

Thanks for adjusting and pushing.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: fix psql \conninfo & \connect when using hostaddr

2018-11-07 Thread Robert Haas
On Fri, Oct 26, 2018 at 9:54 AM Fabien COELHO  wrote:
> About updating psql's behavior, without this patch:
>
>sh> psql "host=foo hostaddr=127.0.0.1"
>
>psql> \conninfo
>You are connected to database "fabien" as user "fabien" on host "foo" at 
> port "5432".
> # NOPE, I'm really connected to localhost, foo does not even exist
> # Other apparent inconsistencies are possible when hostaddr overrides
> # "host" which is an socket directory or an IP.

I remain of the opinion that this is not a bug.  You told it that foo
has address 127.0.0.1 and it believed you; that's YOUR fault.

> After the patch:
>
>sh> psql "host=foo hostaddr=127.0.0.1"
>
>psql> \conninfo
>You are connected to database "fabien" as user "fabien" on host "foo" 
> (address "127.0.0.1") at port "5432".
># better

Nevertheless, that seems like a reasonable change to the output.  Will
your patch show the IP address in all cases or only when hostaddr is
specified?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PostgreSQL Limits and lack of documentation about them.

2018-11-07 Thread Robert Haas
On Tue, Nov 6, 2018 at 6:01 AM John Naylor  wrote:
> On 11/1/18, David Rowley  wrote:
> > I've attached an updated patch, again it's just intended as an aid for
> > discussions at this stage. Also included the rendered html.
>
> Looks good so far. Based on experimentation with toasted columns, it
> seems the largest row size is 452GB, but I haven't tried that on my
> laptop. :-) As for the number-of-column limits, it's a matter of how
> much detail we want to include. With all the numbers in my previous
> email, that could probably use its own table if we include them all.

There are a lot of variables here.  A particular row size may work for
one encoding and not for another.

IMHO, documenting that you can get up to 1600 integer columns but only
1002 bigint columns doesn't really help anybody, because nobody has a
table with only one type of column, and people usually want to have
some latitude to run ALTER TABLE commands later.

It might be useful for some users to explain that certain things should
work for values < X, may work for values between X and Y, and
will definitely not work above Y.  Or maybe we can provide a narrative
explanation rather than just a table of numbers.  Or both.  But I
think trying to provide a table of exact cutoffs is sort of like
tilting at windmills.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-07 Thread Peter Geoghegan
On Tue, Nov 6, 2018 at 10:05 AM chenhj  wrote:
> I  analyzed the btree block where lwlock deadlock occurred, as follows:

Thank you for doing this important work.

You're using Postgres 10.2. While that version isn't current with all
GIN bug fixes, it does have this important one:

"Ensure that vacuum will always clean up the pending-insertions list
of a GIN index (Masahiko Sawada)"

Later GIN fixes seem unlikely to be relevant to your issue. I think
that this is probably a genuine, new bug.

> The ginInsertValue() function above gets the lwlock in the order described in 
> the README.

> However, ginScanToDelete() depth-first scans the btree and gets the EXCLUSIVE 
> lock, which creates a deadlock.
> Is the above statement correct? If so, deadlocks should easily happen.

I have been suspicious of deadlock hazards in the code for some time
-- particularly around pending list cleanup. I go into a lot of detail
on my suspicions here:

https://www.postgresql.org/message-id/flat/CAH2-WzmfUpRjWcUq3%2B9ijyum4AJ%2Bk-meGT8_HnxBMpKz1aNS-g%40mail.gmail.com#ea5af1088adfacb3d0ba88313be1a5e3

I note that your first deadlock involve the following kinds of backends:

* ginInsertCleanup() calls from a regular backend, which will have a
backend do things that VACUUM generally only gets to do, like call
RecordFreeIndexPage().

* (auto)VACUUM processes.

Your second/recovery deadlock involves:

* Regular read-only queries.

* Recovery code.

Quite a lot of stuff is involved here!

The code in this area is way too complicated, and I haven't thought
about it in about a year, so it's hard for me to be sure of anything
at the moment. My guess is that there is confusion about the type of
page expected within one or more blocks (e.g. entry tree vs. pending
list), due to a race condition in block deletion and/or recycling --
again, I've suspected something like this could happen for some time.
The fact that you get a distinct deadlock during recovery is
consistent with that theory.

It's safe to say that promoting the asserts on gin page type into
"can't happen" elog errors in places like scanPendingInsert() and
ginInsertCleanup() would be a good start. Note that we already did
similar Assert-elog(ERROR) promotion for the posting tree code, when
similar bugs appeared way back in 2013.
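
Something along these lines, say inside ginInsertCleanup() (placement
and wording here are illustrative only):

	if (!GinPageIsList(page))
		elog(ERROR, "unexpected GIN page type at block %u of pending list",
			 BufferGetBlockNumber(buffer));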

-- 
Peter Geoghegan



Re: partitioned indexes and tablespaces

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 2:33 PM Tomas Vondra
 wrote:
> FWIW, it was mentioned that "your only concurring vote came from someone
> with whom you share an employer", which kind of suggests opinions/votes
> from people working for the same company are somehow less
> honest/valuable. I find that annoying and even insulting, because it
> kinda hints that the company (or companies) are pushing people to
> respond differently than they would otherwise.

It doesn't have to be companies exerting overt pressure on their
employees.  It's natural that people are going to have a sympathetic
view of patches written by people with whom they work on a day-to-day
basis.  And it's not even a bad thing; having friends and colleagues
whom we trust is good.  Still, it's impossible to be be sure that
you're reacting in exactly the same way to a patch by someone you know
and trust as you would to a patch written by a stranger, which is why
when Amit and I recently posted some reviews of zheap-related patches,
we inserted disclaimers into the first paragraph that we are not
independent of those patches and that independent review is desirable.
We did that because *we know* that we may be biased and we want to be
fair about that and on guard against it.  I don't think asking other
people to be similarly aware should be annoying or insulting.  If your
Mom decided to take up PostgreSQL hacking and posted a patch here, can
you really say you'd evaluate that patch with the exact same
objectivity that you'd apply to a patch written by somebody you'd
never met before?  Whether you have a good relationship with your Mom
or a terrible one, that seems like an impossible standard for any
normal human being to meet.

With regard to this patch, I think the new behavior is fine in and of
itself, but I *do not* think it should have been back-patched and I
*do not* think it should work one way for tables and another for
indexes.  And, regardless of the technical merits, I strongly object
to things to which multiple people are objecting being jammed into a
back-branch just before a release wraps.  That's just not cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-07 Thread Jesper Pedersen

Hi David,

On 11/7/18 6:46 AM, David Rowley wrote:
> I've attached a patch which does this.  It adds a new struct named
> PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays
> out of PartitionTupleRouting. There are fields for each of what these
> arrays used to store inside the PartitionRoutingInfo struct.
>
> While doing this I kept glancing back over at ModifyTableState and at
> the mt_per_subplan_tupconv_maps array. I wondered if it would be
> better to make the PartitionRoutingInfo a bit more generic, perhaps
> call it TupleConversionInfo and have fields like ti_ToGeneric and
> ti_FromGeneric, with the idea that "generic" would be the root
> partition or the first subplan for inheritance updates. This would
> allow us to get rid of a good chunk of code inside nodeModifyTable.c.
> However, I ended up not doing this and left PartitionRoutingInfo to be
> specifically for Root to Partition conversion.

Yeah, it doesn't necessarily have to be part of this patch.

> Also, on the topic about what to name the conversion maps from a few
> days ago; After looking at this a bit more I decided that having them
> named ParentToChild and ChildToParent is misleading.  If the child is
> the child of some sub-partitioned table then the parent that the map
> is talking about is not the partition's parent, but the root
> partitioned table. So really RootToPartition and PartitionToRoot seem
> like much more accurate names for the maps.

Agreed.

> I made a few other changes along the way; I thought that
> ExecFindPartition() would be a good candidate to take on the
> responsibility of checking that the partition is valid for INSERTs when
> it uses a partition out of the subplan_resultrel_hash. I thought it
> was better to check this once when we're in the code path of grabbing
> the ResultRelInfo out of that hash table rather than in a code path that
> must check if it's been done already each time we route a tuple into a
> partition. It also allowed me to get rid of
> ri_PartitionReadyForRouting. I also moved the call to
> ExecInitRoutingInfo() into ExecFindPartition() which allowed me to
> make that function static, which could result in the generation of
> slightly more optimal compiled code.
>
> Please find attached the v14 patch.

Passes check-world, and has detailed documentation about the changes :)

Best regards,
 Jesper



Re: partitioned indexes and tablespaces

2018-11-07 Thread Tomas Vondra



On 11/7/18 7:41 PM, Robert Haas wrote:
> On Wed, Nov 7, 2018 at 1:22 PM Alvaro Herrera  
> wrote:
>>> But maybe you've adopted that policy already.  You back-patched a
>>> behavior change 2 days before a minor release when the vote was 2-3
>>> against the change.
>>
>> It was?  This is my count:
>> For: Alvaro, Andrew, Tom
>> Against: Michael, Robert, Andres
> 
> Tom's message was posted after you had already committed it.  His
> vote counts from the point of view of determining retrospectively
> whether your action had support, but you can't use it to justify the
> decision to push the commit when you did, unless you used a time
> machine to foresee his message before he posted it.
> 

Yeah. I think the change is OK from a technical POV - it essentially fixes
behavior that is useless/surprising and would just result in raised
eyebrows and bogus bug reports. No problems here.

But the consensus probably was not there when it got pushed ...


>> Also, I contested every point that was raised about this patch.  I 
>> don't think there were any remaining technical objections.
> 
> Sure, but I don't think the standard is whether you told people that 
> they were wrong.  I think the standard is whether you convinced them 
> to revise their position.  You sure haven't convinced me.
> 

Yeah. While I think the objections were wrong, and Alvaro explained that
pretty well in his response, there should have been more time for the
OPs to respond before pushing the change. That's not great.

FWIW, it was mentioned that "your only concurring vote came from someone
with whom you share an employer", which kind of suggests opinions/votes
from people working for the same company are somehow less
honest/valuable. I find that annoying and even insulting, because it
kinda hints that the company (or companies) are pushing people to
respond differently than they would otherwise.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Surjective functional indexes

2018-11-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, May 11, 2018 at 11:01 AM, Simon Riggs  wrote:
>>>> I have no problem if you want to replace this with an even better
>>>> design in a later release.

>>> Meh. The author / committer should get a patch into the right shape

>> They have done, at length. Claiming otherwise is just trash talk.

> Some people might call it "honest disagreement".

So where we are today is that this patch was lobotomized by commits
77366d90f/d06fe6ce2 as a result of this bug report:

https://postgr.es/m/20181106185255.776mstcyehnc63ty@alvherre.pgsql

We need to move forward, either by undertaking a more extensive
clean-out, or by finding a path to a version of the code that is
satisfactory.  I wanted to enumerate my concerns while yesterday's
events are still fresh in mind.  (Andres or Robert might have more.)

* I do not understand why this feature is on-by-default in the first
place.  It can only be a win for expression indexes that are many-to-one
mappings; for indexes that are one-to-one or few-to-one, it's a pretty
big loss.  I see no reason to assume that most expression indexes fall
into the first category.  I suggest that the design ought to be to use
this optimization only for indexes for which the user has explicitly
enabled recheck_on_update.  That would allow getting rid of the cost check
in IsProjectionFunctionalIndex, about which we'd otherwise have to have
an additional fight: I do not like its ad-hoc-ness, nor the modularity
violation (and potential circularity) involved in having the relcache call
cost_qual_eval.

* Having heapam.c calling the executor also seems like a
modularity/layering violation that will bite us in the future.

* The implementation seems incredibly inefficient.  ProjIndexIsUnchanged
is doing creation/destruction of an EState, index_open/index_close
(including acquisition of a lock we should already have), BuildIndexInfo,
and expression compilation, all of which are completely redundant with
work done elsewhere in the executor.  And it's doing that *over again for
every tuple*, which totally aside from wasted cycles probably means there
are large query-lifespan memory leaks in an UPDATE affecting many rows.
I think a minimum expectation should be that one-time work is done only
one time; but ideally none of those things would happen at all because
we could share work with the regular code path.  Perhaps it's too much
to hope that we could also avoid duplicate computation of the new index
expression values, but as long as I'm listing complaints ...

* As noted in the bug thread, the implementation of the new reloption
is broken because (a) it fails to work for some built-in index AMs
and (b) by design, it can't work for add-on AMs.  I agree with Andres
that that's not acceptable.

* This seems like bad data structure design:

+   Bitmapset  *rd_projidx; /* Oids of projection indexes */

That comment is a lie, although IMO it'd be better if it were true;
a list of target index OIDs would be a far better design here.  The use
of rd_projidx as a set of indexes into the relation's indexlist is
inefficient and overcomplicated, plus it creates an unnecessary and not
very safe (even if it were documented) coupling between rd_indexlist and
rd_projidx.

* Having ProjIndexIsUnchanged examine relation->rd_projidx directly is
broken by design anyway, both from a modularity standpoint and because
its inner loop involves catalog accesses that could result in relcache
flushes.  I'm somewhat amazed that the regression tests passed on
CLOBBER_CACHE_ALWAYS buildfarm animals, although possibly this is
explained by the fact that they're only testing cases with a single
expression index, so that the bitmap isn't checked again after the cache
flush happens.  Again, this could be fixed if what was returned was just
a list of relevant index OIDs.

* This bit of coding is unsafe, for the reason explained in the existing
comment:

 /*
  * Now save copies of the bitmaps in the relcache entry.  We intentionally
  * set rd_indexattr last, because that's the one that signals validity of
  * the values; if we run out of memory before making that copy, we won't
  * leave the relcache entry looking like the other ones are valid but
  * empty.
  */
 oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 relation->rd_keyattr = bms_copy(uindexattrs);
 relation->rd_pkattr = bms_copy(pkindexattrs);
 relation->rd_idattr = bms_copy(idindexattrs);
 relation->rd_indexattr = bms_copy(indexattrs);
+relation->rd_projindexattr = bms_copy(projindexattrs);
+relation->rd_projidx = bms_copy(projindexes);
 MemoryContextSwitchTo(oldcxt);
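
(The minimal fix, for what it's worth, is just to keep rd_indexattr as
the last assignment:

    relation->rd_projindexattr = bms_copy(projindexattrs);
    relation->rd_projidx = bms_copy(projindexes);
    relation->rd_indexattr = bms_copy(indexattrs);  /* validity marker, last */

so that a failure partway through never leaves the entry looking
valid.)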

* There's a lot of other inattention to comments.  For example, I noticed
that this patch added new responsibilities to RelationGetIndexList without
updating its header comment to mention them.

* There's a lack of attention to terminology, too.  I do not think that
"projection index" is an appropriate term at all, 

Re: Connection limit doesn't work for superuser

2018-11-07 Thread Joshua D. Drake

On 11/7/18 10:49 AM, Robert Haas wrote:
> On Wed, Nov 7, 2018 at 1:14 PM Tom Lane  wrote:
>> I think that having superusers be immune to datconnlimit is actually
>> the right thing; for one reason, because datconnlimit can be set by
>> database owners, who should not be able to lock superusers out of
>> their database.
>
> Yeah, that's a reasonable argument, although they'd also be locking
> themselves out of the database, and the superuser could undo it by
> connecting to some other database.
>
>> If people are okay with having rolconnlimit act
>> differently from datconnlimit in this respect, then I'll withdraw
>> my objection.
>
> Is there any particular reason why they should be consistent?  It's
> not obvious to me, but sometimes I'm dumb.


IMO, superusers should only be affected by
superuser_reserved_connections. Otherwise we are getting into
fine-grained territory full of potential foot-guns.

JD

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *




Re: Connection limit doesn't work for superuser

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 1:14 PM Tom Lane  wrote:
> I think that having superusers be immune to datconnlimit is actually
> the right thing; for one reason, because datconnlimit can be set by
> database owners, who should not be able to lock superusers out of
> their database.

Yeah, that's a reasonable argument, although they'd also be locking
themselves out of the database, and the superuser could undo it by
connecting to some other database.

> If people are okay with having rolconnlimit act
> differently from datconnlimit in this respect, then I'll withdraw
> my objection.

Is there any particular reason why they should be consistent?  It's
not obvious to me, but sometimes I'm dumb.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: partitioned indexes and tablespaces

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 1:22 PM Alvaro Herrera  wrote:
> > But maybe you've adopted that policy already.  You back-patched a
> > behavior change 2 days before a minor release when the vote was 2-3
> > against the change.
>
> It was?  This is my count:
> For: Alvaro, Andrew, Tom
> Against: Michael, Robert, Andres

Tom's message was posted after you had already committed it.  His vote
counts from the point of view of determining retrospectively whether
your action had support, but you can't use it to justify the decision
to push the commit when you did, unless you used a time machine to
foresee his message before he posted it.

> Also, I contested every point that was raised about this patch.  I don't
> think there were any remaining technical objections.

Sure, but I don't think the standard is whether you told people that
they were wrong.  I think the standard is whether you convinced them
to revise their position.  You sure haven't convinced me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 12:58 PM Alvaro Herrera  wrote:
> On 2018-Nov-06, Robert Haas wrote:
> > - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> > both call RelationGetPartitionDesc.
>
> My patch deals with this by caching descriptors in the active snapshot.
> So those two things would get the same partition descriptor.  There's no
> RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its
> place.
>
> (I tried to use different scoping than the active snapshot; I first
> tried the Portal, then I tried the resource owner.  But nothing seems to
> fit as precisely as the active snapshot.)
...
> In other words, I already solved these problems you list.
>
> Maybe you could give my patch a look.

I have, a bit.  One problem I'm having is that while you explained the
design you chose in a fair amount of detail, you didn't give a lot of
explanation (that I have seen) of the reasons why you chose that
design.  If there's a README or a particularly good comment someplace
that I should be reading to understand that better, please point me in
the right direction.

And also, I just don't really understand what all the problems are
yet.  I'm only starting to study this.

I am a bit skeptical of your approach, though.  Tying it to the active
snapshot seems like an awfully big hammer.  Snapshot manipulation can
be a performance bottleneck both in terms of actual performance and
also in terms of code complexity, and I don't really like the idea of
adding more code there.  It's not a sustainable pattern for making DDL
work concurrently, either -- I'm pretty sure we don't want to add new
code to things like GetLatestSnapshot() every time we want to make a
new kind of DDL concurrent.  Also, while hash table lookups are pretty
cheap, they're not free.  In my opinion, to the extent that we can, it
would be better to refactor things to avoid duplicate lookups of the
PartitionDesc rather than to install a new subsystem that tries to
make sure they always return the same answer.

Such an approach seems to have other possible advantages.  For
example, if a COPY is running and a new partition shows up, we might
actually want to allow tuples to be routed to it.  Maybe that's too
pie in the sky, but if we want to preserve the option to do such
things in the future, a hard-and-fast rule that the apparent partition
descriptor doesn't change unless the snapshot changes seems like it
might get in the way.  It seems better to me to have a system where
code that accesses the relcache has a choice, so that at its option it
can either hang on to the PartitionDesc it has or get a new one that
may be different.  If we can do things that way, it gives us the most
flexibility.

After the poking around I've done over the last 24 hours, I do see
that there are some non-trivial problems with making it that way, but
I'm not really ready to give up yet.

Does that make sense to you, or am I all wet here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Generic type subscripting

2018-11-07 Thread Dmitry Dolgov
> On Wed, 7 Nov 2018 at 17:09, Pavel Stehule  wrote:
>
> I don't agree. If we use the same syntax for some object types, we
> should enforce some consistency.

Just to make it clear, consistency between what?

> I don't think you should introduce nulls for JSONs. In this case, the
> most correct solution is raising an exception.

Now it's my turn to disagree. As an argument I have this thread [1], where
similar discussion happened about flexibility of jsonb and throwing an errors
(in this particular case whether or not to throw an error when a non existing
path was given to jsonb_set).

I can imagine a significant number of use cases where adding a value to
jsonb like that is the desirable outcome, and I'm not sure I can come up
with an example where strictness is the best result. Maybe if you have
something in mind, you can describe what that case would be? Also, as
I've mentioned before, consistency between jsonb_set and the jsonb
subscripting operator will help us avoid tons of questions about why
something can be done using one option but not the other.

[1]: 
https://www.postgresql.org/message-id/CAM3SWZT3uZ7aFktx-nNEWGbapN1oy2t2gt10pnOzygZys_Ak1Q%40mail.gmail.com



Re: partitioned indexes and tablespaces

2018-11-07 Thread Alvaro Herrera
On 2018-Nov-07, Robert Haas wrote:

> On Wed, Nov 7, 2018 at 10:46 AM Alvaro Herrera  
> wrote:
> > I think 11.0 is ready for testing that a migration from a production
> > running 10.x, but not for just blindly migrating.  If you wanted to take
> > such a leap of faith, surely you'd wait for 11.1 at the very least.
> 
> I think that's an irresponsible attitude for a committer to take.  In
> practice, you are probably right, but we shouldn't treat our
> supposedly-stable releases as if they don't really need to be kept
> stable.

I take back that part actually, in the sense that I certainly wouldn't
push patches that I didn't think were good, reasonable bugfixes the week
before a release.

But, again, I don't think I was changing any behavior that anybody was
relying on.  Had anybody tried this case, they would have immediately
complained that it didn't work the way they would expect, like #14590.

> But maybe you've adopted that policy already.  You back-patched a
> behavior change 2 days before a minor release when the vote was 2-3
> against the change.

It was?  This is my count:
For: Alvaro, Andrew, Tom
Against: Michael, Robert, Andres

Also, I contested every point that was raised about this patch.  I don't
think there were any remaining technical objections.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Connection limit doesn't work for superuser

2018-11-07 Thread Tom Lane
"David G. Johnston"  writes:
> On the accept side, which I'm leaning toward, is that superuser is
> already constrained by max_connections and, in addition, the
> implications of setting this value are straight-forward and it obvious
> requires intent on the part of the user.  Its not a "foot-gun" in the
> sense that there are side-effects that the user isn't going to be
> aware of by having this feature in place - it does exactly what the
> label says it does.

That's a fair point, and certainly if we allow and enforce "alter
user postgres nologin" (which we do), it's odd to not enforce
connection limit.  However, looking at the code, it's a little
clearer why it was done that way: it's for consistency with the
behavior of datconnlimit (per-database connection limits).
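
For reference, the datconnlimit check in CheckMyDatabase() looks
roughly like this -- note the explicit superuser exemption:

	if (dbform->datconnlimit >= 0 &&
		!am_superuser &&
		CountDBBackends(MyDatabaseId) > dbform->datconnlimit)
		ereport(FATAL,
				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
				 errmsg("too many connections for database \"%s\"", name)));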

I think that having superusers be immune to datconnlimit is actually
the right thing; for one reason, because datconnlimit can be set by
database owners, who should not be able to lock superusers out of
their database.  If people are okay with having rolconnlimit act
differently from datconnlimit in this respect, then I'll withdraw
my objection.

regards, tom lane



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-11-07 Thread Tom Lane
Peter Eisentraut  writes:
> Committed, thanks.

It seems unfortunate that there is no test coverage in the committed
patch.  I realize that it may be hard to test given the filesystem
dependency, but how will we know if it works at all?

regards, tom lane



Re: Does slot_deform_tuple need to care about dropped columns?

2018-11-07 Thread Tom Lane
Andres Freund  writes:
> ... in the case the attribute isn't already deformed, the
> following hunk exists:

>   /*
>* If the attribute's column has been dropped, we force a NULL result.
>* This case should not happen in normal use, but it could happen if we
>* are executing a plan cached before the column was dropped.
>*/
>   if (TupleDescAttr(tupleDesc, attnum - 1)->attisdropped)
>   {
>   *isnull = true;
>   return (Datum) 0;
>   }

> Which strikes me as quite odd. If somebody previously accessed a *later*
> column (be it via slot_getattr, or slot_getsomeattrs), the whole
> attisdropped check is neutralized.

Good point.  Let's remove it and see what happens.

> Tom, you added that code way back when, in a9b05bdc8330. And as far as I
> can tell that issue existed back then too.

I was just transposing code that had existed before that in ExecEvalVar.
Evidently I didn't think hard about whether the protection was
bulletproof.  But since it isn't, maybe we don't need it at all.
I think our checks for obsoleted plans are a lot more bulletproof
than they were back then, so it's entirely likely the issue is moot.

regards, tom lane



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-07 Thread Alvaro Herrera
On 2018-Nov-06, Robert Haas wrote:

> - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> both call RelationGetPartitionDesc.

My patch deals with this by caching descriptors in the active snapshot.
So those two things would get the same partition descriptor.  There's no
RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its
place.

(I tried to use different scoping than the active snapshot; I first
tried the Portal, then I tried the resource owner.  But nothing seems to
fit as precisely as the active snapshot.)

> - expand_inherited_rtentry checks
> RelationGetPartitionDesc(oldrelation) != NULL.  If so, it calls
> expand_partitioned_rtentry which fetches the same PartitionDesc again.

This can be solved by changing the test to a relkind one, as my patch
does.

> - set_relation_partition_info also calls RelationGetPartitionDesc.
> Off-hand, I think this code runs after expand_inherited_rtentry.  Not
> sure what to do about this.  I'm not sure what the consequences would
> be if this function and that one had different ideas about the
> partition descriptor.

Snapshot caching, like in my patch, again solves this problem.

> - tablecmds.c is pretty free about calling RelationGetPartitionDesc
> repeatedly, but it probably doesn't matter.  If we're doing some kind
> of DDL that depends on the contents of the partition descriptor, we
> *had better* be holding a lock strong enough to prevent the partition
> descriptor from being changed by somebody else at the same time.

My patch deals with this by unlinking the partcache entry from the hash
table on relation invalidation, so DDL code would obtain a fresh copy
each time (lookup_partcache_entry).

In other words, I already solved these problems you list.

Maybe you could give my patch a look.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Does slot_deform_tuple need to care about dropped columns?

2018-11-07 Thread Andres Freund
Hi,

Currently functions like slot_getattr() first check if the attribute is
already deformed:

Datum
slot_getattr(TupleTableSlot *slot, int attnum, bool *isnull)
{
...
/*
 * fast path if desired attribute already cached
 */
if (attnum <= slot->tts_nvalid)
{
*isnull = slot->tts_isnull[attnum - 1];
return slot->tts_values[attnum - 1];
}
...

but later, in the case the attribute isn't already deformed, the
following hunk exists:

/*
 * If the attribute's column has been dropped, we force a NULL result.
 * This case should not happen in normal use, but it could happen if we
 * are executing a plan cached before the column was dropped.
 */
if (TupleDescAttr(tupleDesc, attnum - 1)->attisdropped)
{
*isnull = true;
return (Datum) 0;
}

Which strikes me as quite odd. If somebody previously accessed a *later*
column (be it via slot_getattr, or slot_getsomeattrs), the whole
attisdropped check is neutralized.

I think we either should remove that check as unnecessary, or move it to
slot_deform_tuple(), so it also protects other accesses (like the very
very common direct access to tts_values/isnull).
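
If we move it, one possible shape is a fix-up pass after the regular
deforming loop (just a sketch -- the physical data still has to be
walked first to keep the offsets right, since dropped columns' values
remain present in old tuples):

/* In slot_deform_tuple(), after filling values/isnull up to natts: */
for (int i = 0; i < natts; i++)
{
	if (TupleDescAttr(tupleDesc, i)->attisdropped)
	{
		slot->tts_values[i] = (Datum) 0;
		slot->tts_isnull[i] = true;
	}
}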

Tom, you added that code way back when, in a9b05bdc8330. And as far as I
can tell that issue existed back then too.

Greetings,

Andres Freund



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-11-07 Thread Peter Eisentraut
On 19/10/2018 01:50, Michael Paquier wrote:
> On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:
>> New patch that removes all the various reflink modes and adds a new
>> option --clone that works similar to --link.  I think it's much cleaner now.
> 
> Thanks Peter for the new version.
> 
> +
> + {"clone", no_argument, NULL, 1},
> +
>   {NULL, 0, NULL, 0}
> 
> Why newlines here?

fixed

> @@ -293,6 +300,7 @@ usage(void)
>   printf(_("  -U, --username=NAME   cluster superuser (default 
> \"%s\")\n"), os_info.user);
>   printf(_("  -v, --verbose enable verbose internal 
> logging\n"));
>   printf(_("  -V, --version display version information, 
> then exit\n"));
> + printf(_("  --clone   clone instead of copying 
> files to new cluster\n"));
>   printf(_("  -?, --helpshow this help, then 
> exit\n"));
>   printf(_("\n"
> 
> An idea for a one-letter option could be -n.  This patch can live
> without.

-n is often used for something like "dry run", so I didn't go for that
here.  I suspect the cloning will remain a marginal option for some
time, so having only a long option is acceptable.

> + pg_fatal("error while cloning relation \"%s.%s\": could not open file 
> \"%s\": %s\n",
> +  schemaName, relName, src, strerror(errno));
> 
> The tail of those error messages "could not open file" and "could not
> create file" are already available as translatable error messages.
> Would it be better to split those messages in two for translators?  One
> is generated with pg_fatal("error while cloning relation \"%s.%s\":
> could not open file \"%s\": %s\n",
> +  schemaName, relName, src, strerror(errno));

I think this is too complicated for a few messages.

> Those are all minor points, the patch looks good to me.

Committed, thanks.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Calculate total_table_pages after set_base_rel_sizes()

2018-11-07 Thread Tom Lane
David Rowley  writes:
> [ v2-0001-Calculate-total_table_pages-after-set_base_rel_si.patch ]

Pushed with cosmetic adjustments.  The reason it's okay to not check for
appendrels in this loop is now quite different from the reason it was okay
before, so I didn't like the fact that you'd just cut-and-pasted the same
comment.  Also, you did s/brel/rel/ in the transferred code, which I would
not have objected to except you were inserting it into a function that
(a) already had a local variable "rel", causing a shadowing situation,
and (b) had an identical loop just above this that used the "brel"
notation.  So I changed that back.

regards, tom lane



Re: fix psql \conninfo & \connect when using hostaddr

2018-11-07 Thread Pavel Stehule
On Wed, Nov 7, 2018 at 15:11, Arthur Zakirov wrote:

> Hello all,
>
> On 07.11.2018 16:23, Pavel Stehule wrote:
> > On Fri, Oct 26, 2018 at 15:55, Fabien COELHO wrote:
> > >
> > > This is a follow-up to another patch I posted about libpq confusing
> > > documentation & psql resulting behavior under host/hostaddr
> settings.
> > >
> > > Although the first mostly documentation patch did not gather much
> > > enthousiasm, I still think both issues deserve a fix.
> >
> > I checked this patch, and it looks well. The documentation is correct,
> > all tests passed. It does what is proposed.
> >
> > I think so some redundant messages can be reduced - see function
> > printConnInfo - attached patch
>
> I have few notes about patches.
>
> > /* If the host is an absolute path, the connection is via socket unless
> >  * overriden by hostaddr */
> > if (is_absolute_path(host))
> > 	if (hostaddr && *hostaddr)
> > 		printf(_("You are connected to database \"%s\" as user \"%s\" on address \"%s\" at port \"%s\".\n"),
> > 			   db, PQuser(pset.db), hostaddr, PQport(pset.db));
> > 	else
> > 		printf(_("You are connected to database \"%s\" as user \"%s\" via socket in \"%s\" at port \"%s\".\n"),
> > 			   db, PQuser(pset.db), host, PQport(pset.db));
> > else
> > 	if (hostaddr && *hostaddr && strcmp(host, hostaddr) != 0)
> > 		printf(_("You are connected to database \"%s\" as user \"%s\" on host \"%s\" (address \"%s\") at port \"%s\".\n"),
> > 			   db, PQuser(pset.db), host, hostaddr, PQport(pset.db));
> > 	else
> > 		printf(_("You are connected to database \"%s\" as user \"%s\" on host \"%s\" at port \"%s\".\n"),
> > 			   db, PQuser(pset.db), host, PQport(pset.db));
>
> I think there is a lack of the necessary braces here for the first if and
> second else branches. This is true for both patches.
>

?

Pavel

>
> > /* set connip */
> > if (conn->connip != NULL)
> > {
> >   free(conn->connip);
> >   conn->connip = NULL;
> > }
> >
> > {
> >   char	host_addr[NI_MAXHOST];
> >   getHostaddr(conn, host_addr);
> >   if (strcmp(host_addr, "???") != 0)
> >   conn->connip = strdup(host_addr);
> > }
>
> Maybe it is better to move this code into the PQhostaddr() function?
> Since connip is used only in PQhostaddr() it might be not worth to do
> additional work in main PQconnectPoll().
>
> --
> Arthur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>


Re: valgrind on initdb

2018-11-07 Thread Andres Freund
On 2018-11-07 09:37:37 -0500, Andrew Dunstan wrote:
> 
> On 11/7/18 9:11 AM, Tomas Vondra wrote:
> > 
> > On 11/7/18 2:47 PM, John Naylor wrote:
> > > On 11/7/18, Jesper Pedersen  wrote:
> > > > Hi,
> > > > 
> > > > While looking at [1] (included in 23315.log) there are other warnings as
> > > > well.
> > > Perhaps it's worth revisiting to make debugging easier, but right now
> > > initdb.c has this comment:
> > > 
> > >   * Note:
> > >   *The program has some memory leakage - it isn't worth cleaning 
> > > it up.
> > > 
> > Maybe. I certainly don't think it's worth the time merely for the
> > sake of fixing the memory leaks. The reasoning here is that initdb runs
> > for a short period of time (a couple of seconds, really), and the memory
> > gets released when the process exits. And the leaks are tiny in general
> > - a couple of bytes here and there. Had there been a massive memory leak
> > that would change the equation of course.
> > 
> 
> Yeah, I'm pretty sure I wrote that comment 15 or so years ago in the first C
> implementation of initdb. I don't think my opinion has changed much. We're
> talking about kilobytes, here, nothing massive.

Yea, I'm opposed to fixing this by manually doing lotsa pfrees. We don't
do that anywhere for process-lifetime allocations, and it'd not actually
be useful.

I think if we ever make mcxt style memory management usable for
frontends, the story could be a bit different. I could e.g. see having
one statically initialized context that contains most/all program
lifetime allocations. But that'd be just a minor positive side-effect.

Greetings,

Andres Freund



Re: partitioned indexes and tablespaces

2018-11-07 Thread Andres Freund
On 2018-11-07 11:19:53 -0500, Robert Haas wrote:
> On Wed, Nov 7, 2018 at 10:46 AM Alvaro Herrera  
> wrote:
> > I think 11.0 is ready for testing a migration from a production
> > running 10.x, but not for just blindly migrating.  If you wanted to take
> > such a leap of faith, surely you'd wait for 11.1 at the very least.
> 
> I think that's an irresponsible attitude for a committer to take.  In
> practice, you are probably right, but we shouldn't treat our
> supposedly-stable releases as if they don't really need to be kept
> stable.  That's a recipe for anybody back-patching anything they feel
> like whenever they like, which is a recipe for disaster.

+1

Greetings,

Andres Freund



Re: csv format for psql

2018-11-07 Thread Daniel Verite
Michael Paquier wrote:

> -C could also be useful for other things, say compression.

The patch does not grab any short letter, as the consensus
went that way.

> +   pset.popt.topt.fieldSepCsv = pg_strdup(",");
> Let's store that in a variable instead of hardcoding it.

I guess it could go into a #define in psql/settings.h, along with
these:
#define DEFAULT_FIELD_SEP "|"
#define DEFAULT_RECORD_SEP "\n"
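
For instance (DEFAULT_CSV_FIELD_SEP is just a name I am making up here, not
something taken from the patch):

#define DEFAULT_CSV_FIELD_SEP ","

and the hardcoded assignment would then read
pset.popt.topt.fieldSepCsv = pg_strdup(DEFAULT_CSV_FIELD_SEP);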

> In the regression tests, "col 9" is wanted with a newline?

Yes, since we support column names with embedded newlines (even
though it's hard to think of a legit use case for that) and CSV fields
support embedded newlines too.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Fedora 29 vs linux collation tests

2018-11-07 Thread Andrew Dunstan



On 11/7/18 9:01 AM, Andrew Dunstan wrote:


Yesterday I did a long overdue upgrade to the machine that runs 
buildfarm animal crake. After some hiccups it's all working, except 
for the linux utf8 collation tests. See 



where you can see this:

  SET lc_time TO 'tr_TR';
+ ERROR:  invalid value for parameter "lc_time": "tr_TR"


`locale -a | grep tr_TR` on the machine reports:

tr_TR
tr_TR.iso88599
tr_TR.utf8


so I'm not sure what's going on here.





Looks like this is cured by

    sudo dnf install glibc-all-langpacks

Sorry for the noise


cheers


andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Connection limit doesn't work for superuser

2018-11-07 Thread Tomas Vondra


On 11/7/18 5:19 PM, Tom Lane wrote:
>
> ...
> 
> I think this proposal boils down to asking for support for an
> incredibly bad application design, and equipping every database with
> an additional foot-gun in order to have that.
> 

I'm not sure about that. IMHO being able to restrict the number of
superuser connections can be used to force users to use regular
(non-superuser) roles for stuff that does not require that. Which should
encourage better application design.

Of course, the question is whether such limit can actually be enforced
reliably (I mean, can't the superuser simply change it?) and whether
handing over superuser accounts to application users is a good idea in
general ...

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Connection limit doesn't work for superuser

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 11:19 AM Tom Lane  wrote:
> > Like what?
>
> alter user postgres connection limit 0;
>
> ... oops ...

Sure.  If you have no other superusers that's going to be sad.
Hopefully single-user mode lets you recover, though.  And, anyway,
there are plenty of ways for a superuser to break a cluster far worse
than that.

> I'm not buying the argument that there are realistic use-cases where
> you need a connection limit on a superuser role, either.  Whatever
> you're doing that might merit a connection limit should not be done
> as superuser.  I think this proposal boils down to asking for support
> for an incredibly bad application design, and equipping every database
> with an additional foot-gun in order to have that.

I don't agree; that sounds like masterminding to me.  "You shouldn't
want that feature, so we won't give it to you" is not always an
invalid argument, but we ought to tread lightly with it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: partitioned indexes and tablespaces

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 10:46 AM Alvaro Herrera  wrote:
> I think 11.0 is ready for testing that a migration from a production
> running 10.x, but not for just blindly migrating.  If you wanted to take
> such a leap of faith, surely you'd wait for 11.1 at the very least.

I think that's an irresponsible attitude for a committer to take.  In
practice, you are probably right, but we shouldn't treat our
supposedly-stable releases as if they don't really need to be kept
stable.  That's a recipe for anybody back-patching anything they feel
like whenever they like, which is a recipe for disaster.

But maybe you've adopted that policy already.  You back-patched a
behavior change 2 days before a minor release when the vote was 2-3
against the change.  If it matters, the three people opposing it all
work for different companies, wheres your only concurring vote came
from someone with whom you share an employer.  I though the principle
here was that we operate by consensus; that is not a consensus to
proceed at all, let alone to back-patch on short notice.

> > Seeing this stuff discussed and committed in haste gives me the sad
> > face.
>
> Nah.

Sorry, but you don't get to decide what makes other people sad.  I'm
with Michael on this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Connection limit doesn't work for superuser

2018-11-07 Thread Tom Lane
Robert Haas  writes:
> On Wed, Nov 7, 2018 at 9:45 AM Tom Lane  wrote:
>> I'd vote against.  I think there are way more cases where this would
>> create a problem than where it would fix one.

> Like what?

alter user postgres connection limit 0;

... oops ...

I'm not buying the argument that there are realistic use-cases where
you need a connection limit on a superuser role, either.  Whatever
you're doing that might merit a connection limit should not be done
as superuser.  I think this proposal boils down to asking for support
for an incredibly bad application design, and equipping every database
with an additional foot-gun in order to have that.

regards, tom lane



Re: First-draft release notes for back-branch releases

2018-11-07 Thread Jonathan S. Katz
On 11/7/18 1:59 AM, Amit Langote wrote:
> * Ensure that automatically created child indexes are created in the same

Thanks, that is much clearer. I have made that update.

Thanks again for your help!

Jonathan
2018-11-08 Cumulative Update


The PostgreSQL Global Development Group has released an update to all supported versions of our database system, including 11.1, 10.6, 9.6.11, 9.5.15, 9.4.20, and 9.3.25. This release fixes bugs reported over the last three months.

All users of the affected versions of PostgreSQL should update during the next convenient upgrade period. Please see the notes on "Updating" below for any post-update steps that may be required if you are using `pg_stat_statements` in your installation.

This update is also the final release for PostgreSQL 9.3, which is now end-of-life and will no longer receive any bug or security fixes. If your environment still uses PostgreSQL 9.3, please make plans to update to a community supported version as soon as possible. Please see our [versioning policy](https://www.postgresql.org/support/versioning/) for more information.

Bug Fixes and Improvements
--------------------------

This update also fixes numerous bugs that were reported in the last several months. Some of these issues affect only version 11, but many affect all supported versions.

These releases include fixes that:

* Ensure that automatically created child indexes are created in the same tablespace as the parent partitioned index
* Fix several crashes with triggers
* Fix problems with applying `ON COMMIT DELETE ROWS` to a partitioned temporary table
* Fix how `NULL` values are handled when using `LEFT JOIN` with a parallelized hash join
* Several fixes around using named or defaulted arguments in `CALL` statements
* Fix for strict aggregate functions (i.e. aggregates that cannot accept `NULL` inputs) with `ORDER BY` columns, so that the strictness check is properly enforced
* Fix with `CASE` statements where an expression was cast to an array type
* Disable an optimization for updating expression indexes in order to prevent a crash
* Fix a memory leak that occurred on a specific case of using a SP-GiST index
* Fix for `pg_verify_checksums` incorrectly reporting on files that are not expected to have checksums
* Prevent the PostgreSQL server from starting when `wal_level` is set to a value that cannot support an existing replication slot
* Ensure that the server will process already-received NOTIFY and SIGTERM interrupts before waiting for client input
* Allow PL/Ruby to work with newer versions of PostgreSQL
* Fix for character-class checks on Windows for Unicode characters above U+FFFF, which affected full-text search as well as `contrib/ltree` and `contrib/pg_trgm`
* Fix a case where `psql` would not report the receipt of a message from a `NOTIFY` call until after the next command
* Fix build problems on macOS 10.14 (Mojave)
* Several build fixes for the Windows platform

This update also contains tzdata release 2018g for DST law changes in Chile, Fiji, Morocco, and Russia (Volgograd), plus historical corrections for China, Hawaii, Japan, Macau, and North Korea.

PostgreSQL 9.3 is End-of-Life (EOL)
-----------------------------------

PostgreSQL 9.3 is now end-of-life and will no longer receive any bug or security fixes.  We urge users to start planning an upgrade to a later version of PostgreSQL as soon as possible.  Please see our [versioning policy](https://www.postgresql.org/support/versioning/) for more information.

Updating
--------

All PostgreSQL update releases are cumulative. As with other minor releases, users are not required to dump and reload their database or use `pg_upgrade` in order to apply this update release; you may simply shutdown PostgreSQL and update its binaries.

If your system is using `pg_stat_statements` and you are running a version of PostgreSQL 10 or PostgreSQL 11, we advise that you execute the following command after upgrading:

`ALTER EXTENSION pg_stat_statements UPDATE;`

Users who have skipped one or more update releases may need to run additional, post-update steps; please see the release notes for earlier versions for details.

Links
-----
* [Download](https://www.postgresql.org/download/)
* [Release Notes](https://www.postgresql.org/docs/current/release.html)
* [Security Page](https://www.postgresql.org/support/security/)
* [Versioning Policy](https://www.postgresql.org/support/versioning/)
* [Follow @postgresql on Twitter](https://twitter.com/postgresql)


signature.asc
Description: OpenPGP digital signature


Re: plruby: rb_iterate symbol clash with libruby.so

2018-11-07 Thread Pavel Raiskup
On Wednesday, November 7, 2018 4:55:13 PM CET Tom Lane wrote:
> Pavel Raiskup  writes:
> > On Wednesday, November 7, 2018 3:25:31 PM CET Tom Lane wrote:
> >> Yeah, I'm now mighty confused about this as well.  PL/Ruby is pretty old
> >> too, so how come nobody noticed this before?  Is its rb_iterate call in
> >> someplace that hardly gets any use?
> 
> > ... I reproduced this by trigger function from tests/plt test-case
> > on the first try, more info and related fixes in [1].  But it is truth
> > that we haven't run the test-suite for RPM builds so far.
> > Might the reason be that nobody used plruby at all for a very long time?
>
> I'm starting to think that.  I just browsed through the code of plruby,
> and while I don't know that language at all, there are rb_iterate calls
> in places that look like they'd be hard to avoid in typical use.
>
> The fact that the upstream git repo hasn't been touched in nine years
> doesn't exactly indicate a lively project, either :-(
>
> If you guys want to pursue this, I'll finish up the back-patch, but I
> wonder if we're just wasting our time.

FWIW, I plan to help with fixing the plruby project, but to be honest - I
still think that the code can be fixed to not use rb_iterate() anyways.
So back-patch is not needed by me - and certainly not now.

Pavel






Re: Connection limit doesn't work for superuser

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 9:45 AM Tom Lane  wrote:
> Robert Haas  writes:
> > I don't think we should consider something that prevents you from
> > connecting to the database to be in the same category as something
> > that limits what you can do once you are connected.  IOW, +1 to the
> > original proposal from me.
>
> I'd vote against.  I think there are way more cases where this would
> create a problem than where it would fix one.

Like what?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Generic type subscripting

2018-11-07 Thread Pavel Stehule
On Wed, Nov 7, 2018 at 16:25, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Fri, 12 Oct 2018 at 07:52, Pavel Stehule wrote:
> >
> >> > postgres=# insert into test(v) values( '[]');
> >> > INSERT 0 1
> >> > postgres=# update test set v[1000] = 'a';
> >> > UPDATE 1
> >> > postgres=# update test set v[1000] = 'a';
> >> > UPDATE 1
> >> > postgres=# update test set v[1000] = 'a';
> >> > UPDATE 1
> >> > postgres=# select * from test;
> >> > ┌────┬─────────────────┐
> >> > │ id │        v        │
> >> > ╞════╪═════════════════╡
> >> > │    │ ["a", "a", "a"] │
> >> > └────┴─────────────────┘
> >> > (1 row)
> >> >
> >> > It should raise an exception in this case. The current behaviour allows
> >> > appending simply, but it can be a source of errors. For this case we can
> >> > introduce some special symbol - something like -0 :)
> >>
> >> Yeah, it may look strange, but there is a reason behind it. I tried to
> >> keep the behaviour of this feature consistent with the jsonb_set function
> >> (and in fact they're sharing the same functionality). And for jsonb_set
> >> it's documented:
> >>
> >>     If the item (of a path, in our case an index) is out of the range
> >>     -array_length .. array_length -1, and create_missing is true, the new
> >>     value is added at the beginning of the array if the item is negative,
> >>     and at the end of the array if it is positive.
> >>
> >> So, the index 1000 is way above the end of the array v, and every new
> >> item has been appended at the end.
> >>
> >> Of course no one said that they should behave similarly, but I believe
> >> it's quite nice to have consistency here. Any other opinions?
> >
> >
> > Aha - although I understand your motivation, I think it is a bad
> > design - and jsonb_set's behaviour is not a happy one.
> >
> > I think it is a wrong idea, because you lose some information - the
> > field position - I push a value at index 10, but it will be stored at
> > the second position.
>
> The thing is that we don't store the field position in this sense anyway
> in jsonb. For arrays there are dimensions, boundaries and null bitmaps
> stored, but for jsonb it's just an array of elements. If we want to store
> this data, we either have to change the format, or fill in a jsonb with
> null values up to the required position (the first option is out of the
> scope of this patch, the second doesn't sound that good).
>

I don't agree. If we use the same syntax for several object types, we
should enforce some consistency.

I don't think you should introduce nulls for JSONs. In this case, the
most correct solution is raising an exception.

Regards

Pavel


Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-07 Thread Robert Haas
On Tue, Nov 6, 2018 at 5:09 PM Robert Haas  wrote:
> - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> both call RelationGetPartitionDesc.  Presumably, this means that if
> the partition descriptor gets updated on the fly, the tuple routing
> and partition dispatch code could end up with different ideas about
> which partitions exist.  I think this should be fixed somehow, so that
> we only call RelationGetPartitionDesc once per query and use the
> result for everything.

I think there is deeper trouble here.
ExecSetupPartitionTupleRouting() calls find_all_inheritors() to
acquire RowExclusiveLock on the whole partitioning hierarchy.  It then
calls RelationGetPartitionDispatchInfo (as a non-relcache function,
this seems poorly named) which calls get_partition_dispatch_recurse,
which does this:

/*
 * We assume all tables in the partition tree were already locked
 * by the caller.
 */
Relation	partrel = heap_open(partrelid, NoLock);

That seems OK at present, because no new partitions can have appeared
since ExecSetupPartitionTupleRouting() acquired locks.  But if we
allow new partitions to be added with only ShareUpdateExclusiveLock,
then I think there would be a problem.  If a new partition OID creeps
into the partition descriptor after find_all_inheritors() and before
we fetch its partition descriptor, then we wouldn't have previously
taken a lock on it and would still be attempting to open it without a
lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).

Admittedly, it might be a bit hard to provoke a failure here because
I'm not exactly sure how you could trigger a relcache reload in the
critical window, but I don't think we should rely on that.

More generally, it seems a bit strange that we take the approach of
locking the entire partitioning hierarchy here regardless of which
relations the query actually knows about.  If some relations have been
pruned, presumably we don't need to lock them; if/when we permit
concurrent partition addition, we don't need to lock any new ones that have
materialized.  We're just going to end up ignoring them anyway because
there's nothing to do with the information that they are or are not
excluded from the query when they don't appear in the query plan in
the first place.

Furthermore, this whole thing looks suspiciously like more of the sort
of redundant locking that f2343653f5b2aecfc759f36dbb3fd2a61f36853e
attempted to eliminate.  In light of that commit message, I'm
wondering whether the best approach would be to [1] get rid of the
find_all_inheritors call altogether and [2] somehow ensure that
get_partition_dispatch_recurse() doesn't open any tables that aren't
part of the query's range table.

Thoughts?  Comments?  Ideas?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: plruby: rb_iterate symbol clash with libruby.so

2018-11-07 Thread Tom Lane
Pavel Raiskup  writes:
> On Wednesday, November 7, 2018 3:25:31 PM CET Tom Lane wrote:
>> Yeah, I'm now mighty confused about this as well.  PL/Ruby is pretty old
>> too, so how come nobody noticed this before?  Is its rb_iterate call in
>> someplace that hardly gets any use?

> ... I reproduced this by trigger function from tests/plt test-case
> on the first try, more info and related fixes in [1].  But it is truth
> that we haven't run the test-suite for RPM builds so far.
> Might the reason be that nobody used plruby at all for a very long time?

I'm starting to think that.  I just browsed through the code of plruby,
and while I don't know that language at all, there are rb_iterate calls
in places that look like they'd be hard to avoid in typical use.

The fact that the upstream git repo hasn't been touched in nine years
doesn't exactly indicate a lively project, either :-(

If you guys want to pursue this, I'll finish up the back-patch,
but I wonder if we're just wasting our time.

regards, tom lane



Re: partitioned indexes and tablespaces

2018-11-07 Thread Alvaro Herrera
On 2018-Nov-04, Michael Paquier wrote:

> On Sat, Nov 03, 2018 at 12:30:15PM -0700, Andres Freund wrote:
> > On 2018-11-03 16:24:28 -0300, Alvaro Herrera wrote:
> >> On 2018-Nov-03, Robert Haas wrote:
> >>> Well, you've guaranteed that already.  Now 11 will be different from
> >>> 11.1, and tables will be different from indexes until somebody goes
> >>> and makes that consistent again.
> >> 
> >> Nobody is running 11.0 in production.
> > 
> > Uh, I know of people doing so.
> 
> My understanding of the term GA is that anything marked as GA is
> considered enough stable for production use, and that beta releases are
> here for testing.  So I don't get the argument.

I think 11.0 is ready for testing a migration from a production
running 10.x, but not for just blindly migrating.  If you wanted to take
such a leap of faith, surely you'd wait for 11.1 at the very least.

> And you have made partitioned indexes now inconsistent with
> partitioned tables.

I don't buy this consistency argument.  Partitions on partitioned tables
are not created automatically.  They are for indexes.  We just got bug
report #15490 about the problem I fixed.

> Seeing this stuff discussed and committed in haste gives me the sad
> face.

Nah.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: plruby: rb_iterate symbol clash with libruby.so

2018-11-07 Thread Pavel Raiskup
On Wednesday, November 7, 2018 3:25:31 PM CET Tom Lane wrote:
> Pavel Raiskup  writes:
> > On Tuesday, November 6, 2018 7:28:21 PM CET Tom Lane wrote:
> >> Done.  I realized that the immediate problem, rb_iterate(), was only
> >> added as of PG v10, which may explain why we hadn't heard complaints
> >> about this till now.  So I've made the change only as far back as v10.
> 
> > The 'rb_iterate' seems to exist at least in REL9_2_STABLE branch, so it is
> > probably much older.
> 
> Oh!  Hmm ... I think I jumped to conclusions when the part of my patch
> that touched struct RBTreeIterator failed to apply.  But you're right,
> rb_iterate has been there since 9.0 now that I look more carefully.
> So we really ought to back-patch further.  However:
> 
> > That
> > said, I'm still not sure how this could work before ...  Maybe it has not
> > been working for some time.
> 
> Yeah, I'm now mighty confused about this as well.  PL/Ruby is pretty old
> too, so how come nobody noticed this before?  Is its rb_iterate call in
> someplace that hardly gets any use?

I can not authoritatively answer this (getting familiar with the code) but
many of the rb_iterate() calls were replaced with alternative API (if
available), so you might be right.

That said, I reproduced this by trigger function from tests/plt test-case
on the first try, more info and related fixes in [1].  But it is truth
that we haven't run the test-suite for RPM builds so far.

Might the reason be that nobody used plruby at all for a very long time?

[1] https://github.com/devrimgunduz/postgresql-plruby/pull/3

Pavel






Re: pread() and pwrite()

2018-11-07 Thread Andrew Dunstan



On 11/7/18 10:05 AM, Jesper Pedersen wrote:

Hi Tom,

On 11/7/18 9:30 AM, Tom Lane wrote:
I'm confused by this.  Surely the pwrite-based code is writing exactly the
same data as before.  Do we have to conclude that valgrind is complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?

[ looks ... ]  No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.



The attached patch fixes this for me.

Unfortunately pwrite* doesn't work for the pwrite64(buf) line.





Works for me. If there's no objection I will commit this.


cheers


andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] Generic type subscripting

2018-11-07 Thread Dmitry Dolgov
> On Fri, 12 Oct 2018 at 07:52, Pavel Stehule  wrote:
>
>> > postgres=# insert into test(v) values( '[]');
>> > INSERT 0 1
>> > postgres=# update test set v[1000] = 'a';
>> > UPDATE 1
>> > postgres=# update test set v[1000] = 'a';
>> > UPDATE 1
>> > postgres=# update test set v[1000] = 'a';
>> > UPDATE 1
>> > postgres=# select * from test;
>> > ┌────┬─────────────────┐
>> > │ id │        v        │
>> > ╞════╪═════════════════╡
>> > │    │ ["a", "a", "a"] │
>> > └────┴─────────────────┘
>> > (1 row)
>> >
>> > It should raise an exception in this case. The current behaviour allows
>> > appending simply, but it can be a source of errors. For this case we can
>> > introduce some special symbol - something like -0 :)
>>
>> Yeah, it may look strange, but there is a reason behind it. I tried to keep
>> the behaviour of this feature consistent with the jsonb_set function (and in
>> fact they're sharing the same functionality). And for jsonb_set it's
>> documented:
>>
>>     If the item (of a path, in our case an index) is out of the range
>>     -array_length .. array_length -1, and create_missing is true, the new
>>     value is added at the beginning of the array if the item is negative,
>>     and at the end of the array if it is positive.
>>
>> So, the index 1000 is way above the end of the array v, and every new item
>> has been appended at the end.
>>
>> Of course no one said that they should behave similarly, but I believe it's
>> quite nice to have consistency here. Any other opinions?
>
>
> Aha - although I understand your motivation, I think it is a bad
> design - and jsonb_set's behaviour is not a happy one.
>
> I think it is a wrong idea, because you lose some information - the field
> position - I push a value at index 10, but it will be stored at the second
> position.

The thing is that we don't store the field position in this sense anyway in
jsonb. For arrays there are dimensions, boundaries and null bitmaps stored, but
for jsonb it's just an array of elements. If we want to store this data, we
either have to change the format, or fill in a jsonb with null values up to the
required position (the first option is out of the scope of this patch, the
second doesn't sound that good).



Re: pread() and pwrite()

2018-11-07 Thread Jesper Pedersen

Hi Tom,

On 11/7/18 9:30 AM, Tom Lane wrote:

I'm confused by this.  Surely the pwrite-based code is writing exactly the
same data as before.  Do we have to conclude that valgrind is complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?

[ looks ... ]  No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.



The attached patch fixes this for me.

Unfortunately pwrite* doesn't work for the pwrite64(buf) line.

Best regards,
 Jesper
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index af03051260..2f3b602773 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -52,9 +52,10 @@
 {
 	padding_XLogRecData_write
 	Memcheck:Param
-	write(buf)
+	pwrite64(buf)
 
-	...
+	...
+	fun:pwrite
 	fun:XLogWrite
 }
 


Re: move PartitionBoundInfo creation code

2018-11-07 Thread Alvaro Herrera
On 2018-Nov-07, Michael Paquier wrote:

> On Wed, Nov 07, 2018 at 03:34:38PM +0900, Amit Langote wrote:
> > On 2018/11/05 16:21, Michael Paquier wrote:
> >> On Thu, Nov 01, 2018 at 01:03:00PM +0900, Amit Langote wrote:
> >>> Done a few moments ago. :)
> >> 
> >> From the file size this move is actually negative.  From what I can see
> >> partcache decreases to 400 lines, while partbounds increases to 3k
> >> lines.
> > 
> > Hmm, is that because partbounds.c contains more functionality?
> 
> The move you are doing here makes sense, now refactoring is better if we
> could avoid transforming a large file into a larger one and preserve
> some more balance in the force.

I think the code as it's structured in Amit's patch as a whole makes
sense -- partcache is smaller than partbounds, but why do we care?  As
long as each is a reasonably well defined module, it seems better.  In
the current situation we have a bit of a mess there.  A 3k file is okay.
We're not talking about bloating, say, xlog.c which is almost 400kb now.

$ ls -lh src/backend/access/transam/xlog*
-rw-r--r-- 1 alvherre alvherre  23K oct  4 11:40 src/backend/access/transam/xlogarchive.c
-rw-r--r-- 1 alvherre alvherre 389K nov  7 11:49 src/backend/access/transam/xlog.c
-rw-r--r-- 1 alvherre alvherre  21K nov  3 12:15 src/backend/access/transam/xlogfuncs.c
-rw-r--r-- 1 alvherre alvherre  30K sep  3 12:58 src/backend/access/transam/xloginsert.c
-rw-r--r-- 1 alvherre alvherre  40K sep  3 12:58 src/backend/access/transam/xlogreader.c
-rw-r--r-- 1 alvherre alvherre  32K ago  5 21:34 src/backend/access/transam/xlogutils.c

> > I think we can design the interface of partition_bounds_create such that
> > it returns information needed to re-arrange OIDs to be in the canonical
> > partition bound order, so the actual re-ordering of OIDs can be done by
> > the caller itself.  (It may not always be OIDs, could be something else
> > like a struct to represent fake partitions.)

This is interesting.  I don't think the current interface is so bad that
we can't make a few tweaks later; IOW let's focus on the current patch
for now, and we can improve later.  You (and David, and maybe someone
else) already have half a dozen patches touching this code, and I bet
some of these will have to be rebased over this one.


> Thinking crazy, we could also create a subfolder partitioning/bounds/
> which includes three files with this refactoring:
> - range.c
> - values.c
> - hash.c
> Then we keep partbounds.c which is the entry point used by partcache.c,
> and partbounds.c relies on the stuff in each other sub-file when
> building a strategy.  This move could be interesting in the long term as
> there are comparison routines and structures for each strategy if more
> strategies are added.

I'm not sure this is worthwhile.  You'll create a bunch of independent
translation units in exchange for ...?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pread() and pwrite()

2018-11-07 Thread Andrew Dunstan



On 11/7/18 9:30 AM, Tom Lane wrote:

Andrew Dunstan  writes:

On 11/7/18 7:26 AM, Jesper Pedersen wrote:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2018-11-07%2001%3A01%3A01

And lousyjack, which uses a slightly different way of calling valgrind,
and thus got past initdb, found a bunch more:


I'm confused by this.  Surely the pwrite-based code is writing exactly the
same data as before.  Do we have to conclude that valgrind is complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?

[ looks ... ]  No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.



Yeah. I just trawled through the lousyjack logs and it looks like all 
the cases it reported could be handled by:


{
    padding_XLogRecData_pwrite
    Memcheck:Param
    pwrite64(buf)

    ...
    fun:XLogWrite
}

cheers


andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Connection limit doesn't work for superuser

2018-11-07 Thread Dmitriy Sarafannikov
What about LOGIN option? It is a similar access restriction, but it works for 
superuser.

=# create role nologin_role superuser nologin unencrypted password '1234';
CREATE ROLE
Time: 1.230 ms
~ $ psql postgres -U nologin_role -h localhost
Password for user nologin_role:
psql: FATAL:  role "nologin_role" is not permitted to log in

IMHO it does not fit this logic.

> On Nov 7, 2018, at 17:45, Tom Lane wrote:
> 
> Robert Haas  writes:
>> I don't think we should consider something that prevents you from
>> connecting to the database to be in the same category as something
>> that limits what you can do once you are connected.  IOW, +1 to the
>> original proposal from me.
> 
> I'd vote against.  I think there are way more cases where this would
> create a problem than where it would fix one.
> 
>   regards, tom lane
> 




Re: Connection limit doesn't work for superuser

2018-11-07 Thread Tom Lane
Robert Haas  writes:
> I don't think we should consider something that prevents you from
> connecting to the database to be in the same category as something
> that limits what you can do once you are connected.  IOW, +1 to the
> original proposal from me.

I'd vote against.  I think there are way more cases where this would
create a problem than where it would fix one.

regards, tom lane



Re: valgrind on initdb

2018-11-07 Thread Andrew Dunstan



On 11/7/18 9:11 AM, Tomas Vondra wrote:


On 11/7/18 2:47 PM, John Naylor wrote:

On 11/7/18, Jesper Pedersen  wrote:

Hi,

While looking at [1] (included in 23315.log) there are other warnings as
well.

Perhaps it's worth revisiting to make debugging easier, but right now
initdb.c has this comment:

  * Note:
  *  The program has some memory leakage - it isn't worth cleaning it up.


Maybe. I certainly don't think it's worth the time merely for the
sake of fixing the memory leaks. The reasoning here is that initdb runs
for a short period of time (a couple of seconds, really), and the memory
gets released when the process exits. And the leaks are tiny in general
- a couple of bytes here and there. Had there been a massive memory leak
that would change the equation of course.



Yeah, I'm pretty sure I wrote that comment 15 or so years ago in the 
first C implementation of initdb. I don't think my opinion has changed 
much. We're talking about kilobytes, here, nothing massive.



cheers


andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pread() and pwrite()

2018-11-07 Thread Tom Lane
Andrew Dunstan  writes:
> On 11/7/18 7:26 AM, Jesper Pedersen wrote:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2018-11-07%2001%3A01%3A01

> And lousyjack, which uses a slightly different way of calling valgrind, 
> and thus got past initdb, found a bunch more:
> 

I'm confused by this.  Surely the pwrite-based code is writing exactly the
same data as before.  Do we have to conclude that valgrind is complaining
about passing uninitialized data to pwrite() when it did not complain
about exactly the same thing for write()?

[ looks ... ]  No, what we have to conclude is that the write-related
suppressions in src/tools/valgrind.supp need to be replaced or augmented
with pwrite-related ones.

regards, tom lane



Re: plruby: rb_iterate symbol clash with libruby.so

2018-11-07 Thread Tom Lane
Pavel Raiskup  writes:
> On Tuesday, November 6, 2018 7:28:21 PM CET Tom Lane wrote:
>> Done.  I realized that the immediate problem, rb_iterate(), was only
>> added as of PG v10, which may explain why we hadn't heard complaints
>> about this till now.  So I've made the change only as far back as v10.

> The 'rb_iterate' seems to exist at least in REL9_2_STABLE branch, so it is
> probably much older.

Oh!  Hmm ... I think I jumped to conclusions when the part of my patch
that touched struct RBTreeIterator failed to apply.  But you're right,
rb_iterate has been there since 9.0 now that I look more carefully.
So we really ought to back-patch further.  However:

> That
> said, I'm still not sure how this could work before ...  Maybe it has not
> been working for some time.

Yeah, I'm now mighty confused about this as well.  PL/Ruby is pretty old
too, so how come nobody noticed this before?  Is its rb_iterate call in
someplace that hardly gets any use?

regards, tom lane



Re: fix psql \conninfo & \connect when using hostaddr

2018-11-07 Thread Arthur Zakirov

Hello all,

On 07.11.2018 16:23, Pavel Stehule wrote:
On Fri, Oct 26, 2018 at 15:55, Fabien COELHO wrote:
> 
> This is a follow-up to another patch I posted about libpq confusing
> documentation & psql resulting behavior under host/hostaddr settings.
> 
> Although the first mostly documentation patch did not gather much
> enthusiasm, I still think both issues deserve a fix.

I checked this patch, and it looks good. The documentation is correct,
all tests passed. It does what is proposed.


I think so some redundant messages can be reduced - see function 
printConnInfo - attached patch


I have few notes about patches.


/* If the host is an absolute path, the connection is via socket unless
 * overriden by hostaddr */
if (is_absolute_path(host))
	if (hostaddr && *hostaddr)
		printf(_("You are connected to database \"%s\" as user \"%s\" on address \"%s\" at port \"%s\".\n"),
			   db, PQuser(pset.db), hostaddr, PQport(pset.db));
	else
		printf(_("You are connected to database \"%s\" as user \"%s\" via socket in \"%s\" at port \"%s\".\n"),
			   db, PQuser(pset.db), host, PQport(pset.db));
else
	if (hostaddr && *hostaddr && strcmp(host, hostaddr) != 0)
		printf(_("You are connected to database \"%s\" as user \"%s\" on host \"%s\" (address \"%s\") at port \"%s\".\n"),
			   db, PQuser(pset.db), host, hostaddr, PQport(pset.db));
	else
		printf(_("You are connected to database \"%s\" as user \"%s\" on host \"%s\" at port \"%s\".\n"),
			   db, PQuser(pset.db), host, PQport(pset.db));


I think there is a lack of the necessary braces here for the first if and
second else branches. This is true for both patches.



/* set connip */
if (conn->connip != NULL)
{
	free(conn->connip);
	conn->connip = NULL;
}

{
	char	host_addr[NI_MAXHOST];

	getHostaddr(conn, host_addr);
	if (strcmp(host_addr, "???") != 0)
		conn->connip = strdup(host_addr);
}


Maybe it is better to move this code into the PQhostaddr() function? 
Since connip is used only in PQhostaddr() it might be not worth to do 
additional work in main PQconnectPoll().
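
As a quick illustration, a minimal client exercising the new function might
look like this (a sketch that assumes a libpq built with the patch; the
connection string is only an example):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* host resolves normally; hostaddr forces the address actually used */
	PGconn	   *conn = PQconnectdb("host=foo hostaddr=127.0.0.1 dbname=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	printf("host: %s\n", PQhost(conn));			/* "foo" */
	printf("hostaddr: %s\n", PQhostaddr(conn));	/* "127.0.0.1" */

	PQfinish(conn);
	return 0;
}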


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: valgrind on initdb

2018-11-07 Thread Tomas Vondra



On 11/7/18 2:47 PM, John Naylor wrote:
> On 11/7/18, Jesper Pedersen  wrote:
>> Hi,
>>
>> While looking at [1] (included in 23315.log) there are other warnings as
>> well.
> 
> Perhaps it's worth revisiting to make debugging easier, but right now
> initdb.c has this comment:
> 
>  * Note:
>  * The program has some memory leakage - it isn't worth cleaning it up.
> 

Maybe. I certainly don't think it's worth the time merely for the
sake of fixing the memory leaks. The reasoning here is that initdb runs
for a short period of time (a couple of seconds, really), and the memory
gets released when the process exits. And the leaks are tiny in general
- a couple of bytes here and there. Had there been a massive memory leak
that would change the equation of course.

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pread() and pwrite()

2018-11-07 Thread Jesper Pedersen

Hi,

On 11/7/18 7:26 AM, Jesper Pedersen wrote:

On 11/6/18 4:04 PM, Thomas Munro wrote:

On Wed, Nov 7, 2018 at 4:42 AM Jesper Pedersen
Thanks!  Pushed.  I'll keep an eye on the build farm to see if
anything breaks on Cygwin or some other frankenOS.



There is [1] on Andres' skink setup. Looking.



Attached is a reproducer.

Adding the memset() command for the page makes valgrind happy.

Thoughts on how to proceed with this ? The report in [1] shows that 
there are a number of call sites where the page(s) aren't fully initialized.


[1] 
https://www.postgresql.org/message-id/3fe1e38a-fb70-6260-9300-ce67ede21c32%40redhat.com


Best regards,
 Jesper
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, const char* argv[])
{
   FILE* f;
   int fd;
   void* m;
   ssize_t written;

   f = fopen("test.bin", "w+");
   if (f == NULL)
   {
  printf("Cannot open file\n");
  exit(0);
   }
   fd = fileno(f);
   m = malloc((size_t)8192);
   // memset(m, 0, (size_t)8192); <-- will make valgrind succeed

   written = pwrite(fd, m, (size_t)8192, (off_t)0);
   
   printf("Written: %i\n", written);

   free(m);
   fclose(f);
   
   return 0;
}
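
For reference, compiling the above with something like "gcc -g repro.c -o
repro" and running "valgrind --track-origins=yes ./repro" (assumption: any
reasonably recent valgrind) yields the "Syscall param pwrite64(buf) points to
uninitialised byte(s)" complaint unless the memset() line is uncommented.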


Fedora 29 vs linux collation tests

2018-11-07 Thread Andrew Dunstan



Yesterday I did a long overdue upgrade to the machine that runs 
buildfarm animal crake. After some hiccups it's all working, except for 
the linux utf8 collation tests. See 



where you can see this:

  SET lc_time TO 'tr_TR';
+ ERROR:  invalid value for parameter "lc_time": "tr_TR"


`locale -a | grep tr_TR` on the machine reports:

tr_TR
tr_TR.iso88599
tr_TR.utf8


so I'm not sure what's going on here.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-07 Thread Robert Haas
On Tue, Nov 6, 2018 at 10:18 PM Alvaro Herrera  wrote:
> > > Throw tuples destined for that partition away?
> > Surely not.  (/me doesn't beat straw men anyway.)
>
> Hmm, apparently this can indeed happen with my patch :-(

D'oh.  This is a hard problem, especially the part of it that involves
handling detach, so I wouldn't feel too bad about that.  However, to
beat this possibly-dead horse a little more, I think you made the
error of writing a patch that (1) tried to solve too many problems at
once and (2) didn't seem to really have a clear, well-considered idea
about what the semantics ought to be.

This is not intended as an attack; I want to work with you to solve
the problem, not have a fight about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Connection limit doesn't work for superuser

2018-11-07 Thread Robert Haas
On Wed, Nov 7, 2018 at 7:20 AM Andrey Borodin  wrote:
> > These clauses determine whether the new role is a “superuser”, who can
> > override all access restrictions within the database.
> Do we consider the connection limit an "access restriction"? A superuser can
> avoid setting his connection limit if he does not need it.

I don't think we should consider something that prevents you from
connecting to the database to be in the same category as something
that limits what you can do once you are connected.  IOW, +1 to the
original proposal from me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: valgrind on initdb

2018-11-07 Thread John Naylor
On 11/7/18, Jesper Pedersen  wrote:
> Hi,
>
> While looking at [1] (included in 23315.log) there are other warnings as
> well.

Perhaps it's worth revisiting to make debugging easier, but right now
initdb.c has this comment:

 * Note:
 *   The program has some memory leakage - it isn't worth cleaning it up.

-John Naylor



Re: valgrind on initdb

2018-11-07 Thread Jesper Pedersen

On 11/7/18 8:08 AM, Jesper Pedersen wrote:
While looking at [1] (included in 23315.log) there are other warnings as 
well.




On commit 77366d90.

Best regards,
 Jesper



Re: fix psql \conninfo & \connect when using hostaddr

2018-11-07 Thread Pavel Stehule
Hi

On Fri, Oct 26, 2018 at 15:55, Fabien COELHO wrote:

>
> Hello,
>
> This is a follow-up to another patch I posted about libpq confusing
> documentation & psql resulting behavior under host/hostaddr settings.
>
> Although the first mostly documentation patch did not gather much
> enthusiasm, I still think both issues deserve a fix.
>
> About updating psql's behavior, without this patch:
>
>sh> psql "host=foo hostaddr=127.0.0.1"
>
>psql> \conninfo
>You are connected to database "fabien" as user "fabien" on host "foo"
> at port "5432".
> # NOPE, I'm really connected to localhost, foo does not even exist
> # Other apparent inconsistencies are possible when hostaddr overrides
> # "host" which is an socket directory or an IP.
>
>psql> \c template1
>could not translate host name "foo" to address: Name or service not
> known
>Previous connection kept
> # hmmm what is the meaning of reusing a connection?
> # this issue was pointed out by Arthur Zakirov
>
> After the patch:
>
>sh> psql "host=foo hostaddr=127.0.0.1"
>
>psql> \conninfo
>You are connected to database "fabien" as user "fabien" on host "foo"
> (address "127.0.0.1") at port "5432".
># better
>
>psql> \c template1
>You are now connected to database "template1" as user "fabien".
># thanks
>
> The patch adds a PQhostaddr() function to libpq which reports the
> "hostaddr" setting or the current server ip. The function is used by psql
> for \conninfo and when reusing parameters for \connect.
>
> The messages are slightly more verbose because the IP is output. I think
> that user asking for conninfo should survive to the more precise data.
> This also comes handy if a host name resolves to several IPs (eg IPv6 and
> IPv4, or several IPs...).
>
>
I checked this patch, and it looks good. The documentation is correct, all
tests passed. It does what is proposed.

I think so some redundant messages can be reduced - see function
printConnInfo - attached patch

If there are no objections, I'll mark this patch as ready for committers

Regards

Pavel


> --
> Fabien.
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 601091c570..a7c9f2b400 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1735,6 +1735,35 @@ char *PQhost(const PGconn *conn);
  
 
 
+
+
+ 
+  PQhostaddr
+  
+   PQhostaddr
+  
+ 
+
+ 
+  
+   Returns the actual server IP address of the active connection.
+   This can be the address a host name resolved to, or an IP address
+   provided through the hostaddr parameter.
+
+char *PQhostaddr(const PGconn *conn);
+
+  
+
+  
+   PQhostaddr returns NULL if the
+   conn argument is NULL.
+   Otherwise, if there is an error producing the host information (perhaps
+   if the connection has not been fully established or there was an
+   error), it returns an empty string.
+  
+ 
+
+
 
  
   PQport
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 0dea54d3ce..5b361c9484 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -158,6 +158,7 @@ static void print_with_linenumbers(FILE *output, char *lines,
 	   const char *header_keyword);
 static void minimal_error_message(PGresult *res);
 
+static void printConnInfo(void);
 static void printSSLInfo(void);
 static bool printPsetInfo(const char *param, struct printQueryOpt *popt);
 static char *pset_value_string(const char *param, struct printQueryOpt *popt);
@@ -595,15 +596,7 @@ exec_command_conninfo(PsqlScanState scan_state, bool active_branch)
 			printf(_("You are currently not connected to a database.\n"));
 		else
 		{
-			char	   *host = PQhost(pset.db);
-
-			/* If the host is an absolute path, the connection is via socket */
-			if (is_absolute_path(host))
-printf(_("You are connected to database \"%s\" as user \"%s\" via socket in \"%s\" at port \"%s\".\n"),
-	   db, PQuser(pset.db), host, PQport(pset.db));
-			else
-printf(_("You are connected to database \"%s\" as user \"%s\" on host \"%s\" at port \"%s\".\n"),
-	   db, PQuser(pset.db), host, PQport(pset.db));
+			printConnInfo();
 			printSSLInfo();
 		}
 	}
@@ -2854,6 +2847,7 @@ do_connect(enum trivalue reuse_previous_specification,
 	PGconn	   *o_conn = pset.db,
 			   *n_conn;
 	char	   *password = NULL;
+	char	   *hostaddr = NULL;
 	bool		keep_password;
 	bool		has_connection_string;
 	bool		reuse_previous;
@@ -2894,12 +2888,22 @@ do_connect(enum trivalue reuse_previous_specification,
 	}
 
 	/* grab missing values from the old connection */
-	if (!user && reuse_previous)
-		user = PQuser(o_conn);
-	if (!host && reuse_previous)
-		host = PQhost(o_conn);
-	if (!port && reuse_previous)
-		port = PQport(o_conn);
+	if (reuse_previous)
+	{
+		if (!user)
+			user = PQuser(o_conn);
+		if (host && strcmp(host, PQhost(o_conn)) == 0)
+			/* if we are targetting the same host, reuse 

Re: Support custom socket directory in pg_upgrade

2018-11-07 Thread Daniel Gustafsson
> On 6 Nov 2018, at 09:19, Thomas Munro  wrote:
> 
> On Wed, Oct 10, 2018 at 9:27 AM Daniel Gustafsson  > wrote:
>>> On 9 Oct 2018, at 16:22, Tom Lane  wrote:
>>> Daniel Gustafsson  writes:
 Having hit the maximum socketdir length error a number of times in 
 pg_upgrade,
 especially when running tests in a deep directory hierarchy, I figured it 
 was
 time to see if anyone else has had the same problem?  The attached patch is
 what I run with locally to avoid the issue, it adds a --socketdir=PATH 
 option
 to pg_upgrade which overrides the default use of CWD.  Is that something 
 that
 could be considered?
>>> 
>>> I think you could simplify matters if you installed the CWD default value
>>> during option processing.
>> 
>> The attached v2 tries to make the socketdir more like the other configurable
>> directories in pg_upgrade (adding an envvar for it etc).  Is that more in 
>> line
>> with what you were suggesting?  make -C src/bin/pg_upgrade check passes with
>> this, both unmodified and with a -s in the test script to override it.  Also
>> fixed incorrect syntax in the docs part from v1.
> 
> I think PGSOCKETDIR should be mentioned in the documentation like the
> other environment variables,

Of course, fixed.

> and also I'm wondering if it actually
> works: you set it to the current working directory first, then parse
> the command line option if present, and then read the env var only if
> not already set: but it's always going to be, isn't it?  Perhaps you
> should use getcwd() only if all else fails?

Yes, you’re right, I had a thinko in my patch as well as in the testing of it.
The attached version sets cwd as the default in case all else fails.  Extending
check_required_directory() to do this may not be to everyone's liking, but it
seemed the cleanest option to me.
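
To spell out the intended precedence, here is a standalone sketch of the v3
behaviour (my own illustration, not the patch itself; getcwd(NULL, 0) relies
on the widespread glibc extension):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static const char *
resolve_socketdir(const char *cmdline_value)
{
	const char *dir = cmdline_value;	/* from --socketdir, may be NULL */

	if (dir == NULL || dir[0] == '\0')
		dir = getenv("PGSOCKETDIR");	/* environment variable fallback */
	if (dir == NULL || dir[0] == '\0')
		dir = getcwd(NULL, 0);			/* last resort: current directory */
	return dir;
}

int
main(void)
{
	printf("socketdir: %s\n", resolve_socketdir(NULL));
	return 0;
}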

cheers ./daniel



pg_upgrade_sockdir-v3.patch
Description: Binary data


valgrind on initdb

2018-11-07 Thread Jesper Pedersen

Hi,

While looking at [1] (included in 23315.log) there are other warnings as 
well.


I ran with

valgrind --leak-check=full --show-leak-kinds=all --gen-suppressions=all 
--suppressions=/path/to/postgresql/src/tools/valgrind.supp 
--time-stamp=yes --log-file=/tmp/valgrind/%p.log --trace-children=yes 
--track-origins=yes --read-var-info=yes initdb -D /tmp/pgsql --no-clean 
--no-sync --no-locale 2>&1 | tee /tmp/valgrind/initdb.log


[1] 
https://www.postgresql.org/message-id/cb062f4d-55b8-28be-b55f-2e593a3b3755%40redhat.com


Best regards,
 Jesper


initdb.tgz
Description: application/compressed-tar


Re: plruby: rb_iterate symbol clash with libruby.so

2018-11-07 Thread Pavel Raiskup
On Tuesday, November 6, 2018 7:28:21 PM CET Tom Lane wrote:
> I wrote:
> > Yeah.  The long and short of this is that we're trampling on namespace
> > that reasonably belongs to Ruby --- if they had some functions named
> > "pg_something" and complained about a collision with libpq, would we
> > change?  Nope.  So really we should rename these.
> 
> > Barring objections I'll go make this happen shortly.
> 
> Done.  I realized that the immediate problem, rb_iterate(), was only
> added as of PG v10, which may explain why we hadn't heard complaints
> about this till now.  So I've made the change only as far back as v10.
> In principle we could change the rbtree code in 9.5/9.6 as well, but
> I think that's more likely to create problems than fix any.

The 'rb_iterate' seems to exist at least in REL9_2_STABLE branch, so it is
probably much older.  No need to backpatch of course -- just saying.  That
said, I'm still not sure how this could work before ...  Maybe it has not
been working for some time.

Pavel






Re: pread() and pwrite()

2018-11-07 Thread Andrew Dunstan



On 11/7/18 7:26 AM, Jesper Pedersen wrote:

Hi Thomas,

On 11/6/18 4:04 PM, Thomas Munro wrote:

On Wed, Nov 7, 2018 at 4:42 AM Jesper Pedersen
Thanks!  Pushed.  I'll keep an eye on the build farm to see if
anything breaks on Cygwin or some other frankenOS.



There is [1] on Andres' skink setup. Looking.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2018-11-07%2001%3A01%3A01






And lousyjack, which uses a slightly different way of calling valgrind, 
and thus got past initdb, found a bunch more:



see 




cheers


andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pread() and pwrite()

2018-11-07 Thread Jesper Pedersen

Hi Thomas,

On 11/6/18 4:04 PM, Thomas Munro wrote:

On Wed, Nov 7, 2018 at 4:42 AM Jesper Pedersen
Thanks!  Pushed.  I'll keep an eye on the build farm to see if
anything breaks on Cygwin or some other frankenOS.



There is [1] on Andres' skink setup. Looking.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2018-11-07%2001%3A01%3A01


Best regards,
 Jesper


