Re: Why we allow CHECK constraint contradiction?

2018-10-09 Thread Corey Huinker
On Wed, Oct 10, 2018 at 1:44 AM David G. Johnston <david.g.johns...@gmail.com> wrote:

> On Tuesday, October 9, 2018, Imai, Yoshikazu <imai.yoshik...@jp.fujitsu.com> wrote:
>>
>> Are there any rows which can satisfy the ct's CHECK constraint? If not,
>> why do we allow creating a table when the check constraint itself is
>> contradictory?
>>
>
> I'd bet on it being a combination of complexity and insufficient expected
> benefit.  Time is better spent elsewhere.  Mathematically proving a
> contradiction in software is harder than reasoning about it mentally.
>

I've actually used that as a feature, in PostgreSQL and other databases,
where assertions were unavailable, or procedural code was unavailable or
against policy.

Consider the following:

CREATE TABLE wanted_values ( x integer );
INSERT INTO wanted_values VALUES (1), (2), (3);

CREATE TABLE found_values ( x integer );
INSERT INTO found_values VALUES (1), (3);

CREATE TABLE missing_values (
    x integer,
    CONSTRAINT contradiction CHECK (false)
);

INSERT INTO missing_values
SELECT x FROM wanted_values
EXCEPT
SELECT x FROM found_values;


gives the error

ERROR:  new row for relation "missing_values" violates check constraint "contradiction"
DETAIL:  Failing row contains (2).


This can be handy when you need to fail a transaction because of bad data
and don't have branching logic available.
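
A reusable variant of the same trick, as a rough sketch (assert_failed,
orders and customers are illustrative names, not from a real schema):

CREATE TABLE assert_failed (
    detail text,
    CONSTRAINT assertion CHECK (false)
);

-- any row reaching this INSERT aborts the transaction, so the
-- INSERT ... SELECT doubles as a poor man's assertion
INSERT INTO assert_failed
SELECT 'order without customer: ' || o.id
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE c.id IS NULL;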


Re: Why we allow CHECK constraint contradiction?

2018-10-09 Thread David G. Johnston
On Tuesday, October 9, 2018, Imai, Yoshikazu wrote:
>
> Are there any rows which can satisfy the ct's CHECK constraint? If not,
> why do we allow creating a table when the check constraint itself is
> contradictory?
>

I'd bet on it being a combination of complexity and insufficient expected
benefit.  Time is better spent elsewhere.  Mathematically proving a
contradiction in software is harder than reasoning about it mentally.

David J.


Why we allow CHECK constraint contradiction?

2018-10-09 Thread Imai, Yoshikazu
Hi, all.

I wonder about the behaviour of creating a table which has a contradictory
constraint.

I created below table.

bugtest=# create table ct (a int, CHECK(a is not null and a >= 0 and a < 100 and a >= 200 and a < 300));
bugtest=# \d+ ct
                                    Table "public.ct"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              | 
Check constraints:
    "ct_a_check" CHECK (a IS NOT NULL AND a >= 0 AND a < 100 AND a >= 200 AND a < 300)


Are there any rows which can satisfy the ct's CHECK constraint? If not, why do
we allow creating a table when the check constraint itself is contradictory?


I originally noticed this while creating a partitioned range table, as below.

bugtest=# create table rt (a int) partition by range (a);
bugtest=# create table rt_sub1 partition of rt for values from (0) to (100) partition by range (a);
bugtest=# create table rt_sub2 partition of rt for values from (100) to (200) partition by range (a);
bugtest=# create table rt150 partition of rt_sub1 for values from (150) to (151);
bugtest=# \d+ rt_sub1
                                  Table "public.rt_sub1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              | 
Partition of: rt FOR VALUES FROM (0) TO (100)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 100))
Partition key: RANGE (a)
Partitions: rt150 FOR VALUES FROM (150) TO (151)

bugtest=# \d+ rt150
                                   Table "public.rt150"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              | 
Partition of: rt_sub1 FOR VALUES FROM (150) TO (151)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 100) AND (a IS NOT NULL) AND (a >= 150) AND (a < 151))


No rows are routed to rt150 through rt, nor can we insert any rows into rt150
directly, because of its constraints. If we added a check for contradictory
constraints, it would prevent us from accidentally creating useless tables
like rt150 above, which can never contain any rows.
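
For instance, with the tables above (a sketch; the exact error text may
differ):

bugtest=# insert into rt values (150);
ERROR:  no partition of relation "rt_sub2" found for row
bugtest=# insert into rt150 values (150);
ERROR:  new row for relation "rt150" violates partition constraint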

I thought there might be a discussion or documentation about this, but I
couldn't find any. If there is, please point me to it.

Thanks,

--
Yoshikazu Imai





Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

2018-10-09 Thread Masahiko Sawada
On Wed, Oct 3, 2018 at 6:02 PM Chris Travers  wrote:
>
>
>
> On Wed, Oct 3, 2018 at 9:41 AM Chris Travers  wrote:
>>
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, failed
>> Implements feature:   not tested
>> Spec compliant:   not tested
>> Documentation:tested, failed
>>
>> I am hoping I am not out of order in writing this before the commitfest
>> starts.  The patch is big and long, so I wanted to start on this while
>> traffic is slow.
>>
>> I find this patch quite welcome and very close to a minimum viable version.  
>> The few significant limitations can be resolved later.  One thing I may have 
>> missed in the documentation is a discussion of the limits of the current 
>> approach.  I think this would be important to document because the caveats 
>> of the current approach are significant, but the people who need it will 
>> have the knowledge to work with issues if they come up.
>>
>> The major caveat I see in our past discussions and (if I read the patch
>> correctly) in the code is that the resolver goes through global transactions
>> sequentially and does not move on to the next until the previous one is 
>> resolved.  This means that if I have a global transaction on server A, with 
>> foreign servers B and C, and I have another one on server A with foreign 
>> servers C and D, if server B goes down at the wrong moment, the background 
>> worker does not look like it will detect the failure and move on to try to 
>> resolve the second, so server D will have a badly set vacuum horizon until 
>> this is resolved.  Also if I read the patch correctly, it looks like one can 
>> invoke SQL commands to remove the bad transaction to allow processing to 
>> continue and manual resolution (this is good and necessary because in this 
>> area there is no ability to have perfect recoverability without occasional 
>> administrative action).  I would really like to see more documentation of 
>> failure cases and appropriate administrative action at present.  Otherwise 
>> this is I think a minimum viable addition and I think we want it.
>>
>> It is possible I missed that in the documentation.  If so, my objection
>> stands aside.  If such documentation is welcome, I am happy to take a first
>> crack at it.
>

Thank you for reviewing the patch!

>
> After further testing I am pretty sure I misread the patch.  It looks like 
> one can have multiple resolvers which can, in fact, work through a queue 
> together solving this problem.  So the objection above is not valid and I 
> withdraw that objection.  I will re-review the docs in light of the 
> experience.

Actually, the patch doesn't solve this problem; the foreign transaction
resolver processes distributed transactions sequentially. But since
one resolver process is responsible for one database, a backend
connected to another database can complete its distributed
transaction. I understood your concern and agree we should solve this
problem. I'll address it in the next patch.

>
>>
>>
>> To my mind that's the only blocker in the code (but see below).  I can say
>> without a doubt that we would use this feature once it is available.
>>
>> --
>>
>> Testing however failed.
>>
>> make installcheck-world fails with errors like the following:
>>
>>  -- Modify foreign server and raise an error
>>   BEGIN;
>>   INSERT INTO ft7_twophase VALUES(8);
>> + ERROR:  prepread foreign transactions are disabled
>> + HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>   INSERT INTO ft8_twophase VALUES(NULL); -- violation
>> ! ERROR:  current transaction is aborted, commands ignored until end of 
>> transaction block
>>   ROLLBACK;
>>   SELECT * FROM ft7_twophase;
>> ! ERROR:  prepread foreign transactions are disabled
>> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>   SELECT * FROM ft8_twophase;
>> ! ERROR:  prepread foreign transactions are disabled
>> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>   -- Rollback foreign transaction that involves both 2PC-capable
>>   -- and 2PC-non-capable foreign servers.
>>   BEGIN;
>>   INSERT INTO ft8_twophase VALUES(7);
>> + ERROR:  prepread foreign transactions are disabled
>> + HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>   INSERT INTO ft9_not_twophase VALUES(7);
>> + ERROR:  current transaction is aborted, commands ignored until end of 
>> transaction block
>>   ROLLBACK;
>>   SELECT * FROM ft8_twophase;
>> ! ERROR:  prepread foreign transactions are disabled
>> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>
>> make installcheck in the contrib directory shows the same, so that's the 
>> easiest way of reproducing, at least on a new installation.  I think the 
>> test cases will have to handle that sort of setup.

'make installcheck' is a regression test mode that runs the tests against an
existing installation. If the installation 

Re: partition tree inspection functions

2018-10-09 Thread Michael Paquier
On Wed, Oct 10, 2018 at 11:54:48AM +0900, Amit Langote wrote:
> I was partly wrong in saying that we wouldn't need any changes to support
> partitioned indexes here.  Actually, the core function
> find_inheritance_children wouldn't scan pg_inherits for us if we pass a
> (partitioned) index to it, even if the index has children.
>
> It appears that we don't set relhassubclass for partitioned indexes (that
> is, on parent indexes), so the above test is useless for indexes.

Aha, so that was it.

> Maybe, we need to start another thread to discuss whether we should be
> manipulating relhassubclass for index partitioning (I'd brought this up in
> the context of relispartition, btw [1]).  I'm not immediately sure if
> setting relhassubclass correctly for indexes will have applications beyond
> this one, but it might be something we should let others know so that we
> can hear more opinions.

This reminds me of https://postgr.es/m/20180226053613.gd6...@paquier.xyz,
which has resulted in relhaspkey being removed from pg_class with
f66e8bf8.  I would much prefer if we actually removed it...  Now
relhassubclass is more tricky than relhaspkey as it gets used as a
fast-exit path for a couple of things:
1) prepunion.c when planning for child relations.
2) plancat.c
3) analyze.c
4) rewriteHandler.c

> For time being, I've modified that code as follows:
> 
> @@ -68,9 +70,11 @@ find_inheritance_children(Oid parentrelId, LOCKMODE
> lockmode)
> 
>  /*
>   * Can skip the scan if pg_class shows the relation has never had a
> - * subclass.
> + * subclass.  Since partitioned indexes never have their
> + * relhassubclass set, we cannot skip the scan based on that.
>   */
> -if (!has_subclass(parentrelId))
> +if (get_rel_relkind(parentrelId) != RELKIND_PARTITIONED_INDEX &&
> +!has_subclass(parentrelId))
>  return NIL;

I don't like living with this hack.  What if we simply nuked this
fast-path exit altogether?  The only code path where this could
represent a performance issue is RelationBuildPartitionKey(), but we
expect a partitioned table to have children anyway, and there will be
few cases where partitioned tables have no partitions.
--
Michael




Re: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

2018-10-09 Thread Thomas Munro
On Wed, Oct 10, 2018 at 7:29 AM Robert Haas  wrote:
> On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro wrote:
> > 0.  The default SIGTERM handler for bgworkers is bgworker_die(), which
> > generates a FATAL ereport "terminating background worker \"%s\" due to
> > administrator command", directly in the signal handler (hmm, see also
> > recent discussions about the legality of similar code in quickdie()).
>
> Yeah, I think that code is bad news, for the same reasons discussed
> with regard to quickdie().

So I suppose we should just remove it, with something like 0002.  I'm
a bit uneasy about existing code out there that might not be calling
CFI.  OTOH I suspect that a lot of code copied worker_spi.c and
installed its own handler.

> > 1.  Some processes install their own custom SIGTERM handler that sets
> > a flag, and use that to break out of their main loop and exit quietly.
> > Example: autovacuum.c, or the open-source pg_cron extension, and
> > probably many other things that just want a nice quiet shutdown.
> >
> > 2.  Some processes install the standard SIGTERM handler die(), and
> > then use CHECK_FOR_INTERRUPTS() to break out of their main loop.  By
> > default this looks like "FATAL:  terminating connection due to
> > administrator command".  This approach is effectively required if the
> > main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
> > loop.  For example, a "launcher"-type process that calls
> > WaitForBackgroundWorkerStartup() could hang forever if it used
> > approach 1 (ask me how I know).
>
> My experience with background workers has been that approach #1 does
> not really work.  I mean, if you aren't doing anything complicated it
> might be OK, if for example you are executing SQL queries, and you try
> to do #1, then your SQL queries don't respond to interrupts.  And that
> sucks.  So I have generally adopted approach #2.

Maybe src/test/modules/worker_spi.c shouldn't use approach #1 (even if
it might technically be OK for that code)?  I think I might have been
guilty of copying that.
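
For reference, the approach #1 pattern as it appears (roughly) in
worker_spi.c is a flag set from the handler and checked in the main loop:

static volatile sig_atomic_t got_sigterm = false;

static void
worker_spi_sigterm(SIGNAL_ARGS)
{
    int         save_errno = errno;

    got_sigterm = true;
    SetLatch(MyLatch);
    errno = save_errno;
}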

> To put that another way, nearly everything can reach
> CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
> is the case isn't much different from saying that it is required, full
> stop.

> > 3.  Some system processes generally use approach 2, but have a special
> > case in ProcessInterrupts() to suppress or alter the usual FATAL
> > message or behaviour.  This includes the logical replication launcher.
>
> Maybe we could replace this by a general-purpose hook.  So instead of
> the various tests for process types that are there now, we would just
> have
>
> if (procdie_hook != NULL)
> (*procdie_hook)();
>
> And that hook can do whatever you like (but probably including dying).

Ok, patch 0001 is a sketch like that, for discussion.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Add-proc_die_hook-to-customize-die-interrupt-handlin.patch
Description: Binary data


0002-Remove-async-signal-unsafe-bgworker_die-function.patch
Description: Binary data


Re: partition tree inspection functions

2018-10-09 Thread Amit Langote
On 2018/10/09 20:17, Michael Paquier wrote:
> On Tue, Oct 09, 2018 at 07:20:40PM +0900, Amit Langote wrote:
>> Sorry if I'm misunderstanding something, but why would we need a new
>> clone?  If we don't change pg_partition_tree() to only accept tables
>> (regular/partitioned/foreign tables) as input, then the same code can work
>> for indexes as well.  As long as we store partition relationship in
>> pg_inherits, same code should work for both.
> 
> Okay, I see what you are getting at.  However, please note that even if I
> can see the dependencies in pg_inherits for indexes, the patch still
> needs some work, I am afraid, if your intention is to do so:
> CREATE TABLE ptif_test (a int, b int) PARTITION BY range (a);
> create index ptif_test_i on ptif_test (a);
> CREATE TABLE ptif_test0 PARTITION OF ptif_test
>   FOR VALUES FROM (minvalue) TO (0) PARTITION BY list (b);
> 
> =# select relid::regclass, parentrelid::regclass from
>      pg_partition_tree('ptif_test'::regclass);
>    relid    | parentrelid
> ------------+-------------
>  ptif_test  | null
>  ptif_test0 | ptif_test
> (2 rows)
> =# select relid::regclass, parentrelid::regclass from
>      pg_partition_tree('ptif_test_i'::regclass);
>     relid    | parentrelid
> -------------+-------------
>  ptif_test_i | null
> (1 row)
> 
> In this case I would have expected ptif_test0_a_idx to be listed, with
> ptif_test_i as a parent.

I was partly wrong in saying that we wouldn't need any changes to support
partitioned indexes here.  Actually, the core function
find_inheritance_children wouldn't scan pg_inherits for us if we pass a
(partitioned) index to it, even if the index has children.  That's
because of this quick-return code at the beginning of it:

/*
 * Can skip the scan if pg_class shows the relation has never had a
 * subclass.
 */
if (!has_subclass(parentrelId))
return NIL;

It appears that we don't set relhassubclass for partitioned indexes (that
is, on parent indexes), so the above test is useless for indexes.

Maybe, we need to start another thread to discuss whether we should be
manipulating relhassubclass for index partitioning (I'd brought this up in
the context of relispartition, btw [1]).  I'm not immediately sure if
setting relhassubclass correctly for indexes will have applications beyond
this one, but it might be something we should let others know so that we
can hear more opinions.

For time being, I've modified that code as follows:

@@ -68,9 +70,11 @@ find_inheritance_children(Oid parentrelId, LOCKMODE
lockmode)

 /*
  * Can skip the scan if pg_class shows the relation has never had a
- * subclass.
+ * subclass.  Since partitioned indexes never have their
+ * relhassubclass set, we cannot skip the scan based on that.
  */
-if (!has_subclass(parentrelId))
+if (get_rel_relkind(parentrelId) != RELKIND_PARTITIONED_INDEX &&
+!has_subclass(parentrelId))
 return NIL;

> isleaf is of course wrong if the input is a partitioned index, so more
> regression tests to cover those cases would be nice.

Yeah, I fixed that:

@@ -111,7 +111,8 @@ pg_partition_tree(PG_FUNCTION_ARGS)
 nulls[1] = true;

 /* isleaf */
-values[2] = BoolGetDatum(relkind != RELKIND_PARTITIONED_TABLE);
+values[2] = BoolGetDatum(relkind != RELKIND_PARTITIONED_TABLE &&
+ relkind != RELKIND_PARTITIONED_INDEX);

On 2018/10/09 20:25, Michael Paquier wrote:
> Also, the call to find_all_inheritors needs AccessShareLock...  NoLock
> is not secure.

I had thought about that, but don't remember why I made up my mind about
not taking any lock here.  Maybe we should take locks.  Fixed.

Attached updated patch.  I updated docs and tests to include partitioned
indexes.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/12085bc4-0bc6-0f3a-4c43-57fe0681772b%40lab.ntt.co.jp
From 46d87e1c44e0a94b72da1393b2922e28f41c22c8 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 5 Oct 2018 14:41:17 +0900
Subject: [PATCH v15] Add pg_partition_tree to display information about
 partitions

This new function is useful to display a full tree of partitions with a
partitioned table given as input, and avoids the need for any complex
WITH RECURSIVE query when looking at partition trees which are multi-level
deep.

It returns a set of records, one for each partition, containing the
partition OID, its immediate parent OID, whether the relation is a leaf in
the tree, and its level in the partition tree with the given table considered
as root, beginning at zero for the root and incrementing by one each
time the scan goes one level down.

This commit also changes find_inheritance_children so that it can
be used with partitioned indexes.

Author: Amit Langote
Reviewed-by: Jesper Pedersen, Michael Paquier
Discussion: 
https://postgr.es/m/8d00e51a-9a51-ad02-d53e-ba6bf50b2...@lab.ntt.co.jp
---
 doc/src/sgml/func.sgml   |  43 
 

Re: Refactor textToQualifiedNameList()

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 09:54:12AM -0300, Alvaro Herrera wrote:
> The committer of such a change will get a lot of flak for changing the
> #include requirements for code that calls that function, though.

So the patch has been switched to rejected...
--
Michael




Re: out-of-order XID insertion in KnownAssignedXids

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 05:26:49PM +0300, Konstantin Knizhnik wrote:
> But GetRunningTransactionData may be used in some other cases... For
> example I have used it in my snapfs Postgres extensions (fast
> file-level snapshots) to check if there are no more active
> transactions in the system. In my case presence of duplicates is not
> important and ordering of XIDs is not needed.  But performance of
> GetRunningTransactionData is also not critical. But I can suspect that
> there can be cases when it is critical and doing unneeded qsort is not
> desired.  There may be thousands of active transactions and sorting
> their XIDs is notinstantaneous.

I am not sure the performance argument is actually that compelling (it
could be as you say), but another thing we could argue
is that the presence of duplicate entries in
GetRunningTransactionData() can be used as a hint that
2PC transactions are running, and that of the two entries, one is a
dummy while the other is pending to be cleared.

> So I completely understand the argument that it is better to eliminate
> the source of the problem (presence of duplicates in the RUNNING_XACTS
> record), but I am not sure that in this particular case it is really the
> best solution. If the presence of duplicates is considered acceptable for
> procarray, why can we not accept it for the RUNNING_XACTS record?

This won't solve the case where records have already been generated and
won't be processed correctly, but based on the chances this can happen
we can rather safely say that fixing only the source would be fine.

> I am not insisting that skipping duplicates in ProcArrayApplyRecoveryInfo is
> the right place (but at least it seems to be the most efficient
> alternative). But I also do not feel that moving the sort to
> GetRunningTransactionData and eliminating duplicates there (which requires
> one more scan and copying of the whole array) is fundamentally better...

Making recovery a couple of instructions shorter is always a good thing
in my opinion.  Users don't care much if backups take a long time, and
sweat less if restores take a shorter time.

Hence what about doing roughly the following:

1) Document that GetRunningTransactionData() fetches information also
about 2PC entries, like that:
  * GetRunningTransactionData -- returns information about running transactions.
  *
  * Similar to GetSnapshotData but returns more information. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes.
+ * all PGXACTs with an assigned TransactionId, even VACUUM processes and
+ * dummy two-phase related entries created when preparing the transaction.

2) Update LogStandbySnapshot so that the list of XIDs fetched is sorted
and ordered.  This can be used as a sanity check for recovery via
ProcArrayApplyRecoveryInfo().
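
In GetRunningTransactionData() that would be roughly the following (a
sketch, reusing the existing xidComparator):

    /* Sort the XID array so recovery sees it in a predictable order. */
    qsort(CurrentRunningXacts->xids,
          CurrentRunningXacts->xcnt + CurrentRunningXacts->subxcnt,
          sizeof(TransactionId), xidComparator);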
--
Michael




Re: TAP tests for pg_verify_checksums

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 05:14:50PM +0200, Michael Banck wrote:
> On Tuesday 2018-10-09 at 16:54 +0200, Peter Eisentraut wrote:
>> On 06/10/2018 13:46, Michael Paquier wrote:
> 
>>> +# Time to create a corruption
> 
> That looks a bit weird, maybe "some corruption"? Or maybe it's just me
> not being a native speaker.

Okay, let's go with your suggestion.

>> I would also like to see a test that runs against a cluster without
>> checksums enabled.

OK.  I have added a test within initdb to save one initdb run.

>> +# Checks cannot happen for an online cluster
>> +$node->start;
>> +command_fails(['pg_verify_checksums',  '-D', $pgdata],
>> + "checksum checks not done");
>>
>> The test name should be something like "fails with online cluster".

Done.  I have put more thought into those.

> One more thing we could check is the relfilenode after we corrupted it;
> it should also catch the corruption then. Or is that too trivial?

There is one as of v3:
+# Checksum checks on single relfilenode fail
+$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata, '-r',
+   $relfilenode_corrupted],
+ 1,
+ [qr/Bad checksums:.*1/],
+ [qr/checksum verification failed/],
+ '');

The resulting patch is attached.  Does that look good?
--
Michael
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 5dc629fd5e..759779adb2 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -8,7 +8,7 @@ use Fcntl ':mode';
 use File::stat qw{lstat};
 use PostgresNode;
 use TestLib;
-use Test::More tests => 18;
+use Test::More tests => 22;
 
 my $tempdir = TestLib::tempdir;
 my $xlogdir = "$tempdir/pgxlog";
@@ -58,6 +58,18 @@ mkdir $datadir;
 			"check PGDATA permissions");
 	}
 }
+
+# Control file should tell that data checksums are disabled by default.
+command_like(['pg_controldata', $datadir],
+			 qr/Data page checksum version:.*0/,
+			 'checksums are disabled in control file');
+# pg_verify_checksums fails with checksums disabled by default.  This is
+# not part of the tests included in pg_verify_checksums to save from
+# the creation of an extra instance.
+command_fails(
+	[ 'pg_verify_checksums', '-D', $datadir],
+	"pg_verify_checksums fails with data checksum disabled");
+
 command_ok([ 'initdb', '-S', $datadir ], 'sync only');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
diff --git a/src/bin/pg_verify_checksums/.gitignore b/src/bin/pg_verify_checksums/.gitignore
index d1dcdaf0dd..0e5e569a54 100644
--- a/src/bin/pg_verify_checksums/.gitignore
+++ b/src/bin/pg_verify_checksums/.gitignore
@@ -1 +1,3 @@
 /pg_verify_checksums
+
+/tmp_check/
diff --git a/src/bin/pg_verify_checksums/Makefile b/src/bin/pg_verify_checksums/Makefile
index d16261571f..cfe4ab1b8b 100644
--- a/src/bin/pg_verify_checksums/Makefile
+++ b/src/bin/pg_verify_checksums/Makefile
@@ -34,3 +34,9 @@ uninstall:
 clean distclean maintainer-clean:
 	rm -f pg_verify_checksums$(X) $(OBJS)
 	rm -rf tmp_check
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
diff --git a/src/bin/pg_verify_checksums/t/001_basic.pl b/src/bin/pg_verify_checksums/t/001_basic.pl
new file mode 100644
index 00..1fa2e12db2
--- /dev/null
+++ b/src/bin/pg_verify_checksums/t/001_basic.pl
@@ -0,0 +1,8 @@
+use strict;
+use warnings;
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_verify_checksums');
+program_version_ok('pg_verify_checksums');
+program_options_handling_ok('pg_verify_checksums');
diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl
new file mode 100644
index 00..dc29da09af
--- /dev/null
+++ b/src/bin/pg_verify_checksums/t/002_actions.pl
@@ -0,0 +1,69 @@
+# Do basic sanity checks supported by pg_verify_checksums using
+# an initialized cluster.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize node with checksums enabled.
+my $node = get_new_node('node_checksum');
+$node->init(extra => ['--data-checksums']);
+my $pgdata = $node->data_dir;
+
+# Control file should know that checksums are enabled.
+command_like(['pg_controldata', $pgdata],
+	 qr/Data page checksum version:.*1/,
+		 'checksums enabled in control file');
+
+# Checksums pass on a newly-created cluster
+command_ok(['pg_verify_checksums',  '-D', $pgdata],
+		   "succeeds with offline cluster");
+
+# Checks cannot happen with an online cluster
+$node->start;
+command_fails(['pg_verify_checksums',  '-D', $pgdata],
+			  "fails with online cluster");
+
+# Create table to corrupt and get its relfilenode
+$node->safe_psql('postgres',
+	"SELECT a INTO corrupt1 FROM generate_series(1,1) AS a;
+	ALTER TABLE corrupt1 SET (autovacuum_enabled=false);");
+
+my $file_corrupted = $node->safe_psql('postgres',
+	"SELECT 

RE: IDE setup and development features?

2018-10-09 Thread Tsunakawa, Takayuki
From: Mori Bellamy [mailto:m...@invoked.net]
> I'd like a few features when developing postgres -- (1) jump to definition
> of symbol (2) find references to symbol and (3) semantic autocompletion.

For 1), you can generate tags like:

[for vi]
$ src/tools/make_ctags
[for Emacs]
$ src/tools/make_etags

Cscope works for 2).
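
For example (standard cscope usage; some_symbol is a placeholder):

[build the cross-reference database]
$ cscope -b -R

[find references to a symbol from within vi]
:cs add cscope.out
:cs find s some_symbol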


Regards
Takayuki Tsunakawa



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-09 16:17:44 -0700, Mark Dilger wrote:
>> I have no objection, but I'm curious, when retiring a datatype and
>> associated functions, do the Oids that were assigned to them become
>> available for new uses, or do you have to expire them to avoid breaking
>> pg_upgrade and such?  Retiring built-in types and functions seems
>> rare enough that I've not seen how this is handled before.

> I don't really see a need for preserving them.

Yeah, should be fine to recycle them.

regards, tom lane



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Andres Freund
Hi,

On 2018-10-09 16:17:44 -0700, Mark Dilger wrote:
> > On Oct 9, 2018, at 12:22 PM, Andres Freund  wrote:
> > As discussed below (at [1]), I think we should remove $subject.  I plan
> > to do so, unless somebody protests soon-ish.  I thought it'd be better
> > to call attention to this in a new thread, to make sure people had a
> > chance to object.
> 
> I have no objection, but I'm curious, when retiring a datatype and
> associated functions, do the Oids that were assigned to them become
> available for new uses, or do you have to expire them to avoid breaking
> pg_upgrade and such?  Retiring built-in types and functions seems
> rare enough that I've not seen how this is handled before.

I don't really see a need for preserving them. pg_upgrade should fail
because the schema dump won't restore (as that has textual oids). You
could argue that external drivers could have the oids built in, but I
don't find that convincing, because they'd be in trouble for new types
etc anyway.

Greetings,

Andres Freund



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Mark Dilger



> On Oct 9, 2018, at 12:22 PM, Andres Freund  wrote:
> 
> In-Reply-To: <20180928223240.kgwc4czzzekrp...@alap3.anarazel.de>
> 
> Hi,
> 
> As discussed below (at [1]), I think we should remove $subject.  I plan
> to do so, unless somebody protests soon-ish.  I thought it'd be better
> to call attention to this in a new thread, to make sure people had a
> chance to object.

I have no objection, but I'm curious, when retiring a datatype and
associated functions, do the Oids that were assigned to them become
available for new uses, or do you have to expire them to avoid breaking
pg_upgrade and such?  Retiring built-in types and functions seems
rare enough that I've not seen how this is handled before.

mark





Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-10-09 Thread David Rowley
On 9 October 2018 at 15:49, Amit Langote  wrote:
> On 2018/10/05 21:55, David Rowley wrote:
>> I'm not so sure we need to zero the partition_tuple_slots[] array at
>> all since we always set a value there if there's a corresponding map
>> stored. I considered pulling that out, but in the end, I didn't as I
>> saw some Asserts checking it's been properly set by checking the
>> element  != NULL in nodeModifyTable.c and copy.c.  Perhaps I should
>> have just gotten rid of those Asserts along with the palloc0 and
>> subsequent memset after the expansion of the array. I'm undecided.
>
> Maybe it's a good thing that it's doing the same thing as with the
> child_to_parent_maps array, which is to zero-init it when allocated.

Perhaps, but the maps do need to be zeroed; the partition_tuple_slots
array does not, since we only access it when the parent-to-child map is
set.

In any case, since PARTITION_ROUTING_INITSIZE is just 8, it'll likely
not save much: for single-row INSERTs on a 64-bit machine it's really just
saving a memset(..., 0, 64), i.e. 8 pointers of 8 bytes each. So likely it
won't save more than a bunch of nanoseconds.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: IDE setup and development features?

2018-10-09 Thread David Fetter
On Tue, Oct 09, 2018 at 02:39:42PM -0700, Mori Bellamy wrote:
> Hi all,
> 
> I'd like a few features when developing postgres -- (1) jump to definition
> of symbol (2) find references to symbol and (3) semantic autocompletion.
> 
> Was wondering if anyone has had luck getting these three set up for any IDE
> or editor configuration? Personally, I can confirm vim + ctags seems to
> work for (1).

vim + ctags + cscope gets me to (2)

Ctags:

https://twitter.com/davidfetter/status/1013192487341441024

Cscope:

https://github.com/chazy/cscope_maps/blob/master/plugin/cscope_maps.vim

Semantic Autocompletion: I've heard good things about YCM, but I haven't tried 
it out yet.

http://valloric.github.io/YouCompleteMe/

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Make Windows print float exponents like everybody else?

2018-10-09 Thread Andres Freund
Hi,

On 2018-10-09 18:00:54 -0400, Tom Lane wrote:
> I can't believe I got no comments about this, so reposting it under
> a more attention-getting title ...
> 
> In https://www.postgresql.org/message-id/29037.1539021...@sss.pgh.pa.us
> I wrote:
> 
> > Also, we have quite a few variant expected-files that exist only to cater
> > for Windows' habit of printing three exponent digits where everybody else
> > prints just two.  It struck me that it would not be hard, or expensive,
> > to undo that choice in snprintf.c (see attached untested patch).  So we
> > could considerably reduce future maintenance pain for the affected tests
> > by getting rid of those files.
> >
> > As against that, Windows users might possibly complain that float output
> > looks different than they're used to.  I'm not sure how much sympathy
> > I have for that position.  If we reimplement float output for more speed,
> > as is under discussion in nearby threads, I doubt we'd trouble to preserve
> > this Windows-ism in the rewrite.
> >
> > Comments?

No pushback here. Lower likelihood of turning the buildfarm red, lower
likelihood of being embarrassed. Good. ;)

Greetings,

Andres Freund



Make Windows print float exponents like everybody else?

2018-10-09 Thread Tom Lane
I can't believe I got no comments about this, so reposting it under
a more attention-getting title ...

In https://www.postgresql.org/message-id/29037.1539021...@sss.pgh.pa.us
I wrote:

> Also, we have quite a few variant expected-files that exist only to cater
> for Windows' habit of printing three exponent digits where everybody else
> prints just two.  It struck me that it would not be hard, or expensive,
> to undo that choice in snprintf.c (see attached untested patch).  So we
> could considerably reduce future maintenance pain for the affected tests
> by getting rid of those files.
>
> As against that, Windows users might possibly complain that float output
> looks different than they're used to.  I'm not sure how much sympathy
> I have for that position.  If we reimplement float output for more speed,
> as is under discussion in nearby threads, I doubt we'd trouble to preserve
> this Windows-ism in the rewrite.
>
> Comments?
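
For concreteness, the difference in question (output of the older MSVC
runtimes, from memory):

    printf("%e\n", 12345.678);
    /* Windows:    1.234568e+004 */
    /* elsewhere:  1.234568e+04  */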

If I don't get any pushback I'm going to go do that...

regards, tom lane



Re: IDE setup and development features?

2018-10-09 Thread Eren Başak
Hi Mori,

There was a talk about that last year:
https://speakerdeck.com/citusdata/hacking-postgresql-with-eclipse-pgconf-eu-2017-metin-doslu

On Tue, Oct 9, 2018 at 2:39 PM Mori Bellamy  wrote:

> Hi all,
>
> I'd like a few features when developing postgres -- (1) jump to definition
> of symbol (2) find references to symbol and (3) semantic autocompletion.
>
> Was wondering if anyone has had luck getting these three set up for any
> IDE or editor configuration? Personally, I can confirm vim + ctags seems to
> work for (1).
>
> --
> Thanks,
> Mori
>


IDE setup and development features?

2018-10-09 Thread Mori Bellamy
Hi all,

I'd like a few features when developing postgres -- (1) jump to definition
of symbol (2) find references to symbol and (3) semantic autocompletion.

Was wondering if anyone has had luck getting these three set up for any IDE
or editor configuration? Personally, I can confirm vim + ctags seems to
work for (1).

-- 
Thanks,
Mori


Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-09 17:27:17 -0400, Tom Lane wrote:
>> I complained about this already on the other thread, I think, but:
>> I do not think we should remove timeofday().  It's unrelated to these
>> datatypes and it offers functionality that isn't quite duplicated
>> elsewhere.

> Ok.  I find it a somewhat weird function, but I agree it's not strongly
> related. It only came up because it's implemented in nabstime.c.  Moving
> it to utils/adt/timestamp.c seems to make the most sense?

Sure, if you want to flush nabstime.c completely, just shove it over
there.

regards, tom lane



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Andres Freund
Hi,

On 2018-10-09 17:27:17 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > As discussed below (at [1]), I think we should remove $subject.  I plan
> > to do so, unless somebody protests soon-ish.  I thought it'd be better
> > to call attention to this in a new thread, to make sure people had a
> > chance to object.
> 
> I complained about this already on the other thread, I think, but:
> I do not think we should remove timeofday().  It's unrelated to these
> datatypes and it offers functionality that isn't quite duplicated
> elsewhere.

Ok.  I find it a somewhat weird function, but I agree it's not strongly
related. It only came up because it's implemented in nabstime.c.  Moving
it to utils/adt/timestamp.c seems to make the most sense?

Greetings,

Andres Freund



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Tom Lane
Andres Freund  writes:
> As discussed below (at [1]), I think we should remove $subject.  I plan
> to do so, unless somebody protests soon-ish.  I thought it'd be better
> to call attention to this in a new thread, to make sure people had a
> chance to object.

I complained about this already on the other thread, I think, but:
I do not think we should remove timeofday().  It's unrelated to these
datatypes and it offers functionality that isn't quite duplicated
elsewhere.

regards, tom lane



Re: [HACKERS] kqueue

2018-10-09 Thread Thomas Munro
On Tue, Oct 2, 2018 at 6:28 AM Andres Freund  wrote:
> On 2018-10-01 19:25:45 +0200, Matteo Beccati wrote:
> > On 01/10/2018 01:09, Thomas Munro wrote:
> > > I don't know why the existence of the kqueue should make recvfrom()
> > > slower on the pgbench side.  That's probably something to look into
> > > off-line with some FreeBSD guru help.  Degraded performance for
> > > clients on the same machine does seem to be a show stopper for this
> > > patch for now.  Thanks for testing!
> >
> > Glad to be helpful!
> >
> > I've tried running pgbench from a separate VM and in fact kqueue
> > consistently takes the lead with 5-10% more tps on select/prepared pgbench
> > on NetBSD too.
> >
> > What I have observed is that sys cpu usage is ~65% (35% idle) with kqueue,
> > while unpatched master averages at 55% (45% idle): relatively speaking
> > that's almost 25% less idle cpu available for a local pgbench to do its own
> > stuff.
>
> This suggests that either the wakeup logic between kqueue and poll,
> or the internal locking could be at issue.  Is it possible that poll
> triggers a directed wakeup path, but kqueue doesn't?

I am following up with some kernel hackers.  In the meantime, here is
a rebase for the new split-line configure.in, to turn cfbot green.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Add-kqueue-2-support-for-WaitEventSet-v12.patch
Description: Binary data


Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread David Fetter
On Tue, Oct 09, 2018 at 01:43:48PM -0700, Andres Freund wrote:
> 
> 
> On October 9, 2018 1:40:34 PM PDT, David Fetter wrote:
> >On Tue, Oct 09, 2018 at 12:31:19PM -0700, Andres Freund wrote:
> >> Hi,
> >> 
> >> On 2018-10-09 21:26:31 +0200, David Fetter wrote:
> >> > On Tue, Oct 09, 2018 at 12:22:37PM -0700, Andres Freund wrote:
> >> > > In-Reply-To: <20180928223240.kgwc4czzzekrp...@alap3.anarazel.de>
> >> > > As discussed below (at [1]), I think we should remove $subject.  I plan
> >> > > to do so, unless somebody protests soon-ish.  I thought it'd be better
> >> > > to call attention to this in a new thread, to make sure people had a
> >> > > chance to object.
> >> > 
> >> > How much time would someone have to convert the timetravel
> >> > piece of contrib/spi to use non-deprecated time types in order
> >> > to make this window?
> >> 
> >> "this window"?
> >> 
> >> It's not entirely trivial, but also not that hard. It'd break existing
> >> users however, as obviously their tables wouldn't dump / load or
> >> pg_upgrade into a working state.
> >> 
> >> But I think spi/timetravel is not something people can actually use / do
> >> use much, the functionality is way too limited in practice, the
> >> datatypes have been arcane for about as long as postgres existed,
> >> etc. And the code isn't fit to serve as an example.
> >> 
> >> In my opinion it has negative value at this point.
> >
> >I suppose the proposals to add the standard-conformant temporal
> >stuff would make this moot, but I don't recall a complete patch for
> >that.
> 
> spi/timetravel is just a trigger. It can be written in a few lines of
> plpgsql.  What functionality are you concerned about here?  Comparing it
> to actual temporal functionality doesn't strike me as meaningful.

Fair enough.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: pread() and pwrite()

2018-10-09 Thread Thomas Munro
On Tue, Oct 9, 2018 at 5:08 PM Tom Lane  wrote:
> I wrote:
> > Thomas Munro  writes:
> >> On Tue, Oct 9, 2018 at 2:55 PM Tom Lane  wrote:
> >>> Yeah, I've been burnt by that too recently.  It occurs to me we could make
> >>> that at least a little less painful if we formatted the macro with one
> >>> line per function name:
>
> >> +1, was about to suggest the same!
>
> > Sold, I'll go do it.
>
> Learned a few new things about M4 along the way :-( ... but done.
> You'll need to rebase the pread patch again of course.

Thanks, much nicer.  Rebased.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Use-pread-pwrite-instead-of-lseek-read-write-v7.patch
Description: Binary data


Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Andres Freund



On October 9, 2018 1:40:34 PM PDT, David Fetter  wrote:
>On Tue, Oct 09, 2018 at 12:31:19PM -0700, Andres Freund wrote:
>> Hi,
>> 
>> On 2018-10-09 21:26:31 +0200, David Fetter wrote:
>> > On Tue, Oct 09, 2018 at 12:22:37PM -0700, Andres Freund wrote:
>> > > In-Reply-To: <20180928223240.kgwc4czzzekrp...@alap3.anarazel.de>
>> > > As discussed below (at [1]), I think we should remove $subject.  I plan
>> > > to do so, unless somebody protests soon-ish.  I thought it'd be better
>> > > to call attention to this in a new thread, to make sure people had a
>> > > chance to object.
>> > 
>> > How much time would someone have to convert the timetravel piece of
>> > contrib/spi to use non-deprecated time types in order to make this
>> > window?
>> 
>> "this window"?
>> 
>> It's not entirely trivial, but also not that hard. It'd break existing
>> users however, as obviously their tables wouldn't dump / load or
>> pg_upgrade into a working state.
>> 
>> But I think spi/timetravel is not something people can actually use / do
>> use much, the functionality is way too limited in practice, the
>> datatypes have been arcane for about as long as postgres existed,
>> etc. And the code isn't fit to serve as an example.
>> 
>> In my opinion it has negative value at this point.
>
>I suppose the proposals to add the standard-conformant temporal stuff
>would make this moot, but I don't recall a complete patch for that.

spi/timetravel is just a trigger. It can be written in a few lines of plpgsql.
What functionality are you concerned about here?  Comparing it to actual
temporal functionality doesn't strike me as meaningful.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread David Fetter
On Tue, Oct 09, 2018 at 12:31:19PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-10-09 21:26:31 +0200, David Fetter wrote:
> > On Tue, Oct 09, 2018 at 12:22:37PM -0700, Andres Freund wrote:
> > > In-Reply-To: <20180928223240.kgwc4czzzekrp...@alap3.anarazel.de>
> > > As discussed below (at [1]), I think we should remove $subject.  I plan
> > > to do so, unless somebody protests soon-ish.  I thought it'd be better
> > > to call attention to this in a new thread, to make sure people had a
> > > chance to object.
> > 
> > How much time would someone have to convert the timetravel piece of
> > contrib/spi to use non-deprecated time types in order to make this
> > window?
> 
> "this window"?
> 
> It's not entirely trivial, but also not that hard. It'd break existing
> users however, as obviously their tables wouldn't dump / load or
> pg_upgrade into a working state.
> 
> But I think spi/timetravel is not something people can actually use / do
> use much, the functionality is way too limited in practice, the
> datatypes have been arcane for about as long as postgres existed,
> etc. And the code isn't fit to serve as an example.
> 
> In my opinion it has negative value at this point.

I suppose the proposals to add the standard-conformant temporal stuff
would make this moot, but I don't recall a complete patch for that.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Support custom socket directory in pg_upgrade

2018-10-09 Thread Daniel Gustafsson
> On 9 Oct 2018, at 16:22, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> Having hit the maximum socketdir length error a number of times in 
>> pg_upgrade,
>> especially when running tests in a deep directory hierarchy, I figured it was
>> time to see if anyone else has had the same problem?  The attached patch is
>> what I run with locally to avoid the issue, it adds a --socketdir=PATH option
>> to pg_upgrade which overrides the default use of CWD.  Is that something that
>> could be considered?
> 
> I think you could simplify matters if you installed the CWD default value
> during option processing.

The attached v2 tries to make the socketdir more like the other configurable
directories in pg_upgrade (adding an envvar for it etc).  Is that more in line
with what you were suggesting?  make -C src/bin/pg_upgrade check passes with
this, both unmodified and with a -s in the test script to override it.  Also
fixed incorrect syntax in the docs part from v1.
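
With the patch, a run from a deep working directory would look something
like this (paths illustrative):

    pg_upgrade -b "$OLDBIN" -B "$NEWBIN" -d "$OLDDATA" -D "$NEWDATA" \
        --socketdir=/tmp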

cheers ./daniel



pg_upgrade_sockdir-v2.patch
Description: Binary data





Re: Some incorrect comments and out-dated README from run-time pruning

2018-10-09 Thread David Rowley
On 10 October 2018 at 02:38, Peter Eisentraut wrote:
> - * subplan_map[] and subpart_map[] are indexed by partition index (where
> - * zero is the topmost partition, and non-leaf partitions must come before
> - * their children).  For a leaf partition p, subplan_map[p] contains the
> + * subplan_map[] and subpart_map[] are indexed by partition index (as defined
> + * in the PartitionDesc).  For a leaf partition p, subplan_map[p] contains the
>
> I don't see what someone reading this comment would do with "as defined
> in the PartitionDesc".  I don't see any PartitionDesc referenced or
> mentioned at or near that struct.

Perhaps it should have mentioned: the PartitionDesc belonging to the
partitioned table referenced by 'rtindex'.

I've attached another version.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


various_run-time_pruning_doc_fixes_v2.patch
Description: Binary data


Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2018-10-09 Thread Thomas Munro
On Fri, Oct 5, 2018 at 12:36 PM Thomas Munro wrote:
> *  We really should get rid of that "exited with exit code 1".

Robert and I just discussed this subproblem (the original complaint of
this thread) off-list.  Our questions are: does anyone actually want
that message from the postmaster in the log, and if not, shouldn't we
just do this?

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 41de140ae0..b34655b4bd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3193,8 +3193,7 @@ CleanupBackgroundWorker(int pid,
rw->rw_child_slot = 0;
ReportBackgroundWorkerExit();  /* report child death */

-   LogChildExit(EXIT_STATUS_0(exitstatus) ? DEBUG1 : LOG,
-namebuf, pid, exitstatus);
+   LogChildExit(DEBUG1, namebuf, pid, exitstatus);

return true;
}

As for the problem of the behaviour of bgworker processes themselves
on SIGTERM, let's discuss that separately in the other subthread[1]
(well, my mail client thinks it's a different thread, but the archives
think it's the same thread with a different subject).

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobwExL4kNj_eXJxPah_tVQ31N0cYDbUN0FFm6uaY_%2BX%3Dw%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com



Re: chained transactions

2018-10-09 Thread Peter Eisentraut
On 05/04/2018 10:35, Heikki Linnakangas wrote:
> On 15/03/18 18:39, Alvaro Herrera wrote:
>>>  From 517bc6d9fefdee9135857d9562f644f2984ace32 Mon Sep 17 00:00:00 2001
>>> From: Peter Eisentraut
>>> Date: Sun, 18 Feb 2018 09:33:53 -0500
>>> Subject: [PATCH v1 6/8] Turn transaction_isolation into GUC enum
>>>
>>> XXX no idea why it was done the way it was, but this seems much simpler
>>> and apparently doesn't change any functionality.
>> Enums are recent -- 52a8d4f8f7e2, only 10 years old, and the commit
>> didn't convert all cases, leaving some for later.  Funnily enough,
>> default_transaction_isolation was changed afterwards by ad6bf716baa7 but
>> not this one ... I guess "later" is now upon us for it.
> 
> With this patch, this stops working:
> 
> set transaction_isolation='default';
> 
> Other than that, +1 on this patch. I haven't looked closely at the rest 
> of the patches yet.

Based on this, I have committed this part of the patch series.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Andres Freund
Hi,

On 2018-10-09 21:26:31 +0200, David Fetter wrote:
> On Tue, Oct 09, 2018 at 12:22:37PM -0700, Andres Freund wrote:
> > In-Reply-To: <20180928223240.kgwc4czzzekrp...@alap3.anarazel.de>
> > As discussed below (at [1]), I think we should remove $subject.  I plan
> > to do so, unless somebody protests soon-ish.  I thought it'd be better
> > to call attention to this in a new thread, to make sure people had a
> > chance to object.
> 
> How much time would someone have to convert the timetravel piece of
> contrib/spi to use non-deprecated time types in order to make this
> window?

"this window"?

It's not entirely trivial, but also not that hard. It'd break existing
users however, as obviously their tables wouldn't dump / load or
pg_upgrade into a working state.

But I think spi/timetravel is not something people can actually use / do
use much, the functionality is way too limited in practice, the
datatypes have been arcane for about as long as postgres existed,
etc. And the code isn't fit to serve as an example.

In my opinion it has negative value at this point.

Greetings,

Andres Freund



Re: [HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread David Fetter
On Tue, Oct 09, 2018 at 12:22:37PM -0700, Andres Freund wrote:
> In-Reply-To: <20180928223240.kgwc4czzzekrp...@alap3.anarazel.de>
> 
> Hi,
> 
> As discussed below (at [1]), I think we should remove $subject.  I plan
> to do so, unless somebody protests soon-ish.  I thought it'd be better
> to call attention to this in a new thread, to make sure people had a
> chance to object.

How much time would someone have to convert the timetravel piece of
contrib/spi to use non-deprecated time types in order to make this
window?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



[HACKERS] removing abstime, reltime, tinterval.c, spi/timetravel

2018-10-09 Thread Andres Freund
In-Reply-To: <20180928223240.kgwc4czzzekrp...@alap3.anarazel.de>

Hi,

As discussed below (at [1]), I think we should remove $subject.  I plan
to do so, unless somebody protests soon-ish.  I thought it'd be better
to call attention to this in a new thread, to make sure people had a
chance to object.

Regards,

Andres

In "Something for the TODO list: deprecating abstime and friends"
On 2018-09-28 15:32:40 -0700, Andres Freund wrote:
> On 2017-12-13 00:05:06 -0800, Andres Freund wrote:
> > > * contrib/spi/timetravel depends on abstime columns to represent what
> > > would nowadays be better done as a tstzrange.  I'd have thought we
> > > could maybe toss that example module overboard, but we just today got
> > > a question about its usage, so I'm afraid it's not quite dead yet.
> > > What shall we do with that?
> > 
> > Looking at the code I'd be pretty strongly inclined to scrap it.
> > 
> > 
> > Before I'd discovered this thread, I'd started to write up a
> > patch. Attached. It's clearly not fully done. Questions I'd while
> > hacking things up:
> > - what to do with contrib/spi/timetravel - I'd just removed it from the
> >   relevant Makefile for now.
> > - nabstime.c currently implements timeofday() which imo is a pretty
> >   weird function. I'd be quite inclined to remove it at the same time as
> >   this.
> 
> Here's a refreshed version of this patch.  First patch removes
> contrib/spi/timetravel, second patch removes abstime, reltime, tinterval
> together with timeofday().
> 
> I think we should just go ahead and commit something like this soon.

[1] 
http://archives.postgresql.org/message-id/20180928223240.kgwc4czzzekrpsid%40alap3.anarazel.de



Re: pread() and pwrite()

2018-10-09 Thread Andrew Dunstan




On 10/09/2018 02:37 PM, Andres Freund wrote:

On 2018-10-09 14:32:29 -0400, Andrew Dunstan wrote:


On 10/08/2018 09:55 PM, Tom Lane wrote:

Thomas Munro  writes:

Rebased again.  Patches that touch AC_CHECK_FUNCS are fun like that!

Yeah, I've been burnt by that too recently.  It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:

AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])

You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.





By and large I think it's better not to submit patches with changes to
configure, but to let the committer run autoconf.
OTOH, this will probably confuse the heck out of the cfbot patch checker.

And make life harder for reviewers.

-1 on this one.




Maybe I'm thinking back to the time when we used to use a bunch of old 
versions of autoconf ...


cheers

andrew


--
Andrew Dunstan            https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pread() and pwrite()

2018-10-09 Thread Andres Freund
On 2018-10-09 14:32:29 -0400, Andrew Dunstan wrote:
> 
> 
> On 10/08/2018 09:55 PM, Tom Lane wrote:
> > Thomas Munro  writes:
> > > Rebased again.  Patches that touch AC_CHECK_FUNCS are fun like that!
> > Yeah, I've been burnt by that too recently.  It occurs to me we could make
> > that at least a little less painful if we formatted the macro with one
> > line per function name:
> > 
> > AC_CHECK_FUNCS([
> > cbrt
> > clock_gettime
> > fdatasync
> > ...
> > wcstombs_l
> > ])
> > 
> > You'd still get conflicts in configure itself, of course, but that
> > doesn't require manual work to resolve -- just re-run autoconf.
> > 
> > 
> 
> 
> 
> By and large I think it's better not to submit patches with changes to
> configure, but to let the committer run autoconf.

> OTOH, this will probably confuse the heck out of the cfbot patch checker.

And make life harder for reviewers.

-1 on this one.

Greetings,

Andres Freund



Re: executor relation handling

2018-10-09 Thread Tom Lane
Robert Haas  writes:
> On Sat, Oct 6, 2018 at 2:59 PM Tom Lane  wrote:
>> The reasons why we need locks on tables not physically accessed by the
>> query are (a) to ensure that we've blocked, or received sinval messages
>> for, any DDL related to views or partition parent tables, in case that
>> would invalidate the plan; (b) to allow firing triggers safely, in
>> the case of partition parent tables.  Neither of these issues apply to
>> a parallel worker -- the plan is already frozen before it can ever
>> start, and it isn't going to be firing any triggers either.

> That last part could *easily* change in a future release.  We've
> already started to allow CTAS with parallel query, and there have
> already been multiple people wanting to allow more.  It would be a
> shame if we threw up additional obstacles in the way of that...

I hardly think that this is the most serious issue in the way of
doing non-read-only things in parallel workers.

In any case, a parallel worker would surely have to open any
relations it is going to fire triggers for.  If it gets the correct
lock when it does that, all is well.  If not, the Assert in
relation_open will complain.

regards, tom lane



Re: pread() and pwrite()

2018-10-09 Thread Andrew Dunstan




On 10/08/2018 09:55 PM, Tom Lane wrote:

Thomas Munro  writes:

Rebased again.  Patches that touch AC_CHECK_FUNCS are fun like that!

Yeah, I've been burnt by that too recently.  It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:

AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])

You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.






By and large I think it's better not to submit patches with changes to 
configure, but to let the committer run autoconf.


You can avoid getting such changes in your patches by doing something 
like this:


    git config diff.nodiff.command /bin/true
    echo configure diff=nodiff >> .git/info/attributes

If you actually want to turn this off and see any diffs in configure, run

    git diff --no-ext-diff

It's also possible to supply a filter expression to 'git diff'.
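
For example (a hedged illustration, not from the original mail), a pathspec
exclusion leaves configure out of the diff entirely:

    git diff -- . ':!configure'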

OTOH, this will probably confuse the heck out of the cfbot patch checker.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Fwd: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

2018-10-09 Thread Robert Haas
On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro
 wrote:
> I still think the current situation is non-ideal.  I don't have a
> strong view on whether some or all system-wide processes should say
> hello and goodbye explicitly in the log, but I do think we need a way
> to make that not look like an error condition, and ideally without
> special cases for known built-in processes.
>
> I looked into this a bit today, while debugging an extension that runs
> a background worker.  Here are some assorted approaches to shutdown:
>
> 0.  The default SIGTERM handler for bgworkers is bgworker_die(), which
> generates a FATAL ereport "terminating background worker \"%s\" due to
> administrator command", directly in the signal handler (hmm, see also
> recent discussions about the legality of similar code in quickdie()).

Yeah, I think that code is bad news, for the same reasons discussed
with regard to quickdie().

> 1.  Some processes install their own custom SIGTERM handler that sets
> a flag, and use that to break out of their main loop and exit quietly.
> Example: autovacuum.c, or the open-source pg_cron extension, and
> probably many other things that just want a nice quiet shutdown.
>
> 2.  Some processes install the standard SIGTERM handler die(), and
> then use CHECK_FOR_INTERRUPTS() to break out of their main loop.  By
> default this looks like "FATAL:  terminating connection due to
> administrator command".  This approach is effectively required if the
> main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
> loop.  For example, a "launcher"-type process that calls
> WaitForBackgroundWorkerStartup() could hang forever if it used
> approach 1 (ask me how I know).

My experience with background workers has been that approach #1 does
not really work.  I mean, if you aren't doing anything complicated it
might be OK, but if for example you are executing SQL queries and you
try to do #1, then your SQL queries don't respond to interrupts.  And
that sucks.  So I have generally adopted approach #2.

To put that another way, nearly everything can reach
CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
is the case isn't much different from saying that it is required, full
stop.
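
For illustration, a minimal sketch of approach #1 (the worker name is made
up; the pattern follows src/test/modules/worker_spi rather than any code in
this thread):

#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/latch.h"

static volatile sig_atomic_t got_sigterm = false;

static void
my_sigterm_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	got_sigterm = true;			/* just set a flag ... */
	SetLatch(MyLatch);			/* ... and wake up the main loop */
	errno = save_errno;
}

void
my_worker_main(Datum main_arg)
{
	pqsignal(SIGTERM, my_sigterm_handler);	/* replaces bgworker_die() */
	BackgroundWorkerUnblockSignals();

	while (!got_sigterm)
	{
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
						 10 * 1000L, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		/* ... do one unit of work; nothing here reaches
		 * CHECK_FOR_INTERRUPTS(), which is exactly the limitation
		 * described above ... */
	}
	proc_exit(0);				/* quiet shutdown, no FATAL message */
}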

> 3.  Some system processes generally use approach 2, but have a special
> case in ProcessInterrupts() to suppress or alter the usual FATAL
> message or behaviour.  This includes the logical replication launcher.

Maybe we could replace this by a general-purpose hook.  So instead of
the various tests for process types that are there now, we would just
have

if (procdie_hook != NULL)
(*procdie_hook)();

And that hook can do whatever you like (but probably including dying).
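
A sketch of how that could be wired up (the hook name and surrounding code
are hypothetical, not a proposal taken from this mail):

/* hypothetical hook; a process type or extension sets this at startup */
typedef void (*procdie_hook_type) (void);
procdie_hook_type procdie_hook = NULL;

/* in ProcessInterrupts(), replacing the per-process-type special cases */
if (procdie_hook != NULL)
	(*procdie_hook) ();			/* may log at a different level, then exit */
else
	ereport(FATAL,
			(errcode(ERRCODE_ADMIN_SHUTDOWN),
			 errmsg("terminating connection due to administrator command")));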

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: executor relation handling

2018-10-09 Thread Robert Haas
On Sat, Oct 6, 2018 at 2:59 PM Tom Lane  wrote:
> The reasons why we need locks on tables not physically accessed by the
> query are (a) to ensure that we've blocked, or received sinval messages
> for, any DDL related to views or partition parent tables, in case that
> would invalidate the plan; (b) to allow firing triggers safely, in
> the case of partition parent tables.  Neither of these issues apply to
> a parallel worker -- the plan is already frozen before it can ever
> start, and it isn't going to be firing any triggers either.

That last part could *easily* change in a future release.  We've
already started to allow CTAS with parallel query, and there have
already been multiple people wanting to allow more.  It would be a
shame if we threw up additional obstacles in the way of that...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Index Skip Scan

2018-10-09 Thread Pavel Stehule
On Tue, 9 Oct 2018 at 16:13, Pavel Stehule wrote:

>
>
> On Tue, 9 Oct 2018 at 15:59, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
>> > On Tue, 9 Oct 2018 at 15:43, Pavel Stehule wrote:
>> >
>> > Hi
>> >
>> > I tested the last patch and I have some notes:
>> >
>> > 1.
>> >
>> > postgres=# explain select distinct a1 from foo;
>> > +-------------------------------------------------------------------------------------------+
>> > |                                         QUERY PLAN                                         |
>> > +-------------------------------------------------------------------------------------------+
>> > | Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
>> > |   ->  Index Skip Scan using foo_a1_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)      |
>> > +-------------------------------------------------------------------------------------------+
>> > (2 rows)
>> >
>> > In this case the Unique node is useless and can be removed
>>
>> Just to clarify, which version exactly were you testing? If
>> index-skip-fallback.patch, then the Unique node was added there to
>> address the situation when ndistinct is underestimated, with the idea
>> to fall back to the original plan (and to tolerate that I suggested to
>> use Unique, since we don't know whether the fallback will happen or not
>> during planning).
>>
>
> I tested index-skip-fallback.patch.
>
> It looks like a good idea, but then the node should be named "index scan"
> and the other info can be displayed in the detail parts. It can be
> similar to "sort".
>
> The combination of unique and index skip scan looks strange :)
>

maybe we don't need a special index skip scan node - maybe the possibility
to return unique values from the index scan node would be good enough -
something like "distinct index scan" - and the implementation there can be
dynamic: skip scan, classic index scan.

"index skip scan" is not a good name if the implementation is dynamic.


>
>> > 2. COUNT(DISTINCT) support would be nice, similar to the MIN, MAX support
>>
>> Yep, as far as I understand MIN/MAX is going to be the next step after
>> this patch is accepted.
>>
>
> ok
>
> Now that the development cycle is starting - maybe it can use the same
> infrastructure as MIN/MAX and this part can be short.
>
> even more so if you use a dynamic index scan
>
>
>> > 3. One time the patched postgres crashed, but I am not able to
>> > reproduce it.
>>
>> Maybe you have at least some ideas about what could cause that, or about
>> a possible way to reproduce it that doesn't work anymore?
>>
>
> I think it was a query like
>
> select count(*) from (select distinct x from tab) s
>
>
>


Re: Proposal for Signal Detection Refactoring

2018-10-09 Thread Andres Freund



On October 9, 2018 6:58:18 AM PDT, Peter Eisentraut 
 wrote:
>On 25/09/2018 02:23, Andres Freund wrote:
>>> My point was just to reduce the number of variables used and ease
>>> debugger lookups with what is on the stack.
>> I'm not sure a bitflag really gives you that - before gdb gives you
>the
>> plain value, afterwards you need to know the enum values and do bit
>math
>> to know.
>
>You could use an actual C bit field, to make debugging easier.

See Tom's reply to this suggestion...

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: TupleTableSlot abstraction

2018-10-09 Thread Amit Khandekar
On Wed, 26 Sep 2018 at 05:35, Andres Freund  wrote:
> > +
> > +/*
> > + * This is a function used by all getattr() callbacks which deal with a heap
> > + * tuple or some tuple format which can be represented as a heap tuple e.g. a
> > + * minimal tuple.
> > + *
> > + * heap_getattr considers any attnum beyond the attributes available in the
> > + * tuple as NULL. This function however returns the values of missing
> > + * attributes from the tuple descriptor in that case. Also this function does
> > + * not support extracting system attributes.
> > + *
> > + * If the attribute needs to be fetched from the tuple, the function fills in
> > + * tts_values and tts_isnull arrays upto the required attnum.
> > + */
> > +Datum
> > +tts_heap_getattr_common(TupleTableSlot *slot, HeapTuple tuple, uint32 *offp,
> > + int attnum, bool *isnull)
>
> I'm still *vehemently* opposed to the introduction of this.

You mean, you want to remove the att_isnull() optimization, right ?
Removed that code now. Directly deforming the tuple regardless of the
null attribute.

>
>
>
> > @@ -2024,7 +2024,18 @@ FormIndexDatum(IndexInfo *indexInfo,
> >   Datum   iDatum;
> >   bool    isNull;
> >
> > - if (keycol != 0)
> > + if (keycol < 0)
> > + {
> > + HeapTupleTableSlot *hslot = (HeapTupleTableSlot *) slot;
> > +
> > + /* Only heap tuples have system attributes. */
> > + Assert(TTS_IS_HEAPTUPLE(slot) || TTS_IS_BUFFERTUPLE(slot));
> > +
> > + iDatum = heap_getsysattr(hslot->tuple, keycol,
> > +  slot->tts_tupleDescriptor,
> > +  &isNull);
> > + }
> > + else if (keycol != 0)
> >   {
> >   /*
> >* Plain index column; get the value we need directly 
> > from the
>
> This now should access the system column via the slot, right?  There's
> other places like this IIRC.

Done. In FormIndexDatum() and ExecInterpExpr(), directly calling
slot_getsysattr() now.

In ExecInterpExpr (), I am no longer using ExecEvalSysVar() now. I am
planning to remove this definition since it would be a single line
function just calling slot_getsysattr().

In build_ExecEvalSysVar(), ExecEvalSysVar() is still used, so I
haven't removed the definition yet. I am planning to create a new
LLVMValueRef FuncSlotGetsysattr, and use that instead, in
build_ExecEvalSysVar(), or for that matter, I am thinking to revert
back build_ExecEvalSysVar() and instead have that code inline as in
HEAD.

>
>
>
> > diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
> > index 9d6e25a..1b4e726 100644
> > --- a/src/backend/executor/execExprInterp.c
> > +++ b/src/backend/executor/execExprInterp.c
> > @@ -490,54 +490,21 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
> >
> >   EEO_CASE(EEOP_INNER_SYSVAR)
> >   {
> > - int attnum = op->d.var.attnum;
> > - Datum   d;
> > -
> > - /* these asserts must match defenses in slot_getattr */
> > - Assert(innerslot->tts_tuple != NULL);
> > - Assert(innerslot->tts_tuple != &(innerslot->tts_minhdr));
> > -
> > - /* heap_getsysattr has sufficient defenses against bad attnums */
> > - d = heap_getsysattr(innerslot->tts_tuple, attnum,
> > - innerslot->tts_tupleDescriptor,
> > - op->resnull);
> > - *op->resvalue = d;
> > + ExecEvalSysVar(state, op, econtext, innerslot);
>
> These changes should be in a separate commit.

As mentioned above, now I am not using ExecEvalSysVar(), instead, I am
calling slot_getsysattr(). So no need of separate commit for that.

>
>
> > +const TupleTableSlotOps TTSOpsHeapTuple = {
> > + sizeof(HeapTupleTableSlot),
> > + .init = tts_heap_init,
>
> The first field should also use a named field (same in following cases).

Done.

>
>
> > @@ -185,6 +1020,7 @@ ExecResetTupleTable(List *tupleTable,    /* tuple table */
> >   {
> >   TupleTableSlot *slot = lfirst_node(TupleTableSlot, lc);
> >
> > + slot->tts_cb->release(slot);
> >   /* Always release resources and reset the slot to empty */
> >   ExecClearTuple(slot);
> >   if (slot->tts_tupleDescriptor)
> > @@ -240,6 +1076,7 @@ void
> >  ExecDropSingleTupleTableSlot(TupleTableSlot *slot)
> >  {
> >   /* This should match 

Re: executor relation handling

2018-10-09 Thread Amit Langote
On Tue, Oct 9, 2018 at 11:07 PM Tom Lane  wrote:
>
> Amit Langote  writes:
> > On 2018/10/08 3:55, Tom Lane wrote:
> >> I didn't like the idea of unifying ModifyTable.nominalRelation with
> >> the partition root info.  Those fields serve different masters ---
> >> nominalRelation, at least in its original intent, is only meant for
> >> use of EXPLAIN and might have nothing to do with what happens at
> >> execution.  So even though unifying them would work today, we might
> >> regret it down the line.  Instead I left that field alone and added
> >> a separate rootRelation field to carry the partition root RT index,
> >> which ends up being the same number of fields anyway since we don't
> >> need a flag for is-the-nominal-relation-a-partition-root.
>
> > Thanks for pushing that.  I'd also named it 'rootRelation' in my original
> > patch before David had objected to calling it that, because a command may
> > not specify the "actual" root of a partition tree; it could be a non-root
> > partitioned table.  He'd suggested 'partitionedTarget' for the new field
> > [1], stressing the "target" part.  Maybe, 'rootRelation' isn't too
> > confusing though.
>
> Well, it's the root so far as the current query is concerned --- we do not
> take any semantic account of partitioning levels that might exist above
> the table named in the query, do we?

We don't, and I personally agree with the reasoning behind calling it
rootRelation.

Thanks,
Amit



Re: TAP tests for pg_verify_checksums

2018-10-09 Thread Michael Banck
Hi,

On Tuesday, 2018-10-09 at 16:54 +0200, Peter Eisentraut wrote:
> On 06/10/2018 13:46, Michael Paquier wrote:
> > What do you think about the updated version attached?

> > +# Time to create a corruption

That looks a bit weird, maybe "some corruption"? Or maybe it's just me
not being a native speaker.

> Other than that, it appears to cover everything.

One more thing we could check is the relfilenode after we corrupted it;
it should also catch the corruption then. Or is that too trivial?
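
A hedged sketch of what that extra test could look like (the table name
and byte offset are made up; it uses the PostgresNode/TestLib helpers the
suite already relies on):

$node->safe_psql('postgres',
	'CREATE TABLE corrupt_me AS SELECT generate_series(1,100) AS a');
my $file = $node->safe_psql('postgres',
	q(SELECT pg_relation_filepath('corrupt_me')));
$node->stop;

# clobber a few bytes inside the first page of the relfilenode
open my $fh, '+<', $node->data_dir . "/$file" or die "open: $!";
sysseek $fh, 100, 0;
syswrite $fh, 'FOOBAR';
close $fh;

command_fails([ 'pg_verify_checksums', '-D', $node->data_dir ],
	'detects corruption in the relfilenode');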


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz



Re: TAP tests for pg_verify_checksums

2018-10-09 Thread Peter Eisentraut
On 06/10/2018 13:46, Michael Paquier wrote:
> On Fri, Oct 05, 2018 at 01:38:05PM +0200, Michael Banck wrote:
>> It's too late for v11 though at this point I guess?
> 
> Unfortunately yes.
> 
>> I think it would be easy to also test the -r command-line option, as we
>> already create a table.
> 
> Good idea.  Let's add this test.
> 
>> That comment should read 'that checksums are enabled', right?
> 
> Indeed.  Fixed.
> 
>> Otherwise, LGTM and I've tested it without finding any problems.
> 
> What do you think about the updated version attached?

Looks pretty good to me.

Let's make sure the test names are useful:

+# Checks cannot happen for an online cluster
+$node->start;
+command_fails(['pg_verify_checksums',  '-D', $pgdata],
+ "checksum checks not done");

The test name should be something like "fails with online cluster".

I would also like to see a test that runs against a cluster without
checksums enabled.

Other than that, it appears to cover everything.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: out-of-order XID insertion in KnownAssignedXids

2018-10-09 Thread Konstantin Knizhnik



On 09.10.2018 10:52, Michael Paquier wrote:

On Tue, Oct 09, 2018 at 02:59:00PM +0900, Michael Paquier wrote:

+1.  I am looking at how to make that possible.

And so...  Going through a bit of investigation, the problem is that each
2PC transaction's preparation finishes by putting the procarray in a state
where there are two entries referring to the same running transaction, as
MarkAsPrepared adds an extra entry on top of the one already
existing.  This is expected though per the comments in
ProcArrayClearTransaction(), as it assumes that MyProc can be safely
cleaned and it assumes that there is a duplicated entry.  Then,
GetRunningTransactionData() ignores the fact that those duplicate
entries can exist, which makes the snapshot data taken incorrect,
generating those incorrect XLOG_RUNNING_XACTS records.

The most simple fix I can think of first is to tweak the origin of the
problem in GetRunningTransactionData() so as those duplicate XIDs are
ignored if found as there has to be a state between the time
MarkAsPrepared() inserted the 2PC entry in the procarray and the time
ProcArrayClearTransaction() gets called.
--
Michael


Right now GetRunningTransactionData is used only once in the Postgres code - in
LogStandbySnapshot, to log a RUNNING_XACTS record.
So as long as this situation does not change, it makes no difference whether we
perform the sort and exclude duplicates in LogStandbySnapshot
or in GetRunningTransactionData.

But GetRunningTransactionData may be used in some other cases... For example I
have used it in my snapfs Postgres extension (fast file-level snapshots)
to check that there are no more active transactions in the system. In my case
the presence of duplicates is not important and ordering of XIDs is not needed.
Performance of GetRunningTransactionData is also not critical there. But I
suspect that there can be cases where it is critical and doing an unneeded
qsort is not desired.
There may be thousands of active transactions and sorting their XIDs is
not instantaneous.

Also please take into account that in most cases the XIDs in a RUNNING_XACTS
record are not processed at all (only if we perform recovery from a backup and
have not yet reached a consistent point). This is why this bug was not detected
before.

So I completely understand the argument that it is better to eliminate the
source of the problem (the presence of duplicates in the RUNNING_XACTS record),
but I am not sure that in this particular case it is really the best solution.
If the presence of duplicates is considered acceptable for the procarray, why
can we not accept it for the RUNNING_XACTS record?

I am not insisting that skipping duplicates in
ProcArrayApplyRecoveryInfo is the right place (but at least it seems to
be the most efficient alternative). But I also do not feel that moving the
sort to GetRunningTransactionData and eliminating duplicates there
(which requires one more scan and copying of the whole array) is
fundamentally better...
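
For what it's worth, the extra pass in question is short; a sketch of the
sort-then-unique approach (using the existing xidComparator; the array and
count names here are made up):

/* sort the XIDs, then squeeze out adjacent duplicates in one pass */
qsort(xids, nxids, sizeof(TransactionId), xidComparator);
if (nxids > 1)
{
	int		i,
			j = 0;

	for (i = 1; i < nxids; i++)
	{
		if (!TransactionIdEquals(xids[i], xids[j]))
			xids[++j] = xids[i];
	}
	nxids = j + 1;
}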



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Support custom socket directory in pg_upgrade

2018-10-09 Thread Tom Lane
Daniel Gustafsson  writes:
> Having hit the maximum socketdir length error a number of times in pg_upgrade,
> especially when running tests in a deep directory hierarchy, I figured it was
> time to see if anyone else has had the same problem?  The attached patch is
> what I run with locally to avoid the issue, it adds a --socketdir=PATH option
> to pg_upgrade which overrides the default use of CWD.  Is that something that
> could be considered?

I think you could simplify matters if you installed the CWD default value
during option processing.
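
Something along these lines, presumably (a sketch with hypothetical field
names, not the actual patch):

	/* after the getopt_long() loop: default the socket directory to CWD */
	if (user_opts.socketdir == NULL)
	{
		char	   *cwd = pg_malloc(MAXPGPATH);

		if (!getcwd(cwd, MAXPGPATH))
			pg_fatal("could not determine current directory\n");
		user_opts.socketdir = cwd;
	}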

regards, tom lane



Re: Add overflow test in function numeric_exp.

2018-10-09 Thread Tom Lane
Yang Xiao  writes:
> The attachment is the proposal patch for function numeric_exp in 
> src/backend/utils/adt/numeric.c.

Why do we need this?  numeric_exp() already detects result overflow.

regression=# select numeric_exp(1); 
ERROR:  value overflows numeric format

regards, tom lane



Re: [patch]overallocate memory for curly braces in array_out

2018-10-09 Thread Keiichi Hirobe
Sorry, I did not notice that you had already pushed it.
Thank you for the quick action.

Cheers,
Keiichi Hirobe

On Fri, 28 Sep 2018 at 09:23, Tom Lane wrote:

> Keiichi Hirobe  writes:
> > I am not sure whether to fix another bug, but I fixed and I attached a
> new
> > patch,
> > please check it.
>
> I think this duplicates what I already committed at
>
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=87d9bbca13f9c6b8f6ee986f0e399cb83bd731d4
>
> regards, tom lane
>


Re: Index Skip Scan

2018-10-09 Thread Pavel Stehule
On Tue, 9 Oct 2018 at 15:59, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Tue, 9 Oct 2018 at 15:43, Pavel Stehule wrote:
> >
> > Hi
> >
> > I tested the last patch and I have some notes:
> >
> > 1.
> >
> > postgres=# explain select distinct a1 from foo;
> > +-------------------------------------------------------------------------------------------+
> > |                                         QUERY PLAN                                         |
> > +-------------------------------------------------------------------------------------------+
> > | Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
> > |   ->  Index Skip Scan using foo_a1_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)      |
> > +-------------------------------------------------------------------------------------------+
> > (2 rows)
> >
> > In this case the Unique node is useless and can be removed
>
> Just to clarify, which version exactly were you testing? If
> index-skip-fallback.patch, then the Unique node was added there to
> address the situation when ndistinct is underestimated, with the idea
> to fall back to the original plan (and to tolerate that I suggested to
> use Unique, since we don't know whether the fallback will happen or not
> during planning).
>

I tested index-skip-fallback.patch.

It looks like a good idea, but then the node should be named "index scan" and
the other info can be displayed in the detail parts. It can be similar to "sort".

The combination of unique and index skip scan looks strange :)


> > 2. COUNT(DISTINCT) support would be nice, similar to the MIN, MAX support
>
> Yep, as far as I understand MIN/MAX is going to be the next step after this
> patch is accepted.
>

ok

Now that the development cycle is starting - maybe it can use the same
infrastructure as MIN/MAX and this part can be short.

even more so if you use a dynamic index scan


> > 3. One time the patched postgres crashed, but I am not able to reproduce it.
>
> Maybe you have at least some ideas about what could cause that, or about a
> possible way to reproduce it that doesn't work anymore?
>

I think it was a query like

select count(*) from (select distinct x from tab) s


Re: executor relation handling

2018-10-09 Thread Tom Lane
Amit Langote  writes:
> On 2018/10/08 3:55, Tom Lane wrote:
>> I didn't like the idea of unifying ModifyTable.nominalRelation with
>> the partition root info.  Those fields serve different masters ---
>> nominalRelation, at least in its original intent, is only meant for
>> use of EXPLAIN and might have nothing to do with what happens at
>> execution.  So even though unifying them would work today, we might
>> regret it down the line.  Instead I left that field alone and added
>> a separate rootRelation field to carry the partition root RT index,
>> which ends up being the same number of fields anyway since we don't
>> need a flag for is-the-nominal-relation-a-partition-root.

> Thanks for pushing that.  I'd also named it 'rootRelation' in my original
> patch before David had objected to calling it that, because a command may
> not specify the "actual" root of a partition tree; it could be a non-root
> partitioned table.  He'd suggested 'partitionedTarget' for the new field
> [1], stressing the "target" part.  Maybe, 'rootRelation' isn't too
> confusing though.

Well, it's the root so far as the current query is concerned --- we do not
take any semantic account of partitioning levels that might exist above
the table named in the query, do we?

regards, tom lane



Re: Race condition in create table

2018-10-09 Thread Tom Lane
Konstantin Knizhnik  writes:
> I wonder if it is considered to be expected behavior that concurrent 
> execution of "create table if not exists" may return errors:

No, there's not a guarantee that IF NOT EXISTS will work in such
scenarios.  As with everything else about that feature, it's a
quick-and-dirty, not very well designed effort.

regards, tom lane



Re: Proposal for Signal Detection Refactoring

2018-10-09 Thread Peter Eisentraut
On 01/10/2018 14:00, Chris Travers wrote:
> 
> On Wed, Sep 26, 2018 at 9:54 AM Chris Travers wrote:
> 
> On Tue, Sep 25, 2018 at 3:23 PM Tom Lane wrote:
> 
> Chris Travers writes:
> > However,  what I think one could do is use a struct of volatile
> > sig_atomic_t members and macros for checking/setting.  Simply
> writing a
> > value is safe in C89 and higher.
> 
> Yeah, we could group those flags in a struct, but what's the point?
> 
> 
> This was one of two things I noticed in my previous patch on
> interrupts and loops where I wasn't sure what the best practice in
> our code is.
> 
> If we don't want to make this change, then would there be any
> objection to me writing up a README describing the flags, and best
> practices in terms of checking them in our code based on the current
> places we use them?  If the current approach will be unlikely to
> change in the future, then at least we can document that the way I
> went about things is consistent with current best practices so next
> time someone doesn't really wonder.
> 
> 
> Attaching a first draft of a readme.  Feedback welcome.  I noticed
> further that we used to document signals and what they did with various
> processes but that this was removed after 7.0, probably due to
> complexity reasons.

diff --git a/src/backend/utils/init/README b/src/backend/utils/init/README
new file mode 100644
index 00..0472687f53
--- /dev/null
+++ b/src/backend/utils/init/README
@@ -0,0 +1,96 @@
+src/backend/utils/misc/README

Not only are the directory names mismatched here, but I wonder whether
this file belongs in either of these directories.

+Implementation Notes on Globals and Signal/Event Handling
+==========================================================
+
+The approach to signal handling in PostgreSQL is designed to strictly conform
+with the C89 standard and designed to run with as few platform assumptions as
+possible.

We use C99 now, so why would we design this to conform to C89?


More generally, I'd like this material to be code comments.  It's the
kind of stuff that gets outdated before long if it's kept separate.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Race condition in create table

2018-10-09 Thread Konstantin Knizhnik
I wonder if it is considered to be expected behavior that concurrent 
execution of "create table if not exists" may return errors:


knizhnik@knizhnik:~/dtm-data$ pgbench -n -c 8 -t 20 -f create_table.sql 
postgres
client 2 aborted in command 0 (SQL) of script 0; ERROR:  duplicate key 
value violates unique constraint "pg_type_typname_nsp_index"

DETAIL:  Key (typname, typnamespace)=(t2, 2200) already exists.

client 0 aborted in command 0 (SQL) of script 0; ERROR:  duplicate key 
value violates unique constraint "pg_type_typname_nsp_index"

DETAIL:  Key (typname, typnamespace)=(t2, 2200) already exists.

client 1 aborted in command 0 (SQL) of script 0; ERROR:  duplicate key 
value violates unique constraint "pg_type_typname_nsp_index"

DETAIL:  Key (typname, typnamespace)=(t2, 2200) already exists.

client 3 aborted in command 0 (SQL) of script 0; ERROR:  duplicate key 
value violates unique constraint "pg_type_typname_nsp_index"

DETAIL:  Key (typname, typnamespace)=(t2, 2200) already exists.

NOTICE:  relation "t2" already exists, skipping
client 5 aborted in command 0 (SQL) of script 0; ERROR:  duplicate key 
value violates unique constraint "pg_type_typname_nsp_index"

DETAIL:  Key (typname, typnamespace)=(t2, 2200) already exists.

client 6 aborted in command 0 (SQL) of script 0; ERROR:  duplicate key 
value violates unique constraint "pg_type_typname_nsp_index"

DETAIL:  Key (typname, typnamespace)=(t2, 2200) already exists.

NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
client 7 aborted in command 0 (SQL) of script 0; ERROR:  duplicate key 
value violates unique constraint "pg_type_typname_nsp_index"

DETAIL:  Key (typname, typnamespace)=(t2, 2200) already exists.

NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping
NOTICE:  relation "t2" already exists, skipping





For partitions behavior is slightly different:

knizhnik@knizhnik:~/dtm-data$ pgbench -n -c 8 -t 20 -f create_part.sql 
postgres

NOTICE:  relation "t_1" already exists, skipping
client 0 aborted in command 0 (SQL) of script 0; ERROR:  relation "t_1" 
already exists


NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
client 1 aborted in command 0 (SQL) of script 0; ERROR:  relation "t_1" 
already exists


NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
client 3 aborted in command 0 (SQL) of script 0; ERROR:  relation "t_1" 
already exists


NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
client 6 aborted in command 0 (SQL) of script 0; ERROR:  relation "t_1" 
already exists


NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
client 2 aborted in command 0 (SQL) of script 0; ERROR:  relation "t_1" 
already exists


NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
NOTICE:  relation "t_1" already exists, skipping
client 5 aborted in command 0 (SQL) of script 0; ERROR:  relation "t_1" 
already exists


client 4 aborted in command 0 (SQL) of script 0; ERROR:  relation "t_1" 
already exists




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



create_table.sql
Description: application/sql


create_part.sql
Description: application/sql
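
The attached scripts are not reproduced in the archive; judging from the
output above, they presumably contain something along these lines (a
reconstruction, not the actual attachments):

-- create_table.sql (hypothetical)
CREATE TABLE IF NOT EXISTS t2 (x integer);

-- create_part.sql (hypothetical)
CREATE TABLE IF NOT EXISTS t_1 PARTITION OF t FOR VALUES FROM (0) TO (100);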


Re: Index Skip Scan

2018-10-09 Thread Dmitry Dolgov
> On Tue, 9 Oct 2018 at 15:43, Pavel Stehule  wrote:
>
> Hi
>
> I tested the last patch and I have some notes:
>
> 1.
>
> postgres=# explain select distinct a1 from foo;
> +-------------------------------------------------------------------------------------------+
> |                                         QUERY PLAN                                         |
> +-------------------------------------------------------------------------------------------+
> | Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
> |   ->  Index Skip Scan using foo_a1_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)      |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
>
> In this case the Unique node is useless and can be removed

Just to clarify, which version exactly were you testing? If
index-skip-fallback.patch, then the Unique node was added there to address
the situation when ndistinct is underestimated, with the idea to fall back
to the original plan (and to tolerate that I suggested to use Unique, since
we don't know whether the fallback will happen or not during planning).

> 2. COUNT(DISTINCT) support would be nice, similar to the MIN, MAX support

Yep, as far as I understand MIN/MAX is going to be the next step after this
patch is accepted.

> 3. One time the patched postgres crashed, but I am not able to reproduce it.

Maybe you have at least some ideas about what could cause that, or about a
possible way to reproduce it that doesn't work anymore?



Re: Index Skip Scan

2018-10-09 Thread Jesper Pedersen

Hi Pavel,

On 10/9/18 9:42 AM, Pavel Stehule wrote:

I tested the last patch and I have some notes:

1.

postgres=# explain select distinct a1 from foo;
+-------------------------------------------------------------------------------------------+
|                                         QUERY PLAN                                         |
+-------------------------------------------------------------------------------------------+
| Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
|   ->  Index Skip Scan using foo_a1_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)      |
+-------------------------------------------------------------------------------------------+
(2 rows)

In this case the Unique node is useless and can be removed

2. COUNT(DISTINCT) support would be nice, similar to the MIN, MAX support

3. One time the patched postgres crashed, but I am not able to reproduce it.



Please, send that query through if you can replicate it. The patch 
currently passes an assert'ed check-world, so your query clearly 
triggered something that isn't covered yet.



Looks like a very interesting patch, and important for some BI platforms



Thanks for your review !

Best regards,
 Jesper



Re: Proposal for Signal Detection Refactoring

2018-10-09 Thread Peter Eisentraut
On 25/09/2018 02:23, Andres Freund wrote:
>> My point was just to reduce the number of variables used and ease
>> debugger lookups with what is on the stack.
> I'm not sure a bitflag really gives you that - before gdb gives you the
> plain value, afterwards you need to know the enum values and do bit math
> to know.

You could use an actual C bit field, to make debugging easier.
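
For illustration only (not code from the thread), such a grouping might
look like this -- though note the objection referenced elsewhere in the
thread that bit fields are not sig_atomic_t, so writing them from a signal
handler is not guaranteed to be safe:

typedef struct PendingSignals
{
	unsigned int interrupt_pending : 1;
	unsigned int query_cancel_pending : 1;
	unsigned int proc_die_pending : 1;
} PendingSignals;

static PendingSignals pending;		/* gdb prints the members symbolically */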

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Index Skip Scan

2018-10-09 Thread Pavel Stehule
Hi

I tested the last patch and I have some notes:

1. 

postgres=# explain select distinct a1 from foo;
+-------------------------------------------------------------------------------------------+
|                                         QUERY PLAN                                         |
+-------------------------------------------------------------------------------------------+
| Unique  (cost=0.43..4367.56 rows=9983 width=4)                                             |
|   ->  Index Skip Scan using foo_a1_idx on foo  (cost=0.43..4342.60 rows=9983 width=4)      |
+-------------------------------------------------------------------------------------------+
(2 rows)

In this case the Unique node is useless and can be removed

2. COUNT(DISTINCT) support would be nice, similar to the MIN, MAX support

3. One time the patched postgres crashed, but I am not able to reproduce it.

Looks like a very interesting patch, and important for some BI platforms

Re: Function for listing archive_status directory

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 10:12:17AM +0200, 'Christoph Moench-Tegeder' wrote:
> Attached is the updated patch. I made sure the function's OID hadn't been
> taken otherwise, and it compiles and works in a quick check.

Committed, after some slight adjustments.  Files in
pg_wal/archive_status/ normally have a size of 0, so this field does not
usually matter, but removing it complicates pg_ls_dir_files for no real
gain so I kept it.
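
For reference, usage ends up looking like this (assuming the committed
function kept the name pg_ls_archive_statusdir() and the usual
name/size/modification columns of the pg_ls_*dir family):

-- list the .ready/.done marker files in pg_wal/archive_status
SELECT name, size, modification
  FROM pg_ls_archive_statusdir()
 ORDER BY modification;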
--
Michael


signature.asc
Description: PGP signature


Re: Some incorrect comments and out-dated README from run-time pruning

2018-10-09 Thread Peter Eisentraut
On 27/09/2018 23:20, David Rowley wrote:
> I've noticed that the comments above the PartitionedRelPruneInfo
> struct incorrectly document how subplan_map and subpart_map are
> indexed. This seems to have snuck in on 4e232364033.

- * subplan_map[] and subpart_map[] are indexed by partition index (where
- * zero is the topmost partition, and non-leaf partitions must come before
- * their children).  For a leaf partition p, subplan_map[p] contains the
+ * subplan_map[] and subpart_map[] are indexed by partition index (as defined
+ * in the PartitionDesc).  For a leaf partition p, subplan_map[p] contains the

I don't see what someone reading this comment would do with "as defined
in the PartitionDesc".  I don't see any PartitionDesc referenced or
mentioned at or near that struct.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Support custom socket directory in pg_upgrade

2018-10-09 Thread Daniel Gustafsson
Having hit the maximum socketdir length error a number of times in pg_upgrade,
especially when running tests in a deep directory hierarchy, I figured it was
time to see if anyone else has had the same problem?  The attached patch is
what I run with locally to avoid the issue, it adds a --socketdir=PATH option
to pg_upgrade which overrides the default use of CWD.  Is that something that
could be considered?

cheers ./daniel



pg_upgrade_sockdir.patch
Description: Binary data


Re: Refactor textToQualifiedNameList()

2018-10-09 Thread Alvaro Herrera
On 2018-Oct-09, Michael Paquier wrote:

> On Tue, Oct 09, 2018 at 10:47:48AM +0200, Pavel Stehule wrote:
> > The difference on 10M calls is about 300ms - it is about 6%.
> 
> This number gives a good argument for rejecting this patch.  I am not
> usually against code beautification, but that's a high price to pay for
> just some refactoring.  On top of that, this potentially breaks
> extension compilation.

One thing I do like about this patch is that it takes
stringToQualifiedNameList out of regproc.c, where it was put as static
by commit 52200befd04b ("Implement types regprocedure, regoper,
regoperator, regclass, regtype" April 2002); made extern by ba790a5608ea
("Here is a patch for Composite and Set returning function support."
June 2002) in what appears to have been a careless change.  The function
would end up in a place where it more reasonably belongs,
varlena.c, next to its sibling textToQualifiedNameList.

The committer of such a change will get a lot of flak for changing the
#include requirements for code that calls that function, though.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: NOTIFY and pg_notify performance when deduplicating notifications

2018-10-09 Thread julien
> Indeed, I have the same and am very interested in this.
> 
> > I hope this patch can be reviewed and included in PostgreSQL.
> 
> I added this to the next Commitfest and added myself as a reviewer.
> Will try to do a review at the beginning of next week.
> https://commitfest.postgresql.org/20/1820/

Thank you for reviewing. 

I just caught an error in my patch, it's fixed in the attachment. The 'never' 
and 'maybe' collapse modes were mixed up in one location.

I can't find a reasonable way to build a regression test that checks whether 
notifications are effectively deduplicated. The output of the LISTEN command 
lists the PID of the notifying backend for each notification, e.g.: 
'Asynchronous notification "foobar" received from server process with PID 
24917'. I can't just add this to async.out. I did test manually for all eight 
combinations: four collapse mode values (missing, empty string, 'maybe' and 
'never'), both with NOTIFY and pg_notify().
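
For reference, the deduplication itself is easy to see interactively even
though it cannot be captured in async.out; a sketch using default settings
(the collapse-mode argument is what the patch adds on top of this):

LISTEN foobar;
BEGIN;
NOTIFY foobar, 'payload';
NOTIFY foobar, 'payload';   -- duplicate within the transaction, collapsed
COMMIT;
-- psql then prints a single line of the form:
-- Asynchronous notification "foobar" with payload "payload" received from
-- server process with PID 24917.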


postgres-notify-all-v7.patch
Description: Binary data


Re: partition tree inspection functions

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 08:17:20PM +0900, Michael Paquier wrote:
> isleaf is of course wrong if the input is a partitioned index, so more
> regression tests to cover those cases would be nice.

Also, the call to find_all_inheritors needs AccessShareLock...  NoLock
is not secure.
--
Michael


signature.asc
Description: PGP signature


Re: Refactor textToQualifiedNameList()

2018-10-09 Thread Pavel Stehule
On Tue, 9 Oct 2018 at 13:20, Michael Paquier wrote:

> On Tue, Oct 09, 2018 at 10:47:48AM +0200, Pavel Stehule wrote:
> > The difference on 10M calls is about 300ms - it is about 6%.
>
> This number gives a good argument for rejecting this patch.  I am not
> usually against code beautification, but that's a high price to pay for
> just some refactoring.  On top of that, this potentially breaks
> extension compilation.
>

ok

Regards

Pavel

--
> Michael
>


Re: Refactor textToQualifiedNameList()

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 10:47:48AM +0200, Pavel Stehule wrote:
> The difference on 10M calls is about 300ms - it is about 6%.

This number gives a good argument for rejecting this patch.  I am not
usually against code beautification, but that's a high price to pay for
just some refactoring.  On top of that, this potentially breaks
extension compilation.
--
Michael


signature.asc
Description: PGP signature


Re: partition tree inspection functions

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 07:20:40PM +0900, Amit Langote wrote:
> Sorry if I'm misunderstanding something, but why would we need a new
> clone?  If we don't change pg_partition_tree() to only accept tables
> (regular/partitioned/foreign tables) as input, then the same code can work
> for indexes as well.  As long as we store partition relationship in
> pg_inherits, same code should work for both.

Okay, I see what you are getting at.  However, please note that even if I
can see the dependencies in pg_inherits for indexes, the patch still
needs some work, I am afraid, if your intention is to do so:
CREATE TABLE ptif_test (a int, b int) PARTITION BY range (a);
create index ptif_test_i on ptif_test (a);
CREATE TABLE ptif_test0 PARTITION OF ptif_test
  FOR VALUES FROM (minvalue) TO (0) PARTITION BY list (b);

=# select relid::regclass, parentrelid::regclass from
     pg_partition_tree('ptif_test'::regclass);
   relid    | parentrelid
------------+-------------
 ptif_test  | null
 ptif_test0 | ptif_test
(2 rows)

=# select relid::regclass, parentrelid::regclass from
     pg_partition_tree('ptif_test_i'::regclass);
    relid    | parentrelid
-------------+-------------
 ptif_test_i | null
(1 row)

In this case I would have expected ptif_test0_a_idx to be listed, with
ptif_test_i as a parent.

isleaf is of course wrong if the input is a partitioned index, so more
regression tests to cover those cases would be nice.
--
Michael


signature.asc
Description: PGP signature


Re: partition tree inspection functions

2018-10-09 Thread Amit Langote
On 2018/10/09 19:05, Michael Paquier wrote:
> On Tue, Oct 09, 2018 at 06:41:59PM +0900, Amit Langote wrote:
>> Partitioned indexes, just like partitioned tables, don't have their own
>> storage, so pg_relation_size() cannot be used to obtain their size.  We
>> decided that the correct way to get the partitioned table's size is *not*
>> to modify pg_relation_size itself to internally find all partitions and
>> sum their sizes, but to provide a function that returns partitions and
>> tell users to write SQL queries around it to calculate the total size.
>> I'm saying that the new function should be able to be used with
>> partitioned indexes too.
> 
> I have to admit that I find the concept non-intuitive.  A partition tree
> is a set of partitions based on a root called a partitioned table.  A
> partitioned index is not a partition, it is a specific object which is
> associated to a partitioned table without any physical on-disk presence.
An index is not a partition either; it is an object associated with one
relation.

I see it as follows: a partitioned index *does* have partitions and the
partitions in that case are index objects, whereas a partitioned table's
partitions are table objects.  IOW, I see the word "partition" as
describing a relationship, not any specific object or kind of objects.

> I am not saying that there is no merit in that.  I honestly don't know,
> but being able to list all the partitioned tables and their partition
> tables looks enough to cover all the grounds discussed, and there is no
> need to work out more details for a new clone of find_all_inheritors and
> get_partition_ancestors.

Sorry if I'm misunderstanding something, but why would we need a new
clone?  If we don't change pg_partition_tree() to only accept tables
(regular/partitioned/foreign tables) as input, then the same code can work
for indexes as well.  As long as we store partition relationship in
pg_inherits, same code should work for both.

Thanks,
Amit




Re: pgbench exit code

2018-10-09 Thread Peter Eisentraut
On 10/08/2018 01:41, Fabien COELHO wrote:
> Your approach of not changing the output too much but changing the exit 
> status and adding a warning may get through more easily.
> 
> Note that there is some logic in distinguishing between different types of 
> errors (before the bench has started vs. after it has started), so I'd 
> suggest that the exit statuses should be different.

done

> The abort is by a client, but the code seems to only check the first 
> client of a thread. ISTM that if the second or a later client aborts, it 
> may not be detected? Probably an intermediate aggregation at the thread 
> level is needed, or maybe a global variable, or as errors are counted 
> somewhere, it may be enough just to check that the count is non-zero?

fixed

>   -   [ $status ? qr{^$} : qr{processed: 0/1} ],
>   +   [],
> 
> The patch removes a check that there was an output and that no 
> transactions were processed. ISTM it should be kept. If a different exit 
> status is chosen on abort, that would allow keeping it easily.

fixed

> Probably there should be some documentation changes as well?

done

New patched attached.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 1cab742954a625134fa1b32d5e008c32d7a9e7ee Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 9 Oct 2018 12:04:04 +0200
Subject: [PATCH v2] pgbench: Report errors during run better

When an error occurs during a benchmark run, exit with a nonzero exit
code and write a message at the end.  Previously, it would just print
the error message when it happened but then proceed to print the run
summary normally and exit with status 0.  To still allow
distinguishing setup from run-time errors, we use exit status 2 for
the new state, whereas existing errors during pgbench initialization
use exit status 1.
---
 doc/src/sgml/ref/pgbench.sgml| 11 
 src/bin/pgbench/pgbench.c| 11 +++-
 src/bin/pgbench/t/001_pgbench_with_server.pl | 57 ++--
 3 files changed, 50 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 8c464c9d42..03d92ff451 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -804,6 +804,17 @@ Common Options
  
  
 
+ 
+  Exit Status
+
+  
+   A successful run will exit with status 0.  Exit status 1 indicates problems
+   such as invalid command-line options.  Errors during the run such as
+   database errors or problems in the script result in exit status 2.  In the
+   latter case, pgbench will print partial results.
+  
+ 
+
  
   Notes
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 436764b9c9..81bc6d8a6e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4931,6 +4931,8 @@ main(int argc, char **argv)
PGresult   *res;
char   *env;
 
+   int exit_code = 0;
+
progname = get_progname(argv[0]);
 
if (argc > 1)
@@ -5675,6 +5677,10 @@ main(int argc, char **argv)
(void) threadRun(thread);
 #endif /* ENABLE_THREAD_SAFETY 
*/
 
+   for (int j = 0; j < thread->nstate; j++)
+   if (thread->state[j].state == CSTATE_ABORTED)
+   exit_code = 2;
+
/* aggregate thread level stats */
mergeSimpleStats(&stats.latency, &thread->stats.latency);
mergeSimpleStats(&stats.lag, &thread->stats.lag);
@@ -5699,7 +5705,10 @@ main(int argc, char **argv)
INSTR_TIME_SUBTRACT(total_time, start_time);
printResults(threads, &stats, total_time, conn_total_time, latency_late);
 
-   return 0;
+   if (exit_code != 0)
+   fprintf(stderr, "Run was aborted; the above results are incomplete.\n");
+
+   return exit_code;
 }
 
 static void *
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index d972903f2a..0081989026 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -537,7 +537,7 @@ sub pgbench
# SQL
[
'sql syntax error',
-   0,
+   2,
[
qr{ERROR:  syntax error},
qr{prepared statement .* does not exist}
@@ -556,11 +556,11 @@ sub pgbench
 
# SHELL
[
-   'shell bad command',0,
+   'shell bad command',2,
[qr{\(shell\) .* meta-command failed}], q{\shell no-such-command}
],
[
-   'shell undefined variable', 0,
+   'shell undefined variable', 2,
[qr{undefined variable ":nosuchvariable"}],
q{-- undefined variable in shell
 \shell echo ::foo :nosuchvariable
@@ -600,75 

Re: partition tree inspection functions

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 06:41:59PM +0900, Amit Langote wrote:
> Partitioned indexes, just like partitioned tables, don't have their own
> storage, so pg_relation_size() cannot be used to obtain their size.  We
> decided that the correct way to get the partitioned table's size is *not*
> to modify pg_relation_size itself to internally find all partitions and
> sum their sizes, but to provide a function that returns partitions and
> tell users to write SQL queries around it to calculate the total size.
> I'm saying that the new function should be able to be used with
> partitioned indexes too.

I have to admit that I find the concept non-intuitive.  A partition tree
is a set of partitions based on a root called a partitioned table.  A
partitioned index is not a partition, it is a specific object which is
associated to a partitioned table without any physical on-disk presence.
An index is not a partition as well, it is an object associated to one
relation.

I am not saying that there is no merit in that.  I honestly don't know,
but being able to list all the partitioned tables and their partition
tables looks enough to cover all the grounds discussed, and there is no
need to work out more details for a new clone of find_all_inheritors and
get_partition_ancestors.
--
Michael


signature.asc
Description: PGP signature


Re: partition tree inspection functions

2018-10-09 Thread Amit Langote
On 2018/10/09 18:10, Michael Paquier wrote:
> On Tue, Oct 09, 2018 at 05:11:59PM +0900, Amit Langote wrote:
>> Hmm, how would one find the size of a partitioned index tree if we don't
>> allow indexes to be passed?
> 
> pg_total_relation_size() and pg_indexes_size() are designed for that
> purpose.  Anyway, the elements of a partition tree are things which can
> be directly attached to it, like a table or a foreign table, and not
> objects which are directly dependent on a given table, like indexes or
> toast tables.  So we have to add some filtering on the relkind.
> Indexes, partitioned indexes and sequences are three things which cannot
> be added as partitions.  But I think that we had better just make sure
> that the filtering allows RELKIND_RELATION, RELKIND_PARTITIONED_TABLE
> and RELKIND_FOREIGN_TABLE relations only.

I think the way I phrased my question may have been a bit confusing.  I
was pointing out that just like tables, indexes can now also form their
own partition tree.

Partitioned indexes, just like partitioned tables, don't have their own
storage, so pg_relation_size() cannot be used to obtain their size.  We
decided that the correct way to get the partitioned table's size is *not*
to modify pg_relation_size itself to internally find all partitions and
sum their sizes, but to provide a function that returns partitions and
tell users to write SQL queries around it to calculate the total size.
I'm saying that the new function should be able to be used with
partitioned indexes too.
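
For example, the kind of query users are expected to write over the new
function (a sketch based on the pg_partition_tree() output shown upthread):

-- total size of a partition tree, summing the members that have storage
SELECT pg_size_pretty(sum(pg_relation_size(relid))) AS total_size
  FROM pg_partition_tree('ptif_test'::regclass);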

Thanks,
Amit




[ECPG] fix functions in dt_common to correctly detect integer overflow

2018-10-09 Thread Yang Xiao
Hi,

The attachment is the proposal patch for dt_common.c.

Replace strtol() with strtoint(): strtol() will happily return a 64-bit
result in that case, while the types of the target variables (val, hr,
tm_min, ...) are int, so the overflow goes undetected.
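
A standalone illustration of the failure mode (assuming an LP64 platform,
where long is 64 bits and int is 32):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	/* 2^31 overflows a 32-bit int but fits easily in a 64-bit long */
	long		l = strtol("2147483648", NULL, 10);
	int			i = (int) l;	/* silently truncated, typically to INT_MIN */

	printf("long: %ld  int: %d\n", l, i);
	return 0;
}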

Young

From fa744e6b2b2601a91080a016f480d443cd1dbf2d Mon Sep 17 00:00:00 2001
From: Young_X 
Date: Tue, 9 Oct 2018 11:27:51 +0800
Subject: [PATCH] [ECPG] fix functions in dt_common to correctly detect
 integer overflow when running on a 64-bit platform ... strtol() will
 happily return 64-bit output in that case, while type of var, hr, tm_min
 is int.

Signed-off-by: Young_X 
---
 src/interfaces/ecpg/pgtypeslib/dt_common.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/src/interfaces/ecpg/pgtypeslib/dt_common.c b/src/interfaces/ecpg/pgtypeslib/dt_common.c
index 568b172..75e6e6b 100644
--- a/src/interfaces/ecpg/pgtypeslib/dt_common.c
+++ b/src/interfaces/ecpg/pgtypeslib/dt_common.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 
+#include "common/string.h"
 #include "extern.h"
 #include "dt.h"
 #include "pgtypes_timestamp.h"
@@ -1115,7 +1116,7 @@ DecodeNumberField(int len, char *str, int fmask,
for (i = 0; i < 6; i++)
fstr[i] = *cp != '\0' ? *cp++ : '0';
fstr[i] = '\0';
-   *fsec = strtol(fstr, NULL, 10);
+   *fsec = strtoint(fstr, NULL, 10);
*cp = '\0';
len = strlen(str);
}
@@ -1206,7 +1207,7 @@ DecodeNumber(int flen, char *str, int fmask,
 
*tmask = 0;
 
-   val = strtol(str, &cp, 10);
+   val = strtoint(str, &cp, 10);
if (cp == str)
return -1;
 
@@ -1442,11 +1443,11 @@ DecodeTime(char *str, int *tmask, struct tm *tm, fsec_t *fsec)
 
*tmask = DTK_TIME_M;
 
-   tm->tm_hour = strtol(str, &cp, 10);
+   tm->tm_hour = strtoint(str, &cp, 10);
 if (*cp != ':')
 return -1;
 str = cp + 1;
-   tm->tm_min = strtol(str, &cp, 10);
+   tm->tm_min = strtoint(str, &cp, 10);
if (*cp == '\0')
{
tm->tm_sec = 0;
@@ -1457,7 +1458,7 @@ DecodeTime(char *str, int *tmask, struct tm *tm, fsec_t *fsec)
else
{
str = cp + 1;
-   tm->tm_sec = strtol(str, &cp, 10);
+   tm->tm_sec = strtoint(str, &cp, 10);
if (*cp == '\0')
*fsec = 0;
else if (*cp == '.')
@@ -1478,7 +1479,7 @@ DecodeTime(char *str, int *tmask, struct tm *tm, fsec_t *fsec)
for (i = 0; i < 6; i++)
fstr[i] = *cp != '\0' ? *cp++ : '0';
fstr[i] = '\0';
-   *fsec = strtol(fstr, &cp, 10);
+   *fsec = strtoint(fstr, &cp, 10);
if (*cp != '\0')
return -1;
}
@@ -1510,20 +1511,20 @@ DecodeTimezone(char *str, int *tzp)
int len;
 
/* assume leading character is "+" or "-" */
-   hr = strtol(str + 1, &cp, 10);
+   hr = strtoint(str + 1, &cp, 10);
 
/* explicit delimiter? */
if (*cp == ':')
-   min = strtol(cp + 1, &cp, 10);
+   min = strtoint(cp + 1, &cp, 10);
/* otherwise, might have run things together... */
else if (*cp == '\0' && (len = strlen(str)) > 3)
{
-   min = strtol(str + len - 2, &cp, 10);
+   min = strtoint(str + len - 2, &cp, 10);
if (min < 0 || min >= 60)
return -1;
 
*(str + len - 2) = '\0';
-   hr = strtol(str + 1, &cp, 10);
+   hr = strtoint(str + 1, &cp, 10);
if (hr < 0 || hr > 13)
return -1;
}
@@ -1830,7 +1831,7 @@ DecodeDateTime(char **field, int *ftype, int nf,
if (tzp == NULL)
return -1;
 
-   val = strtol(field[i], &cp, 10);
+   val = strtoint(field[i], &cp, 10);
if (*cp != '-')
return -1;
 
@@ -1965,7 +1966,7 @@ DecodeDateTime(char **field, int *ftype, int nf,
char   *cp;
int val;
 
-   val = strtol(field[i], &cp, 10);
+   val = strtoint(field[i], &cp, 10);
 
/*
 * only a few kinds are allowed to have an embedded
-- 
2.7.4



Add overflow test in function numeric_exp.

2018-10-09 Thread Yang Xiao
Hi,

The attachment is the proposal patch for function numeric_exp in 
src/backend/utils/adt/numeric.c.


Young
From 0456192bbe03428247b9f55b261b24b4b890c680 Mon Sep 17 00:00:00 2001
From: Young_X 
Date: Tue, 9 Oct 2018 10:59:22 +0800
Subject: [PATCH] Add overflow test in function numeric_exp. (see
 commit 18a02ad2a506e4425c6dd2ea235039cd5659467d)

Signed-off-by: Young_X 
---
 src/backend/utils/adt/numeric.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 444e575..d8c5f54 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -2844,15 +2844,17 @@ numeric_exp(PG_FUNCTION_ARGS)
/* convert input to float8, ignoring overflow */
val = numericvar_to_double_no_overflow();
 
+   /* initial overflow test with fuzz factor */
+   if (Abs(val) > NUMERIC_MAX_RESULT_SCALE * 3.01)
+   ereport(ERROR,
+   (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+   errmsg("value overflows numeric format")));
+
/*
 * log10(result) = num * log10(e), so this is approximately the decimal
 * weight of the result:
 */
-   val *= 0.434294481903252;
-
-   /* limit to something that won't cause integer overflow */
-   val = Max(val, -NUMERIC_MAX_RESULT_SCALE);
-   val = Min(val, NUMERIC_MAX_RESULT_SCALE);
+   val *= 0.434294481903252;   /* approximate decimal result weight */
 
rscale = NUMERIC_MIN_SIG_DIGITS - (int) val;
rscale = Max(rscale, arg.dscale);
-- 
2.7.4
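
To see roughly where the new early test fires: log10(exp(x)) = x * log10(e),
so exp(x) has about 0.4343 * x decimal digits before the decimal point.  A
toy program illustrating the cutoff, assuming NUMERIC_MAX_RESULT_SCALE is
2000 as in numeric.c; everything else below is illustrative:

#include <math.h>
#include <stdio.h>

#define NUMERIC_MAX_RESULT_SCALE 2000	/* assumed, as in numeric.c */

int
main(void)
{
	const double log10_e = 0.434294481903252;	/* constant used by the patch */
	const double xs[] = {10.0, 700.0, 6000.0, 6100.0};

	for (int i = 0; i < 4; i++)
	{
		double		x = xs[i];

		/* decimal weight of exp(x) is approximately x * log10(e) */
		printf("exp(%6.0f): ~%5.0f digits before the point%s\n",
			   x, x * log10_e,
			   fabs(x) > NUMERIC_MAX_RESULT_SCALE * 3.01
			   ? "  -> now raises an overflow error" : "");
	}
	return 0;
}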



Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2018-10-09 Thread Kyotaro HORIGUCHI
At Fri, 5 Oct 2018 15:35:04 +0900, Michael Paquier  wrote 
in <20181005063504.gb14...@paquier.xyz>
> On Fri, Oct 05, 2018 at 12:16:03PM +0900, Michael Paquier wrote:
> > So, I have come back to this stuff, and finished with the attached
> > instead, so as the assertion is in a single place.  I find that
> > clearer.  The comments have also been improved.  Thoughts?
> 
> And so...  I have been looking at committing this thing, and while
> testing in-depth I have been able to trigger a case where an autovacuum
> has been able to be not aggressive but anti-wraparound, which is exactly
> what should not be possible, no?  I have simply created an instance with
> autovacuum_freeze_max_age = 20, then ran pgbench with
> autovacuum_freeze_table_age=20 set for each table, and also ran
> installcheck-world in parallel.  This has been able to trigger the
> assertion pretty quickly.

I investigated it, and in short, it can happen.

It is a kind of race condition between two autovacuum
processes. do_autovacuum() looks into pg_class (using a snapshot)
and vacuum_set_xid_limits() looks into the relcache. If two
vacuums run concurrently and one has finished the relation, the
other receives a relcache invalidation and relfrozenxid is
updated. If this happens between do_autovacuum() and
vacuum_set_xid_limits(), the latter sees a newer relfrozenxid
than the former did. The problem appears when relfrozenxid has
moved forward by more than 5% of autovacuum_freeze_max_age.

If lazy_vacuum_rel() sees this situation, the relation has
already been aggressively vacuumed by a concurrent worker. We can
safely ignore the state, but we also know that the vacuum is
useless.

1. Just allow the case (and add a comment).
   This causes a redundant anti-wraparound vacuum.

2. Skip the relation based on the condition.

   I think that we can safely skip the relation in this
   case. (attached)

3. Ensure that do_autovacuum always sees the same relfrozenxid
   with vacuum_set_xid_limits().

4. Rigorously prevent concurrent vacuuming of the same relation,
   somehow.

Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..436d573454 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -249,6 +249,21 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	if (options & VACOPT_DISABLE_PAGE_SKIPPING)
 		aggressive = true;
 
+	/*
+	 * When is_wraparound is set, relfrozenxid was old enough that aggressive
+	 * must be true here. However, if another concurrent vacuum process has
+	 * already processed this relation, the relfrozenxid in the relcache can
+	 * have moved forward enough that aggressive is computed as false. This
+	 * relation no longer needs to be vacuumed; bail out.
+	 */
+	if (params->is_wraparound && !aggressive)
+	{
+		elog(DEBUG1, "relation %u has been vacuumed concurrently, skipping",
+			 onerel->rd_id);
+		return;
+	}
+
+
 	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
 	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;


Re: partition tree inspection functions

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 05:11:59PM +0900, Amit Langote wrote:
> Hmm, how would one find the size of a partitioned index tree if we don't
> allow indexes to be passed?

pg_total_relation_size() and pg_indexes_size() are designed for that
purpose.  Anyway, the elements of a partition tree are things which can
be directly attached to it, like a table or a foreign table, and not
objects which are directly dependent on a given table, like indexes or
toast tables.  So we have to add some filtering on the relkind.
Indexes, partitioned indexes and sequences are three things which cannot
be added as partitions.  But I think that we had better just make sure
that the filtering allows RELKIND_RELATION, RELKIND_PARTITIONED_TABLE
and RELKIND_FOREIGN_TABLE relations only.
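
A standalone sketch of that filter, using the relkind letters stored in
pg_class.relkind; the helper name is made up for illustration:

#include <stdbool.h>
#include <stdio.h>

#define RELKIND_RELATION			'r'
#define RELKIND_PARTITIONED_TABLE	'p'
#define RELKIND_FOREIGN_TABLE		'f'
#define RELKIND_INDEX				'i'

static bool
relkind_can_be_partition(char relkind)
{
	/* only plain, partitioned, and foreign tables can be partitions */
	return relkind == RELKIND_RELATION ||
		relkind == RELKIND_PARTITIONED_TABLE ||
		relkind == RELKIND_FOREIGN_TABLE;
}

int
main(void)
{
	printf("index: %s\n",
		   relkind_can_be_partition(RELKIND_INDEX) ? "allowed" : "filtered out");
	printf("plain table: %s\n",
		   relkind_can_be_partition(RELKIND_RELATION) ? "allowed" : "filtered out");
	return 0;
}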

> Maybe we should apply the same rules as are used when describing a table or
> when getting its metadata via other means (pg_relation_size, etc.)?
> AFAICS, no checks are performed in that case:
>
> Should we do anything different in this case?

Yeah, perhaps we could live without any restriction.  There would still be
time to tweak that behavior in the v12 development cycle if there are any
complaints.
--
Michael




Re: partition tree inspection functions

2018-10-09 Thread Amit Langote
On 2018/10/06 15:26, Michael Paquier wrote:
> On Fri, Oct 05, 2018 at 08:22:49AM -0400, Jesper Pedersen wrote:
>> Looks good.
> 
> Actually, after sleeping on it, there could be potentially two problems:
> 1) We don't check the relkind of the relation.  For example it is
> possible to get a tree from an index, which is incorrect.  I would
> suggest to restrain the root relation to be either a relation, a
> partitioned table or a foreign table.

Hmm, how would one find the size of a partitioned index tree if we don't
allow indexes to be passed?

> 2) What about the users who can have a look at a tree?  Shouldn't we
> tighten that a bit more?  I am sure it is not fine to allow any user to
> look at what a partition tree looks like, hence only the owner of the
> root should be able to look at its tree, no?

Maybe we should apply the same rules as are used when describing a table or
when getting its metadata via other means (pg_relation_size, etc.)?
AFAICS, no checks are performed in that case:

create table foo (a int primary key);
create user mt;
revoke all on foo from mt;
set session authorization mt;
\d foo
Table "public.foo"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Indexes:
"foo_pkey" PRIMARY KEY, btree (a)

select pg_relation_size('foo');
 pg_relation_size
──────────────────
                0
(1 row)

Should we do anything different in this case?

Thanks,
Amit




Re: Function for listing archive_status directory

2018-10-09 Thread 'Christoph Moench-Tegeder'
## Michael Paquier (mich...@paquier.xyz):

> Thanks Iwata-san.  I was just trying to apply the patch but it failed so
> the new status is fine.  On top of taking care of the rebase, please
> make sure of the following:

OK, that was an easy one.

> - Calling pg_ls_dir_files() with missing_ok set to true.

Done.

> - Renaming pg_ls_archive_status to pg_ls_archive_statusdir.

Done.

> +last modified time (mtime) of each file in the write ahead log (WAL)
> +archive_status directory. By default only superusers
> Here I would mention pg_wal/archive_status.

Done.

Attached is the updated patch. I made sure the function's OID hadn't
already been taken elsewhere, and the patch compiles and works in a quick
check.

Regards,
Christoph

-- 
Spare Space.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f984d069e1..3573896120 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20357,6 +20357,18 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
   
   

+pg_ls_archive_statusdir()
+   
+   setof record
+   
+List the name, size, and last modification time of files in the WAL
+archive status directory. Access is granted to members of the
+pg_monitor role and may be granted to other
+non-superuser roles.
+   
+  
+  
+   
 pg_ls_tmpdir(tablespace oid)

setof record
@@ -20443,6 +20455,18 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());

 

+pg_ls_archive_statusdir
+   
+   
+pg_ls_archive_statusdir returns the name, size, and
+last modified time (mtime) of each file in the write ahead log (WAL)
+pg_wal/archive_status directory. By default only
+superusers and members of the pg_monitor role can
+use this function. Access may be granted to others using
+GRANT.
+   
+
+   
 pg_ls_tmpdir


diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 020f28cbf6..0c1bcebb0d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1150,6 +1150,7 @@ REVOKE EXECUTE ON FUNCTION lo_export(oid, text) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_ls_logdir() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_waldir() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_ls_archive_statusdir() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_tmpdir() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_tmpdir(oid) FROM public;
 
@@ -1172,6 +1173,7 @@ REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 --
 GRANT EXECUTE ON FUNCTION pg_ls_logdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
 
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 85bea8d502..a3d60ae31f 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -658,3 +658,10 @@ pg_ls_tmpdir_1arg(PG_FUNCTION_ARGS)
 {
 	return pg_ls_tmpdir(fcinfo, PG_GETARG_OID(0));
 }
+
+/* Function to return the list of files in the WAL archive_status directory */
+Datum
+pg_ls_archive_statusdir(PG_FUNCTION_ARGS)
+{
+	return pg_ls_dir_files(fcinfo, XLOGDIR "/archive_status", true);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 963ff6848a..28240d6331 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10200,6 +10200,11 @@
   provolatile => 'v', prorettype => 'record', proargtypes => '',
   proallargtypes => '{text,int8,timestamptz}', proargmodes => '{o,o,o}',
   proargnames => '{name,size,modification}', prosrc => 'pg_ls_waldir' },
+{ oid => '3996', descr => 'list of files in the archive_status directory',
+  proname => 'pg_ls_archive_statusdir', procost => '10', prorows => '20', proretset => 't',
+  provolatile => 'v', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{text,int8,timestamptz}', proargmodes => '{o,o,o}',
+  proargnames => '{name,size,modification}', prosrc => 'pg_ls_archive_statusdir' },
 { oid => '5029', descr => 'list files in the pgsql_tmp directory',
   proname => 'pg_ls_tmpdir', procost => '10', prorows => '20', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',


Re: executor relation handling

2018-10-09 Thread Amit Langote
On 2018/10/08 8:18, Tom Lane wrote:
> I wrote:
>> Still need to think a bit more about whether we want 0005 in
>> anything like its current form.
> 
> So I poked at that for a bit, and soon realized that the *main* problem
> there is that ExecFindRowMark() eats O(N^2) time due to repeated searches
> of the es_rowMarks list.

Yeah, I proposed the patch only because I had stumbled across the O(N^2)
behavior, but I went about solving it with the planner hack, which I agree
is an inferior solution overall.
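
In miniature, the shape of the array fix: lookups keyed by range-table index
(rti) become a single array access instead of a linear list search, so N
lookups cost O(N) instead of O(N^2).  All names below are illustrative, not
the actual executor code:

#include <stdio.h>
#include <stdlib.h>

typedef struct RowMarkSketch
{
	int			rti;			/* range-table index, 1-based */
} RowMarkSketch;

int
main(void)
{
	int			nrels = 1000;
	RowMarkSketch **by_rti = calloc(nrels + 1, sizeof(RowMarkSketch *));

	/* register a mark for rti 42, as executor startup would */
	RowMarkSketch *rm = malloc(sizeof(RowMarkSketch));

	rm->rti = 42;
	by_rti[rm->rti] = rm;

	/* each lookup is now a single index; with a list it was a full scan */
	printf("mark for rti 42: %s\n", by_rti[42] ? "found" : "absent");

	free(rm);
	free(by_rti);
	return 0;
}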

> While the patch as stated would improve that
> for cases where most of the partitions can be pruned at plan time, it
> does nothing much for cases where they can't.  However, it's pretty
> trivial to fix that: let's just use an array not a list.  Patch 0001
> attached does that.
>
> A further improvement we could consider is to avoid opening the relcache
> entries for pruned-away relations.  I could not find a way to do that
> that was less invasive than removing ExecRowMark.relation and requiring
> callers to call a new function to get the relation if they need it.
> Patch 0002 attached is a delta on top of 0001 that does that.
>
> Replicating your select-lt-for-share test case as best I can (you never
> actually specified it carefully), I find that the TPS rate on my
> workstation goes from about 250 tps with HEAD to 920 tps with patch 0001
> or 1130 tps with patch 0002.  This compares to about 1600 tps for the
> non-FOR-SHARE version of the query.
>
> However, we should keep in mind that without partitioning overhead
> (ie "select * from lt_999 where b = 999 for share"), the TPS rate
> is over 25800 tps.  Most of the overhead in the partitioned case seems
> to be from acquiring locks on rangetable entries that we won't ever
> use, and none of these patch variants are touching that problem.
> So ISTM that the *real* win for this scenario is going to come from
> teaching the system to prune unwanted relations from the query
> altogether, not just from the PlanRowMark list.

Agreed, which is why I mentioned the other patch [1], which gets us closer
to that goal.

> Keeping that comparison in mind, I'm inclined to think that 0001
> is the best thing to do for now.  The incremental win from 0002
> is not big enough to justify the API break it creates, while your
> 0005 is not really attacking the problem the right way.

I agree that 0002's improvement is only incremental and would lose its
appeal if we're able to solve the bigger problem of removing the range
table and other overhead when planning with a large number of partitions,
but that might take a while.

Thanks,
Amit

[1] https://commitfest.postgresql.org/20/1778/




Re: out-of-order XID insertion in KnownAssignedXids

2018-10-09 Thread Michael Paquier
On Tue, Oct 09, 2018 at 02:59:00PM +0900, Michael Paquier wrote:
> +1.  I am looking at how to make that possible.

And so...  After a bit of investigation, the problem is that preparing a
2PC transaction leaves the procarray in a state where two entries refer
to the same running transaction, because MarkAsPrepared adds an extra
entry on top of the one already present.  This is expected per the
comments in ProcArrayClearTransaction(), which assumes that MyProc can be
cleared safely precisely because such a duplicate entry exists.  However,
GetRunningTransactionData() is unaware that those duplicate entries can
exist, so the snapshot data it takes is incorrect, generating the bogus
XLOG_RUNNING_XACTS records.

The simplest fix I can think of is to address the origin of the problem
in GetRunningTransactionData(), ignoring duplicate XIDs when found, since
there is necessarily a window between the time MarkAsPrepared() inserts
the 2PC entry into the procarray and the time ProcArrayClearTransaction()
gets called.
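
One simple shape such a fix could take: sort the XIDs collected from the
procarray and skip adjacent duplicates before building the record.  The
plain numeric ordering below stands in for PostgreSQL's XID comparison
rules, so treat this as an illustrative sketch only:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int TransactionId;

static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa > xb) - (xa < xb);
}

int
main(void)
{
	/* 103 appears twice: once for the backend, once for the dummy 2PC proc */
	TransactionId xids[] = {105, 103, 100, 103};
	int			n = 4;
	int			out = 0;

	qsort(xids, n, sizeof(TransactionId), xid_cmp);
	for (int i = 0; i < n; i++)
	{
		if (out > 0 && xids[out - 1] == xids[i])
			continue;			/* skip the duplicated entry */
		xids[out++] = xids[i];
	}

	for (int i = 0; i < out; i++)
		printf("%u\n", xids[i]);
	return 0;
}
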
--
Michael




Re: executor relation handling

2018-10-09 Thread Amit Langote
On 2018/10/08 9:29, David Rowley wrote:
> On 8 October 2018 at 13:13, Tom Lane  wrote:
>> The idea I had in mind was to allow hard pruning of any leaf that's
>> been excluded *at plan time* based on partition constraints seen in
>> its parent rel(s).  That should be safe enough as long as we take
>> locks on all the non-leaf rels --- then we'd know about any change
>> in the constraint situation.
>>
>> Rather than coping with renumbering the RTEs, it might be easiest
>> to invent an "RTE_DUMMY" RTE type that a hard-pruned RTE could be
>> changed to.
> 
> The problem with that is that, if we get [1] done in PG12, then the
> RTE_DUMMYs would not exist, as we'd only have RTEs in the range table
> for partitions that survived plan-time pruning.
...

> [1] https://commitfest.postgresql.org/20/1778/

Yeah, the patch proposed at [1] postpones creating partition RTEs (and
hence locking them) until after pruning, which also means we create only
the necessary RTEs.  In fact, it's not just the RTEs but also the child
PlanRowMarks whose creation is postponed until after pruning.  So, as I
admitted upthread, my proposed patch here would only add code that will
become useless if we're able to get [1] in.

Thanks,
Amit




Re: executor relation handling

2018-10-09 Thread Amit Langote
On 2018/10/09 0:38, Tom Lane wrote:
> I wrote:
>> Keeping that comparison in mind, I'm inclined to think that 0001
>> is the best thing to do for now.  The incremental win from 0002
>> is not big enough to justify the API break it creates, while your
>> 0005 is not really attacking the problem the right way.
> 
> I've pushed 0001 now.  I believe that closes out all the patches
> discussed in this thread, so I've marked the CF entry committed.
> Thanks for all the hard work!

Thanks a lot for reviewing and committing.

Regards,
Amit




Re: merge semi join cost calculation error

2018-10-09 Thread Pavel Stehule
On Mon, Oct 8, 2018 at 17:00 Tom Lane  wrote:

> Pavel Stehule  writes:
> > The user sent a plan:
>
> > QUERY PLAN
> > Merge Semi Join  (cost=82.97..580.24 rows=580 width=8) (actual time=0.503..9557.396 rows=721 loops=1)
> >   Merge Cond: (tips.users_id = follows.users_id_to)
> >   ->  Index Scan using tips_idx_users_id01 on tips  (cost=0.43..8378397.19 rows=2491358 width=16) (actual time=0.009..9231.585 rows=2353914 loops=1)
> >   ->  Sort  (cost=1.77..1.82 rows=22 width=8) (actual time=0.052..0.089 rows=28 loops=1)
> >         Sort Key: follows.users_id_to
> >         Sort Method: quicksort  Memory: 26kB
> >         ->  Seq Scan on follows  (cost=0.00..1.27 rows=22 width=8) (actual time=0.013..0.020 rows=28 loops=1)
> >               Filter: (users_id_from = 1)
>
> > He is on PostgreSQL 10.5. I cannot understand the very low total cost
> > of the Merge Semi Join given that its subnode has a very high cost of
> > 8378397.
>
> The planner seems to be supposing that the merge will stop far short of
> scanning the entire LHS table, presumably as a result of thinking that
> the maximum value of follows.users_id_to is much less than the maximum
> value of tips.users_id.  Given the actual rowcounts, that's seemingly
> not true, which suggests out-of-date stats for one table or the other.
>

Good tip: the table follows was too small to trigger autovacuum, and that
had a terrible effect. I didn't know about this optimization.

Thank you

Pavel

>
> regards, tom lane
>


Re: executor relation handling

2018-10-09 Thread Amit Langote
On 2018/10/08 3:55, Tom Lane wrote:
> I didn't like the idea of unifying ModifyTable.nominalRelation with
> the partition root info.  Those fields serve different masters ---
> nominalRelation, at least in its original intent, is only meant for
> use of EXPLAIN and might have nothing to do with what happens at
> execution.  So even though unifying them would work today, we might
> regret it down the line.  Instead I left that field alone and added
> a separate rootRelation field to carry the partition root RT index,
> which ends up being the same number of fields anyway since we don't
> need a flag for is-the-nominal-relation-a-partition-root.

Thanks for pushing that.  I'd also named it 'rootRelation' in my original
patch, before David objected to calling it that because a command may not
target the "actual" root of a partition tree; it could be a non-root
partitioned table.  He'd suggested 'partitionedTarget' for the new field
[1], stressing the "target" part.  Maybe 'rootRelation' isn't too
confusing though.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f_M0jkgL-d%3Dk-rf6TMzghATDmZ67nzja1tz4h3G%3D27e7Q%40mail.gmail.com




Re: executor relation handling

2018-10-09 Thread Amit Langote
On 2018/10/07 3:59, Tom Lane wrote:
> Amit Langote  writes:
>> On 2018/10/05 5:59, Tom Lane wrote:
>>> So I'm inclined to just omit 0003.  AFAICS this would only mean that
>>> we couldn't drop the global PlanRowMarks list from PlannedStmt, which
>>> does not bother me much.
> 
>> To be honest, I too had begun to fail to see the point of this patch since
>> yesterday.  In fact, getting this one to pass make check-world took a bit
>> of head-scratching due to its interaction with EvalPlanQual's EState
>> building, so I was almost ready to drop it from the series.  But I felt that it
>> might still be a good idea to get rid of what was described as special
>> case code.  Reading your argument against it based on how it messes up
>> some of the assumptions regarding ExecRowMark design, I'm willing to let
>> go of this one as more or less a cosmetic improvement.
> 
> OK.  We do need to fix the comments in InitPlan that talk about acquiring
> locks and how that makes us need to do things in a particular order, but
> I'll go take care of that.

Thanks for doing that.

>> So, that leaves us with only the patch that revises the plan node fields
>> in light of having relieved the executor of doing any locking of its own
>> in *some* cases.  That patch's premise is that we don't need the fields
>> that the patch removes because they're only referenced for the locking
>> purposes.  But, if those plan nodes support parallel execution and hence
>> will be passed to parallel workers for execution who will need to lock the
>> tables contained in the plan nodes, then they better contain all the
>> information needed for locking *every* affected table, so we had better
>> not remove it.
> 
> No, I think this is unduly pessimistic.  We need to make sure that a
> parallel worker has lock on tables it's actually touching, but I don't
> see why that should imply a requirement to hold lock on parent tables
> it never touches.
> 
> The reasons why we need locks on tables not physically accessed by the
> query are (a) to ensure that we've blocked, or received sinval messages
> for, any DDL related to views or partition parent tables, in case that
> would invalidate the plan; (b) to allow firing triggers safely, in
> the case of partition parent tables.  Neither of these issues apply to
> a parallel worker -- the plan is already frozen before it can ever
> start, and it isn't going to be firing any triggers either.
> 
> In particular, I think it's fine to get rid of
> ExecLockNonLeafAppendTables.  In the parent, that's clearly useless code
> now: we have already locked *every* RTE_RELATION entry in the rangetable,
> either when the parser/rewriter/planner added the RTE to the list to begin
> with, or during AcquireExecutorLocks if the plan was retrieved from the
> plancache.  In a child it seems unnecessary as long as we're locking the
> leaf rels we actually touch.
> 
> Possibly, we might fix the problem of inadequate locking in worker
> processes by having them do the equivalent of AcquireExecutorLocks, ie
> just run through the whole RT list and lock everything.  I think we'd soon
> end up doing that if we ever try to allow parallelization of non-read-only
> queries; but that's a long way off AFAIK.  For read-only queries it seems
> like it'll be fine if we just rejigger ExecGetRangeTableRelation to take a
> lock when IsParallelWorker.

Okay, thanks for the explanation.  It's now clearer to me why parallel
workers don't really need to lock non-leaf relations.

Thanks,
Amit




Re: pg_upgrade failed with ERROR: null relpartbound for relation 18159 error.

2018-10-09 Thread Amit Langote
On 2018/10/08 0:09, Michael Paquier wrote:
> On Sun, Oct 07, 2018 at 11:13:19AM -0300, Alvaro Herrera wrote:
>> Good point.  I cannot do that today, but if you want to, please go
>> ahead.
> 
> Okay, done.

Thank you both.

Regards,
Amit





Re: out-of-order XID insertion in KnownAssignedXids

2018-10-09 Thread Michael Paquier
On Mon, Oct 08, 2018 at 09:30:49AM -0700, Andres Freund wrote:
> Sounds less terrible, but still pretty bad.  I think we should fix the
> underlying data inconsistency, not paper over it a couple hundred meters
> away.

+1.  I am looking at how to make that possible.
--
Michael

