Re: unsupportable composite type partition keys

2020-01-31 Thread Julien Rouhaud
On Fri, Jan 31, 2020 at 04:20:36PM -0500, Tom Lane wrote:
> Julien Rouhaud  writes:
> > On Sun, Dec 22, 2019 at 10:51 PM Tom Lane  wrote:
> >> While poking at this, I also started to wonder why CheckAttributeType
> >> wasn't recursing into ranges, since those are our other kind of
> >> container type.  And the answer is that it must, because we allow
> >> creation of ranges over composite types:
> 
> > While working on regression tests for index collation versioning [1],
> > I noticed that the 2nd patch apparently broke the ability to create a
> > table using a range over collatable datatype attribute, which we
> > apparently don't test anywhere.
> 
> Ugh.
> 
> > AFAICT, this is only a thinko in CheckAttributeType(), where the range
> > collation should be provided rather than the original tuple desc one,
> > as per attached.  I also added a create/drop table in an existing
> > regression test that was already creating range over collatable type.
> 
> Looks good, although I think maybe we'd better test the case a little
> harder than this.  Will tweak that and push -- thanks!

Ah, I wasn't sure whether additional tests on a table would be worthwhile.
Thanks for tweaking and pushing!




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-01-31 Thread Kasahara Tatsuhito
On Thu, Jan 30, 2020 at 1:49 PM Kyotaro Horiguchi
 wrote:
> At Thu, 30 Jan 2020 13:30:56 +0900, Kasahara Tatsuhito 
>  wrote in
> > > TID scan   : yes , seq_scan, , 
> > Here is wrong, because TID scan came to have SO_TYPE_SEQSCAN flags
> > from commit 147e3722f7.
>
> It reflects the discussion below, which means TID scan doesn't
> have a corresponding SO_TYPE_ value. Currently it is setting
> SO_TYPE_SEQSCAN by accident.
Ah, sorry, I misunderstood.

Upon further investigation, I found that the SO_TYPE_SEQSCAN flag is also
used in heap_beginscan() to determine whether a predicate lock should be
taken on the entire relation.

if (scan->rs_base.rs_flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
{
/*
 * Ensure a missing snapshot is noticed reliably, even if the
 * isolation mode means predicate locking isn't performed (and
 * therefore the snapshot isn't used here).
 */
Assert(snapshot);
PredicateLockRelation(relation, snapshot);
}

Therefore, we cannot simply remove the SO_TYPE_SEQSCAN flag from a TID scan.
To keep the old behavior, I think it would be better to add a new
SO_TYPE_TIDSCAN flag and still take a predicate lock on the entire relation.
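
A minimal sketch of what I mean (the new flag's bit position here is my
assumption, not final):

/* in src/include/access/tableam.h: a dedicated scan-type flag */
SO_TYPE_TIDSCAN = 1 << 3,

/* in heap_beginscan(): keep the relation-level predicate lock for TID scans */
if (scan->rs_base.rs_flags &
    (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN | SO_TYPE_TIDSCAN))
{
    Assert(snapshot);
    PredicateLockRelation(relation, snapshot);
}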

Attached is the v2 patch, which changes the status to the following
(same as v1, but with the new SO_TYPE_TIDSCAN flag):

                                 increments
scan type : table_beginscan? , per scan , per tuple , SO_TYPE flags
===================================================================
TID scan  : yes              ,          ,           , SO_TYPE_TIDSCAN

Is this change acceptable for HEAD and v12?

Best regards,

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com


fix_tidscan_increments_seqscan_num_v2.patch
Description: Binary data


Re: [Proposal] Add accumulated statistics for wait event

2020-01-31 Thread Pavel Stehule
Hi

On Wed, Jan 15, 2020 at 14:15, Imai Yoshikazu wrote:

> On 2020/01/13 4:11, Pavel Stehule wrote:
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:       tested, passed
> > Spec compliant:           not tested
> > Documentation:            tested, passed
> >
> > I like this patch, because I used similar functionality some years ago
> > very successfully. The implementation is simple, and the results should
> > be valid given the method used.
>
> Thanks for your review!
>
> > The potential problem is the performance impact. Very early tests show an
> > impact of circa 3% in the worst case, but I'll try to repeat these tests.
>
> Yes, the performance impact is the main concern. I want to know how it
> affects performance in various test cases and on various environments.
>
> > There is some trailing whitespace and there are useless tabs.
> >
> > The new status of this patch is: Waiting on Author
>
> I attach v4 patches removing those extra whitespaces at the ends of lines
> and the useless tabs.
>

Today I ran 120 five-minute pgbench tests to measure the impact of this
patch. The results are attached.

The test script:

PSQL="/usr/local/pgsql/bin/psql"
PGBENCH="/usr/local/pgsql/bin/pgbench"
export PGDATABASE=postgres

echo "*** START ***" > ~/result.txt

for i in 1 5 10 50 100
do
  echo "scale factor $i" >> ~/result.txt

  $PSQL -c "create database bench$i"
  $PGBENCH -i -s $i "bench$i"

  for c in 1 5 10 50
  do
$PGBENCH -c $c -T 300 "bench$i" >> ~/result.txt
  done

  $PSQL -c "vacuum full" "bench$i"
  $PSQL -c "vacuum analyze" "bench$i"

  for c in 1 5 10 50
  do
$PGBENCH -S -c $c -T 300 "bench$i" >> ~/result.txt
  done

  $PSQL -c "drop database bench$i"
done

Tested on a computer with 4 CPUs and 8GB RAM; configuration: shared_buffers
2GB, work_mem 20MB.

The results are interesting. When I ran pgbench in R/W mode, I got +/- 1%
changes in performance; it doesn't matter whether tracking time is active or
not (tested on Linux). In this mode the new code is not on the critical path.

More interesting results come from the read-only tests, where some larger
differences are visible.

For scale 5 and 50 users, tracking time increased performance by about
12% (I got the same result for scale/users 10/50); in the other direction,
the patched build decreased performance by about 10% for 50/50 (both with
and without tracking time) and for 100/5.

It looks like for scales higher than 5 and around 50 users the read-only
results are not very stable (I repeated those tests), and the overhead is
about 10%, from 60K tps down to 55K tps - maybe I hit a hardware limit (it
was running with 4 CPUs).

Thanks to Tomas Vondra and 2ndQuadrant for the testing hardware.

Regards

Pavel

 wait_event_type |  wait_event   |   calls|times
-+---++--
 Client  | ClientRead| 1489681408 | 221616362961
 Lock| transactionid |  103113369 |  71918794185
 LWLock  | WALWriteLock  |  104781468 |  20865855903
 Lock| tuple |   21323744 |  15800875242
 IO  | WALSync   |   50862170 |   8666988491
 LWLock  | lock_manager  |   18415423 |575308266
 IO  | WALWrite  |   51482764 |205775892
 LWLock  | buffer_content|   15385387 |168446128
 LWLock  | wal_insert|1502092 | 90019731
 IPC | ProcArrayGroupUpdate  | 178238 | 46527364
 LWLock  | ProcArrayLock | 587356 | 13298246
 IO  | DataFileExtend|2715557 | 11615216
 IPC | ClogGroupUpdate   |  54319 | 10622013
 IO  | DataFileRead  |5805298 |  9596545
 IO  | SLRURead  |9518930 |  7166217
 LWLock  | CLogControlLock   | 746759 |  6783602





> --
> Yoshikazu Imai
>


pgbench.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs

2020-01-31 Thread Amit Kapila
On Mon, Nov 25, 2019 at 1:17 PM Michael Paquier  wrote:
>
> On Mon, Sep 09, 2019 at 05:34:43PM +0530, Amit Kapila wrote:
> > The only difference is in the last line where for me it gives
> > assertion failure when trying to do ReleaseAuxProcessResources.  Below
> > is the callstack:
>
> No need for Windows on this one and I have reproduced easily the same
> trace as Amit.  The patch has been moved to next CF.  Chengchao, could
> you provide an update please?
>

I have marked this patch as "Returned with feedback" as it has been a long
time since the author last responded.  Feel free to provide a new patch for
the next CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-01-31 Thread Amit Kapila
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar  wrote:
>
> Other comments look fine to me so I will reply to them along with the
> next version of the patch.
>

This still needs more work, so I have moved this to the next CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Don't try fetching future segment of a TLI.

2020-01-31 Thread David Steele

On 1/28/20 8:02 PM, Kyotaro Horiguchi wrote:
> At Tue, 28 Jan 2020 19:13:32 +0300, Pavel Suderevsky wrote:
>> Regarding influence: the issue is not about the large amount of WAL to
>> apply but about searching for non-existing WAL on the remote storage;
>> each such search can take 5-10 seconds, while obtaining an existing WAL
>> segment takes milliseconds.
>
> Wow. I didn't know of a file system that takes that many seconds
> trying non-existent files. Although I still think this is not a bug,
> avoiding that actually leads to a big win on such systems.

I have not tested this case but I can imagine it would be slow in 
practice.  It's axiomatic that it is hard to prove a negative.  With 
multi-region replication it might well take some time to be sure that 
the file *really* doesn't exist and hasn't just been lost in a single 
region.


> After a thought, I think it's safe and effectively doable to let
> XLogFileReadAnyTLI() refrain from trying WAL segments of too-high
> TLIs.  Some garbage archive files out of the range of a timeline might
> be seen, for example, after reusing an archive directory without clearing
> files.  However, fetching such garbage just to fail doesn't
> contribute to durability or reliability at all, I think.

The patch seems sane, the trick will be testing it.

Pavel, do you have an environment where you can ensure this is a 
performance benefit?


Regards,
--
-David
da...@pgmasters.net




Re: dropdb --force

2020-01-31 Thread Amit Kapila
On Sat, Nov 30, 2019 at 7:46 AM Michael Paquier  wrote:
>
> On Fri, Nov 29, 2019 at 03:31:21PM +0530, Amit Kapila wrote:
> > It might be better to move it to next CF as the discussion is still active.
>
> OK, just did that.
>

I have marked this as committed in CF.  This was committed some time
back[1][2].  I was just waiting for the conclusion on what we want to
do about Windows behavior related to socket closure which we
discovered while testing this feature.  It is not very clear whether
we want to do anything about it, see discussion on thread [3], so I
closed this.

[1] - 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1379fd537f9fc7941c8acff8c879ce3636dbdb77
[2] - 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=80e05a088e4edd421c9c0374d54d787c8a4c0d86
[3] - 
https://www.postgresql.org/message-id/CALDaNm2tEvr_Kum7SyvFn0%3D6H3P0P-Zkhnd%3DdkkX%2BQ%3DwKutZ%3DA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




What's difference among insert/write/flush lsn?

2020-01-31 Thread Jinhua Luo
Hi,

pg_current_wal_flush_lsn()    pg_lsn    Get current write-ahead log flush location
pg_current_wal_insert_lsn()   pg_lsn    Get current write-ahead log insert location
pg_current_wal_lsn()          pg_lsn    Get current write-ahead log write location

I guess "write" is about how many bytes have been written out of the shared
cache, and "flush" means flushed to the file, which makes it persistent.
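
For example, the three values can be compared side by side (the comments
are my guesses, matching the above):

SELECT pg_current_wal_insert_lsn() AS insert_lsn, -- reserved in WAL buffers?
       pg_current_wal_lsn()        AS write_lsn,  -- written out of the buffers?
       pg_current_wal_flush_lsn()  AS flush_lsn;  -- fsync'ed, i.e. persistent?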

Can anybody give an official explanation?
Thanks.

Regards,
Jinhua Luo




Re: Shared memory leak on DSM slot exhaustion

2020-01-31 Thread Thomas Munro
On Sat, Feb 1, 2020 at 7:37 AM Robert Haas  wrote:
> Whoops. The patch looks OK to me.

Pushed.




Re: Created feature for to_date() conversion using patterns 'YYYY-WW', 'YYYY-WW-D', 'YYYY-MM-W' and 'YYYY-MM-W-D'

2020-01-31 Thread Tom Lane
Cleysson Lima  writes:
> this is a review of the patch: chg_to_date_wwd.patch
> There hasn't been any problem, at least that I've been able to find.

AFAICS, the point of this patch is to make to_date symmetrical
with the definition of WW that the other patch wants for to_char.
But the other patch is wrong, for the reasons I explained upthread,
so I doubt that we want this one either.

I still think that it'd be necessary to invent at least one new
format field code in order to get to a sane version of this feature.
As they stand, 'WW' and 'D' do not agree on what a week is, and
changing the behavior of either one in order to make them agree
is just not going to happen.

BTW, I went to check on what Oracle thinks about this, since these
functions are allegedly Oracle-compatible.  On PG, I get this
for the WW and D values for the next few days:

select to_char(current_date+n, 'YYYY-MM-DD YYYY-WW-D Day')
from generate_series(0,10) n;

            to_char            
-------------------------------
 2020-01-31 2020-05-6 Friday   
 2020-02-01 2020-05-7 Saturday 
 2020-02-02 2020-05-1 Sunday   
 2020-02-03 2020-05-2 Monday   
 2020-02-04 2020-05-3 Tuesday  
 2020-02-05 2020-06-4 Wednesday
 2020-02-06 2020-06-5 Thursday 
 2020-02-07 2020-06-6 Friday   
 2020-02-08 2020-06-7 Saturday 
 2020-02-09 2020-06-1 Sunday   
 2020-02-10 2020-06-2 Monday   
(11 rows)

I did the same calculations using Oracle 11g R2 on sqlfiddle.com
and got the same results.  Interestingly, though, I also tried it on

https://rextester.com/l/oracle_online_compiler

and here's what I get there:

2020-01-31 2020-05-5 Freitag
2020-02-01 2020-05-6 Samstag
2020-02-02 2020-05-7 Sonntag
2020-02-03 2020-05-1 Montag
2020-02-04 2020-05-2 Dienstag
2020-02-05 2020-06-3 Mittwoch
2020-02-06 2020-06-4 Donnerstag
2020-02-07 2020-06-5 Freitag
2020-02-08 2020-06-6 Samstag
2020-02-09 2020-06-7 Sonntag
2020-02-10 2020-06-1 Montag

(I don't know how to switch locales on these sites, so I don't have
any way to check what happens in other locales.)

So we agree with Oracle on what WW means, but they count D as 1-7
starting on either Sunday or Monday according to locale.  I wonder
whether we should change to match that?  Maybe "TMD" should act that
way?  It's already the case that their "Day" acts like our "TMDay",
evidently.

Either way, though, the WW weeks don't line up with the D weeks,
and we're not likely to make them do so.

So I think an acceptable version of this feature has to involve
defining at least one new format code and maybe as many as three,
to produce year, week and day values that agree on whichever
definition of "a week" you want to use, and then to_date has to
enforce that input uses matching year/week/day field types,
very much like it already does for ISO versus Gregorian dates.

I also notice that neither patch touches the documentation.
A minimum requirement here is defining what you think the underlying
"week" is, if it's neither ISO nor the existing WW definition.
As I said before, it'd also be a good idea to provide some
evidence that there are other people using that same week definition.

regards, tom lane




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-01-31 Thread Alexey Kondratov

On 2020-01-24 08:50, Michael Paquier wrote:
> On Wed, Jan 22, 2020 at 12:55:30AM +0300, Alexander Korotkov wrote:
> > On Sun, Jan 19, 2020 at 1:24 PM Michael Paquier wrote:
> > > +static int
> > > +RestoreArchivedWALFile(const char *path, const char *xlogfname,
> > > +                       off_t expectedSize, const char *restoreCommand)
> > > Not a fan of putting that to pg_rewind/parsexlog.c.  It has nothing to
> > > do with WAL parsing, and it seems to me that we could have an argument
> > > for making this available as a frontend-only API in src/common/.
> >
> > I've put it into src/common/fe_archive.c.
>
> This split makes sense.  You have forgotten to update
> @pgcommonfrontendfiles in Mkvcbuild.pm for the MSVC build.  Still, I
> can see that we have a lot of duplication between the code of the
> frontend and the backend, which is annoying..  The execution part is
> tricky to split because the backend has pre- and post- callbacks, the
> interruption handling is not the same and there is the problem of
> elog() calls, with elevel that can be changed depending on the backend
> configuration.  However, I can see one portion of the code which is
> fully duplicated and could be refactored out: the construction of the
> command to execute, by having in input the restore_command string and
> the arguments that we expect to replace in the '%' markers which are
> part of the command.



OK, I agree that duplicating this part directly is annoying. I did the
refactoring and moved this restore_command building code into
common/archive. A separate file was needed, since fe_archive is
frontend-only, so I could not put universal code there. I do not know
whether there is any better place for it.




> If NULL is specified for one of those values,
> then the construction routine returns NULL, synonym of failure.  And
> the result of the routine is the built command.  I think that you
> would need an extra argument to give space for the error message
> generated in the event of an error when building the command.



I did it in a slightly different way, but the overall logic is exactly as
you proposed, I hope. I have also checked the Windows/Linux builds in a
CI setup similar to cfbot's, and now everything runs well.




> +++ b/src/include/common/fe_archive.h
> +#ifndef ARCHIVE_H
> +#define ARCHIVE_H
> This should be FE_ARCHIVE_H.



Changed.



> -   while ((c = getopt_long(argc, argv, "D:nNPR", long_options,
> -                           &option_index)) != -1)
> +   while ((c = getopt_long(argc, argv, "D:nNPRC:c", long_options,
> +                           &option_index)) != -1)
> This still includes 'C', but that should not be the case.
>
> +   pg_rewind with the -c or
> +   -C option to automatically retrieve them from the WAL
> [...]
> +  -C restore_command
> +  --target-restore-command=restore_command
> [...]
> +  available, try re-running pg_rewind with
> +  the -c or -C option to search
> +  for the missing files in the WAL archive.
> Three places in the docs need to be cleaned up.



I have fixed all the above, thanks.



> Do we really need to test the new "archive" mode as well for
> 003_extrafiles and 002_databases?  I would imagine that only 001_basic
> is enough.


We do not check any unique code paths with it, so I now do it only for
001_basic. I have left it available as a test mode, since that makes it
easy to turn on for any other test in the future if needed, and it fits
the RewindTest.pm module well.




> +# Moves WAL files to the temporary location and returns restore_command
> +# to get them back.
> +sub move_wal
> +{
> +   my ($tmp_dir, $master_pgdata) = @_;
> +   my $wal_archive_path = "$tmp_dir/master_wal_archive";
> +   my $wal_path = "$master_pgdata/pg_wal";
> +   my $wal_dir;
> +   my $restore_command;
> I think that this routine is wrongly designed.  First, in order to
> copy the contents from one path to another, we have
> RecursiveCopy::copy_path, and there is a long history about making
> it safe for the TAP tests.  Second, there is in
> PostgresNode::enable_restoring already a code path doing the
> decision-making on how to build restore_command.  We should keep this
> code path unique.



Yeah, thank you for pointing me to RecursiveCopy::copypath and
especially PostgresNode::enable_restoring; I have modified the test to use
them instead. The copypath routine does not work with existing destination
directories and does not preserve the initial permissions, so I had to do
some dirty work to achieve what we need in the test while keeping it
simple at the same time. However, I think that using these unified
routines is much better for future maintenance.


New version is attached. Do you have any other comments or objections?


Regards
--
Alexey

From 7af0b3642f6218c764eee361e013f24bfb43ffbe Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Fri, 31 Jan 2020 20:08:13 +0300
Subject: [PATCH v14] pg_rewind: Add options to restore WAL files from archive

Currently, pg_rewind fails when it cannot find the required WAL files in the
target data directory.  One has to manually figure out which WAL files are
required and copy them back from the archive.

This commit implements two new pg_rewind options, 

Re: Created feature for to_date() conversion using patterns 'YYYY-WW', 'YYYY-WW-D', 'YYYY-MM-W' and 'YYYY-MM-W-D'

2020-01-31 Thread Cleysson Lima
On Fri, Jan 10, 2020 at 09:22, Mark Lorenz wrote:

> Updated the chg_to_date_wwd.patch with additional tests (because it
> works not only for 'D' pattern but also for all day patterns like 'Day'
> or 'DY'). Added the necessary documentation change.
>
> (The fix_to_char_wwd.patch from
> f4e740a8de3ad1e762a28f6ff253ea4f%40four-two.de is still up-to-date)



The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

Hi Mark,

this is a review of the patch: chg_to_date_wwd.patch

There hasn't been any problem, at least that I've been able to find.

This one applies cleanly.

The entire compilation went without error as well.

# Without patch #

postgres=# SELECT to_date('2019-1-1', 'YYYY-WW-D');
  to_date   
------------
 2019-01-01
(1 row)

postgres=# SELECT to_date('2019-1-2', 'YYYY-WW-D');
  to_date   
------------
 2019-01-01
(1 row)

postgres=# SELECT to_date('2019-1-9', 'YYYY-WW-D');
  to_date   
------------
 2019-01-01
(1 row)


# With patch #

postgres=# SELECT to_date('2019-1-1', 'YYYY-WW-D');
  to_date   
------------
 2018-12-30
(1 row)

postgres=# SELECT to_date('2019-1-2', 'YYYY-WW-D');
  to_date   
------------
 2018-12-31
(1 row)

postgres=# SELECT to_date('2019-1-9', 'YYYY-WW-D');
  to_date   
------------
 2019-01-07
(1 row)

+1 for committer review

--
Cleysson Lima


Re: PATCH: Fix wrong size argument to pg_strncasecmp

2020-01-31 Thread Tom Lane
Dominik Czarnota  writes:
> This patch fixes a size parameter of `pg_strncasecmp` which compared a
> "string" literal with a variable by passing a size of 5 while the "string"
> literal has 6 bytes.

Pushed, thanks for the report!

> By the way, the `strncasecmp` usages around the fixed line could use
> `strcasecmp` which doesn't accept the `size_t n` argument.

Maybe.  It's not clear to me that it'd be okay to assume that the
variable input string is null-terminated.
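
For what it's worth, a made-up illustration of both points (not the actual
code that was fixed):

#include <stdio.h>
#include <strings.h>

int
main(void)
{
	/* Not NUL-terminated: strncasecmp() with an explicit bound stays
	 * within the buffer, while strcasecmp() would read past its end. */
	char		buf[6] = {'s', 't', 'r', 'i', 'n', 'g'};

	printf("%d\n", strncasecmp(buf, "string", 6) == 0);		/* 1, and safe */

	/* The original bug class: n = 5 compares only "strin", so a
	 * mismatching 6th byte goes unnoticed. */
	printf("%d\n", strncasecmp("strinX", "string", 5) == 0); /* 1: false match */
	return 0;
}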

regards, tom lane




Re: widen vacuum buffer counters

2020-01-31 Thread Tom Lane
Alvaro Herrera  writes:
> We recently noticed that vacuum buffer counters wrap around in extreme
> cases, with ridiculous results.

Ugh.

> I propose to backpatch this.

+1 for widening these counters, but since they're global variables, -0.2
or so for back-patching.  I don't know of any reason that an extension
would be touching these, but I feel like the problem isn't severe enough
to justify taking an ABI-break risk.

Also, %zd is the wrong format code for int64.  Recommended practice
these days is to use "%lld" with an explicit cast of the printf argument
to long long (just to be sure).  That doesn't work safely before v12,
and if you did insist on back-patching further, you'd need to jump
through hoops to avoid having platform-specific format codes in a
translatable string.  (The side-effects for translation seem like
an independent argument against back-patching.)
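
Concretely, for the message in this patch the recommended pattern would
look roughly like this (a sketch, not proposed wording):

appendStringInfo(&buf,
				 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
				 (long long) VacuumPageHit,
				 (long long) VacuumPageMiss,
				 (long long) VacuumPageDirty);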

regards, tom lane




Re: WIP: Aggregation push-down

2020-01-31 Thread legrand legrand
Hello,

Thank you for this great feature !
I hope this will be reviewed/validated soon ;o)

Just a comment:
set enable_agg_pushdown to true;
isn't displayed in EXPLAIN (SETTINGS) syntax.


The following modification seems to fix that: 

src/backend/utils/misc/guc.c

	{
		{"enable_agg_pushdown", PGC_USERSET, QUERY_TUNING_METHOD,
			gettext_noop("Enables aggregation push-down."),
			NULL,
			GUC_EXPLAIN		/* --- line added --- */
		},
		&enable_agg_pushdown,
		false,
		NULL, NULL, NULL
	},


then
postgres=# set enable_agg_pushdown = true;
SET
postgres=# explain (settings) select 1;
QUERY PLAN
--
 Result  (cost=0.00..0.01 rows=1 width=4)
 Settings: enable_agg_pushdown = 'on'
(2 rows)


Regards
PAscal



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: unsupportable composite type partition keys

2020-01-31 Thread Tom Lane
Julien Rouhaud  writes:
> On Sun, Dec 22, 2019 at 10:51 PM Tom Lane  wrote:
>> While poking at this, I also started to wonder why CheckAttributeType
>> wasn't recursing into ranges, since those are our other kind of
>> container type.  And the answer is that it must, because we allow
>> creation of ranges over composite types:

> While working on regression tests for index collation versioning [1],
> I noticed that the 2nd patch apparently broke the ability to create a
> table using a range over collatable datatype attribute, which we
> apparently don't test anywhere.

Ugh.

> AFAICT, this is only a thinko in CheckAttributeType(), where the range
> collation should be provided rather than the original tuple desc one,
> as per attached.  I also added a create/drop table in an existing
> regression test that was already creating range over collatable type.

Looks good, although I think maybe we'd better test the case a little
harder than this.  Will tweak that and push -- thanks!

regards, tom lane




Re: widen vacuum buffer counters

2020-01-31 Thread Julien Rouhaud
On Fri, Jan 31, 2020 at 9:59 PM Alvaro Herrera  wrote:
>
> We recently noticed that vacuum buffer counters wrap around in extreme
> cases, with ridiculous results.  Example:
>
> 2020-01-06 16:38:38.010 EST [45625-1] app= LOG:  automatic vacuum of table 
> "somtab.sf.foobar": index scans: 17
> pages: 0 removed, 207650641 remain, 0 skipped due to pins, 13419403 
> skipped frozen
> tuples: 141265419 removed, 3186614627 remain, 87783760 are dead but 
> not yet removable
> buffer usage: -2022059267 hits, -17141881 misses, 1252507767 dirtied
> avg read rate: -0.043 MB/s, avg write rate: 3.146 MB/s
> system usage: CPU 107819.92s/2932957.75u sec elapsed 3110498.10 sec
>
> That's to be expected, as tables exist that are large enough for 4 billion
> buffer accesses to be a possibility.  Let's widen the counters, as in the
> attached patch.
>
> I propose to backpatch this.

+1, and patch LGTM.




widen vacuum buffer counters

2020-01-31 Thread Alvaro Herrera
We recently noticed that vacuum buffer counters wrap around in extreme
cases, with ridiculous results.  Example:

2020-01-06 16:38:38.010 EST [45625-1] app= LOG:  automatic vacuum of table 
"somtab.sf.foobar": index scans: 17
pages: 0 removed, 207650641 remain, 0 skipped due to pins, 13419403 
skipped frozen
tuples: 141265419 removed, 3186614627 remain, 87783760 are dead but not 
yet removable
buffer usage: -2022059267 hits, -17141881 misses, 1252507767 dirtied
avg read rate: -0.043 MB/s, avg write rate: 3.146 MB/s
system usage: CPU 107819.92s/2932957.75u sec elapsed 3110498.10 sec

That's to be expected, as tables exist that are large enough for 4 billion
buffer accesses to be a possibility.  Let's widen the counters, as in the
attached patch.
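
For the record, the arithmetic checks out as a plain 32-bit wraparound
(a standalone illustration, relying on the usual two's-complement
conversion):

#include <stdio.h>

int
main(void)
{
	/* 2272908029 hits overflow a 32-bit signed counter:
	 * 2272908029 - 2^32 = -2022059267, the value in the log above. */
	long long	hits = 2272908029LL;

	printf("%d\n", (int) hits);	/* prints -2022059267 */
	return 0;
}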

I propose to backpatch this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8ce501151e..049ec2703b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -614,7 +614,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			 vacrelstats->new_dead_tuples,
 			 OldestXmin);
 			appendStringInfo(&buf,
-			 _("buffer usage: %d hits, %d misses, %d dirtied\n"),
+			 _("buffer usage: %zd hits, %zd misses, %zd dirtied\n"),
 			 VacuumPageHit,
 			 VacuumPageMiss,
 			 VacuumPageDirty);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index b1f6291b99..eb19644419 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -140,9 +140,9 @@ int			VacuumCostPageDirty = 20;
 int			VacuumCostLimit = 200;
 double		VacuumCostDelay = 0;
 
-int			VacuumPageHit = 0;
-int			VacuumPageMiss = 0;
-int			VacuumPageDirty = 0;
+int64		VacuumPageHit = 0;
+int64		VacuumPageMiss = 0;
+int64		VacuumPageDirty = 0;
 
 int			VacuumCostBalance = 0;	/* working state for vacuum */
 bool		VacuumCostActive = false;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 62d64aa0a1..f985453ec3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -252,9 +252,9 @@ extern int	VacuumCostPageDirty;
 extern int	VacuumCostLimit;
 extern double VacuumCostDelay;
 
-extern int	VacuumPageHit;
-extern int	VacuumPageMiss;
-extern int	VacuumPageDirty;
+extern int64 VacuumPageHit;
+extern int64 VacuumPageMiss;
+extern int64 VacuumPageDirty;
 
 extern int	VacuumCostBalance;
 extern bool VacuumCostActive;


Re: get a relations DDL server-side

2020-01-31 Thread Tom Lane
"Jordan Deitch"  writes:
> I would like to introduce the ability to get object DDL (server-side) by 
> introducing a new function with roughly the following prototype:
> get_ddl(regclass)
> RETURNS text
> LANGUAGE C STRICT PARALLEL SAFE;

Umm ... "regclass" would only be appropriate for relations.

If you actually want to support more than one type of object with a single
function, you'll need two OIDs.  Catalog's OID and object's OID are the
usual choices, per pg_describe_object() and similar functions.
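
For instance (output wording from memory, so treat it as approximate):

-- classid = the catalog holding the object, objid = the object itself
SELECT pg_describe_object('pg_class'::regclass,
                          'pg_constraint'::regclass::oid, 0);
-- returns something like "table pg_constraint"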

I don't think "get_ddl" is a particularly apt function name, either.
It ignores the precedent of existing functions with essentially this
same functionality, such as pg_get_triggerdef(), pg_get_constraintdef(),
etc.  One wonders why duplicate that existing functionality, so maybe
you should think about adding per-object-type functions instead of
trying to make one function to rule them all.

The larger reason why this doesn't exist already, BTW, is that we've
tended to find that it's not all that useful to get back a monolithic
chunk of DDL text for complicated objects such as tables.  You should
provide a little more clarity as to what use-case you foresee, because
otherwise there are just a *whole* lot of things that aren't clear.
Some examples:

* Should the output include a CREATE COMMENT if the object has a comment?
* What about ownership and ACL (grants)?
* On tables, are foreign keys part of the table, or are they distinct
  objects?  (Hint: both answers can be correct depending on use-case)
* How about indexes, and do you want to treat constraint indexes
  differently from other ones?  (Constraint indexes *could* be made
  part of the table's DDL, but other indexes need a separate CREATE)
* Do you need options, such as whether to pretty-print expressions?

You might also find it instructive to dig through the archives for
past discussions about moving more of pg_dump's logic into the server;
that's the area where this has come up over and over.

regards, tom lane




Re: pg_restore crash when there is a failure before all child process is created

2020-01-31 Thread Tom Lane
ahsan hadi  writes:
> I have applied and tested both patches separately and ran the regression
> tests with both.  No new test cases are failing with either patch.
> The issue is fixed by both patches; however, the fix from Tom
> (0001-fix-worker-status.patch) looks more elegant.  I haven't done a
> detailed code review.

Pushed, thanks for looking!

regards, tom lane




Re: [Proposal] Global temporary tables

2020-01-31 Thread Robert Haas
On Thu, Jan 30, 2020 at 4:33 AM Konstantin Knizhnik
 wrote:
> On 29.01.2020 21:16, Robert Haas wrote:
> > On Wed, Jan 29, 2020 at 10:30 AM Konstantin Knizhnik
> >  wrote:
> >
> > I think that the idea of calling ambuild() on the fly is not going to
> > work, because, again, I don't think that calling that from random
> > places in the code is safe.
>
> It is not a random place in the code.
> Actually it is just one place - _bt_getbuf.
> Why can it be unsafe if it affects only private backend data?

Because, as I already said, not every operation is safe at every point
in the code. This is true even when there's no concurrency involved.
For example, executing user-defined code is not safe while holding a
buffer lock, because the user-defined code might try to do something
that locks the same buffer, which would cause an undetected,
uninterruptible deadlock.

> But GTT case is different. Heap and indexes can be easily initialized by
> backend  using existed functions.

That would be nice if we could make it work. Someone would need to
show, however, that it's safe.

> You say that it is not safe. But you have not explained why it is unsafe.
> Yes, I agree that it is my responsibility to prove that it is safe.
> And as I already wrote, I cannot provide such proof now. I will be
> pleased if you or anybody else can help convince me that this approach
> is safe, or demonstrate problems with it.

That's fair, but nobody's obliged to spend time on that.

> But I would really like to receive more constructive criticism rather
> than "this approach is wrong because it is unsafe".

I'm sure, and that's probably valid. Equally, however, I'd like to
receive more analysis of why it is safe than "I don't see anything
wrong with it so it's probably fine." And I think that's pretty valid,
too.

> Actually the index is not created on the fly.
> The index is created in the usual way, by executing the "create index"
> command. So all components of Postgres (planner, executor, ...) treat GTT
> indexes in the same way as regular indexes.
> Locking and invalidation policies are exactly the same for them.
> The only difference is that the content of a GTT index is constructed on
> demand using private backend data.
> Is it safe or not? We are just reading data from local buffers/files and
> writing them here.
> Maybe I missed something, but I do not see anything unsafe here.
> There are issues with updating statistics, but they can be solved.

But that's not all you are doing. To build the index, you'll have to
sort the data. To sort the data, you'll have to call btree support
functions. Those functions can be user-defined, and can do complex
operations like catalog access that depend on a good transaction
state, no buffer locks being held, and nothing already in progress
within this backend that can get confused as a result of this
operation.

Just as a quick test, I tried doing this in _bt_getbuf:

+if (InterruptHoldoffCount != 0)
+elog(WARNING, "INTERRUPTS ARE HELD OFF");

That causes 103 out of 196 regression tests to fail, which means that
it's pretty common to arrive in _bt_getbuf() with interrupts held off.
At the very least, that means that the index build would be
uninterruptible, which already seems unacceptable. Probably, it means
that the calling code is already holding an LWLock, because that's
normally what causes HOLD_INTERRUPTS() to happen. And as I have
already explained, that is super-problematic, because of deadlock
risks, and because it risks putting other backends into
non-interruptible waits if they should happen to need the LWLock we're
holding here.

I really don't understand why the basic point here remains obscure. In
general, it tends to be unsafe to call high-level code from low-level
code, not just in PostgreSQL but in pretty much any program. Do you
think that we can safely add a GUC that executes a user-defined SQL
query every time an LWLock is acquired? If you do, why don't you try
adding code to do that to LWLockAcquire and testing it out a little
bit? Try making the SQL query do something like query pg_class, find a
table name that's not in use, and create a table by that name. Then
run the regression tests with the GUC set to run that query and see
how it goes. I always hate to say that things are "obvious," because
what's obvious to me may not be obvious to somebody else, but it is
clear to me, at least, that this has no chance of working. Even though
I can't say exactly what will break, or what will break first, I'm
very sure that a lot of things will break and that most of them are
unfixable.

Now, your idea is not quite as crazy as that, but it has the same
basic problem: you can't insert code into a low-level facility that
uses a high level facility which may in turn use and depend on that
very same low-level facility to not be in the middle of an operation.
If you do, it's going to break somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




get a relations DDL server-side

2020-01-31 Thread Jordan Deitch
I would like to introduce the ability to get object DDL (server-side) by 
introducing a new function with roughly the following prototype:

get_ddl(regclass)
RETURNS text
LANGUAGE C STRICT PARALLEL SAFE;

A previous conversation seemed to encourage the development of this feature

https://www.postgresql.org/message-id/CADkLM=fxfsrhaskk_by_a4uomj1te5mfggd_rwwqfv8wp68...@mail.gmail.com

I would like to start work on this patch but wanted acceptance on the function 
signature. 

Thank you!




Re: Shared memory leak on DSM slot exhaustion

2020-01-31 Thread Robert Haas
On Thu, Jan 30, 2020 at 4:54 AM Thomas Munro  wrote:
> As reported over on pgsql-general[1], we leak shared memory when we
> run out of DSM slots.  To see this, add the random-run-out-of-slots
> hack I showed in that thread, create and analyze a table t(i) with a
> million integers, run with dynamic_shared_memory_type=mmap, and try
> SELECT COUNT(*) FROM t t1 JOIN t t2 USING (i) a few times and you'll
> see that pgbase/pg_dynshmem fills up with leaked memory segments each
> time an out-of-slots error is raised.  (It happens with all DSM
> types, but then the way to list the segments varies or there isn't
> one, depending on type and OS.)  Here's a draft patch to fix that.

Whoops. The patch looks OK to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Extracting only the columns needed for a query

2020-01-31 Thread Ashwin Agrawal
On Wed, Jan 15, 2020 at 7:54 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > For UPDATE, we need all of the columns in the table because of the
> > table_lock() API's current expectation that the slot has all of the
> > columns populated. If we want UPDATE to only need to insert the column
> > values which have changed, table_tuple_lock() will have to change.
>
> Can you please elaborate on this part? I'm probably missing something,
> since I don't see immediately where are those expectations expressed.
>

table_tuple_lock() has a TupleTableSlot as an output argument. The comment
for the function states "*slot: contains the target tuple". Callers of
table_tuple_lock() such as ExecUpdate() and GetTupleForTrigger() use the
returned tuple for EvalPlanQual. Also, within table_tuple_lock() it is
unknown what the caller expects and what will be performed on the returned
slot/tuple. Hence, it becomes tricky for any AM to guess, and it currently
needs to return the full tuple, as the API doesn't state which columns the
caller is interested in, or whether it just wishes to take the lock and
needs nothing back in the slot.
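
For reference, the declaration under discussion looks roughly like this
(quoting tableam.h from memory, so details may be off):

/* *slot is an output argument: on success it must contain the whole
 * locked tuple, because callers may feed it to EvalPlanQual. */
extern TM_Result table_tuple_lock(Relation rel, ItemPointer tid,
								  Snapshot snapshot, TupleTableSlot *slot,
								  CommandId cid, LockTupleMode mode,
								  LockWaitPolicy wait_policy, uint8 flags,
								  TM_FailureData *tmfd);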


Re: Extracting only the columns needed for a query

2020-01-31 Thread Melanie Plageman
I'm bumping this to the next commitfest because I haven't had a chance
to address the questions posed by Dmitry. I'm sure I'll get to it in
the next few weeks.

> I believe it would be beneficial to add this potential API extension
> patch into the thread (as an example of an interface defining how
> scanCols could be used) and review them together.

As for including some code that uses the scanCols, after discussing
off-list with a few folks, there are three options I would like to
pursue for doing this.

One option I will pursue is using the scanCols to inform the columns
needed to be spilled for memory-bounded hashagg (mentioned by Jeff
here [1]).

The second is potentially using the scanCols in the context of FDW.
Corey, would you happen to have any specific examples handy of when
this might be useful?

The third is exercising it with a test only but providing an example
of how a table AM API user like Zedstore uses the columns during
planning.

[1]
https://www.postgresql.org/message-id/e5566f7def33a9e9fdff337cca32d07155d7b635.camel%40j-davis.com

-- 
Melanie Plageman


Re: Unix-domain socket support on Windows

2020-01-31 Thread Peter Eisentraut

On 2020-01-30 19:41, Tom Lane wrote:
> The code looks fine (and a big +1 for not having knowledge of
> DEFAULT_PGSOCKET_DIR wired into UNIXSOCK_PATH).  I wonder though
> whether any user-facing documentation needs to be adjusted.

There are no user-facing changes in this patch yet.  That will come with
subsequent patches.

This patch has now been committed.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: standby apply lag on inactive servers

2020-01-31 Thread Fujii Masao




On 2020/01/31 23:47, Alvaro Herrera wrote:
> On 2020-Jan-31, Fujii Masao wrote:
> > On 2020/01/31 22:40, Alvaro Herrera wrote:
> > > On 2020-Jan-31, Fujii Masao wrote:
> > >
> > > > You're thinking to apply this change to the back branches? Sorry
> > > > if my understanding is not right. But I don't think that back-patching
> > > > is OK because it changes the documented existing behavior
> > > > of pg_last_xact_replay_timestamp(). So it looks like a behavior
> > > > change, not a bug fix.
> > >
> > > Yeah, I am thinking of backpatching it.  The documented behavior is
> > > already not what the code does.
> >
> > Maybe you thought this because getRecordTimestamp() extracts the
> > timestamp even from the WAL record of a restore point? That is, you're
> > concerned that pg_last_xact_replay_timestamp() returns the timestamp
> > of not only commit/abort records but also restore point ones. Right?
>
> right.
>
> > As far as I read the code, this problem doesn't occur because
> > SetLatestXTime() is called only for commit/abort records, in
> > recoveryStopsAfter(). No?
>
> ... uh, wow, you're right about that too.  IMO this is extremely
> fragile, easy to break, and under-documented.

Yeah, it's worth improving the code.

> But you're right, there's no bug there at present.
>
> > > Do you have a situation where this change would break something?
> > > If so, can you please explain what it is?
> >
> > For example, use the return value of pg_last_xact_replay_timestamp()
> > (and also the timestamp in the log message output at the end of
> > recovery) as a HINT when setting recovery_target_time later.
>
> Hmm.
>
> I'm not sure how you would use it in that way.  I mean, I understand how
> it *can* be used that way, but it seems too fragile to be done in
> practice, in a scenario that's not just laboratory games.
>
> > Use it to compare with the timestamp retrieved from the master server,
> > in order to monitor the replication delay.
>
> That's precisely the use case that I'm aiming at.  The timestamp
> currently is not useful because this usage breaks when the primary is
> inactive (no COMMIT records occur).  During such periods of inactivity,
> CHECKPOINT records would keep the "last xtime" current.  This has
> actually happened in a production setting, it's not a thought
> experiment.

I've heard that someone periodically generates tiny dummy transactions
(say, every minute) as a band-aid solution to avoid an inactive primary.
Of course, this is not a perfect solution.

The idea that I proposed previously was to introduce
pg_last_xact_insert_timestamp() [1] into core. This function returns
the timestamp of commit / abort records on the *primary* side.
So we can retrieve that timestamp from the primary (e.g., by using dblink)
and compare its result with pg_last_xact_replay_timestamp() to
calculate the delay on the standby.
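
As a sketch of that idea (pg_last_xact_insert_timestamp() does not exist
yet; the connection string is made up, and the dblink extension is assumed):

-- Run on the standby: fetch the proposed function's result from the
-- primary via dblink and compare it with the local replay timestamp.
SELECT primary_ts - pg_last_xact_replay_timestamp() AS replication_delay
FROM dblink('host=primary.example.com dbname=postgres',
            'SELECT pg_last_xact_insert_timestamp()')
       AS t(primary_ts timestamptz);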

Another idea is to include the commit / abort timestamp in the
primary-keepalive-message that is periodically sent from the primary
to the standby. Then, if we introduce a function returning
that timestamp on the standby side, we can easily compare
the commit / abort timestamps taken from both the primary and
the standby, on the standby.

[1] 
https://www.postgresql.org/message-id/cahgqgwf3zjfunej5ka683ku5rqubtswtqfq7g1x0g34o+jx...@mail.gmail.com

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Use compiler intrinsics for bit ops in hash

2020-01-31 Thread David Fetter
On Wed, Jan 15, 2020 at 03:45:12PM -0800, Jesse Zhang wrote:
> On Tue, Jan 14, 2020 at 2:09 PM David Fetter  wrote:
> > > The changes in hash AM and SIMPLEHASH do look like a net positive
> > > improvement. My biggest cringe might be in pg_bitutils:
> > >
> > > 1. Is ceil_log2_64 dead code?
> >
> > Let's call it nascent code. I suspect there are places it could go, if
> > I look for them.  Also, it seemed silly to have one without the other.
> >
> 
> While not absolutely required, I'd like us to find at least one
> place and start using it. (Clang also nags at me when we have
> unused functions).

Done in the expanded patches attached.

> > On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote:
> > > 4. It seems like you *really* would like an operation like LZCNT in x86
> > > (first appearing in Haswell) that is well defined on zero input. ISTM
> > > the alternatives are:
> > >
> > >a) Special case 1. That seems straightforward, but the branching cost
> > >on a seemingly unlikely condition seems to be a lot of performance
> > >loss
> > >
> > >b) Use architecture specific intrinsic (and possibly with CPUID
> > >shenanigans) like __builtin_ia32_lzcnt_u64 on x86 and use the CLZ
> > >intrinsic elsewhere. The CLZ GCC intrinsic seems to map to
> > >instructions that are well defined on zero in most ISA's other than
> > >x86, so maybe we can get away with special-casing x86?
> 
> i. We can detect LZCNT instruction by checking one of the
> "extended feature" (EAX=8001) bits using CPUID. Unlike the
> "basic features" (EAX=1), extended feature flags have been more
> vendor-specific, but fortunately it seems that the feature bit
> for LZCNT is the same [1][2].
> 
> ii. We'll most likely still need to provide a fallback
> implementation for processors that don't have LZCNT (either
> because they are from a different vendor, or an older Intel/AMD
> processor). I wonder if simply checking for 1 is "good enough".
> Maybe a micro benchmark is in order?

I'm not sure how I'd run one on the architectures we support.  What
I've done here is generalize our implementation to be basically like
LZCNT and TZCNT at the cost of a brief branch that might go away at
runtime.
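
A sketch of that approach (ceil_log2_64 is the helper discussed above; the
exact shape here is mine, assuming GCC/Clang builtins):

/* __builtin_clzll() is undefined on 0; LZCNT defines clz(0) == 64.
 * One predictable branch gives LZCNT semantics everywhere. */
static inline int
clz64_zero_safe(unsigned long long word)
{
	return word == 0 ? 64 : __builtin_clzll(word);
}

/* ceil(log2(x)) for x >= 1, built on the zero-safe count. */
static inline int
ceil_log2_64(unsigned long long x)
{
	return x <= 1 ? 0 : 64 - clz64_zero_safe(x - 1);
}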

> 2. (My least favorite) use inline asm (a la our popcount
> implementation).

Yeah, I'd like to fix that, but I kept the scope of this one
relatively narrow.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
From 5fcaa74146206e4de05ca8cbd863aca20bba94bf Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Wed, 29 Jan 2020 02:09:59 -0800
Subject: [PATCH v4 1/2] de-long-ify
To: hackers
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2.24.1"

This is a multi-part message in MIME format.
--2.24.1
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit


diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4b562d8d3f..482a569814 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -34,11 +34,11 @@ static void gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer,
 IndexTuple item);
 static void gistGetItupFromPage(GISTNodeBufferPage *pageBuffer,
 IndexTuple *item);
-static long gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb);
-static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum);
+static uint64 gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb);
+static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, uint64 blocknum);
 
-static void ReadTempFileBlock(BufFile *file, long blknum, void *ptr);
-static void WriteTempFileBlock(BufFile *file, long blknum, void *ptr);
+static void ReadTempFileBlock(BufFile *file, uint64 blknum, void *ptr);
+static void WriteTempFileBlock(BufFile *file, uint64 blknum, void *ptr);
 
 
 /*
@@ -64,7 +64,7 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
 	/* Initialize free page management. */
 	gfbb->nFreeBlocks = 0;
 	gfbb->freeBlocksLen = 32;
-	gfbb->freeBlocks = (long *) palloc(gfbb->freeBlocksLen * sizeof(long));
+	gfbb->freeBlocks = (int64 *) palloc(gfbb->freeBlocksLen * sizeof(int64));
 
 	/*
 	 * Current memory context will be used for all in-memory data structures
@@ -469,7 +469,7 @@ gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
 /*
  * Select a currently unused block for writing to.
  */
-static long
+static uint64
 gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb)
 {
 	/*
@@ -487,7 +487,7 @@ gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb)
  * Return a block# to the freelist.
  */
 static void
-gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum)
+gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, uint64 blocknum)
 {
 	int			ndx;
 
@@ -495,9 +495,9 @@ gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum)
 	

Re: Marking some contrib modules as trusted extensions

2020-01-31 Thread Tom Lane
Dean Rasheed  writes:
> On Wed, 29 Jan 2020 at 21:39, Tom Lane  wrote:
>> The bigger picture here is that I don't want to get push-back that
>> we've broken somebody's security posture by marking too many extensions
>> trusted.  So for anything where there's any question about security
>> implications, we should err in the conservative direction of leaving
>> it untrusted.

> I wonder if the same could be said about pgrowlocks.

Good point.  I had figured it was probably OK given that it's
analogous to the pg_locks view (which is unrestricted AFAIR),
and that it already has some restrictions on what you can see.
I'd have no hesitation about dropping it off this list though,
since it's probably not used that much and it could also be seen
as exposing internals.

regards, tom lane




Re: standby apply lag on inactive servers

2020-01-31 Thread Alvaro Herrera
On 2020-Jan-31, Fujii Masao wrote:
> On 2020/01/31 22:40, Alvaro Herrera wrote:
> > On 2020-Jan-31, Fujii Masao wrote:
> > 
> > > You're thinking to apply this change to the back branches? Sorry
> > > if my understanding is not right. But I don't think that back-patching
> > > is OK because it changes the documented existing behavior
> > > of pg_last_xact_replay_timestamp(). So it looks like a behavior
> > > change, not a bug fix.
> > 
> > Yeah, I am thinking of backpatching it.  The documented behavior is
> > already not what the code does.
> 
> Maybe you thought this because getRecordTimestamp() extracts the
> timestamp even from the WAL record of a restore point? That is, you're
> concerned that pg_last_xact_replay_timestamp() returns the timestamp
> of not only commit/abort records but also restore point ones. Right?

right.

> As far as I read the code, this problem doesn't occur because
> SetLatestXTime() is called only for commit/abort records, in
> recoveryStopsAfter(). No?

... uh, wow, you're right about that too.  IMO this is extremely
fragile, easy to break, and under-documented.  But you're right, there's
no bug there at present.

> >  Do you have a situation where this
> > change would break something?  If so, can you please explain what it is?
> 
> For example, use the return value of pg_last_xact_replay_timestamp()
> (and also the timestamp in the log message output at the end of
> recovery) as a HINT when setting recovery_target_time later.

Hmm.

I'm not sure how you would use it in that way.  I mean, I understand how
it *can* be used that way, but it seems too fragile to be done in
practice, in a scenario that's not just laboratory games.

> Use it to compare with the timestamp retrieved from the master server,
> in order to monitor the replication delay.

That's precisely the use case that I'm aiming at.  The timestamp
currently is not useful because this usage breaks when the primary is
inactive (no COMMIT records occur).  During such periods of inactivity,
CHECKPOINT records would keep the "last xtime" current.  This has
actually happened in a production setting, it's not a thought
experiment.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: standby apply lag on inactive servers

2020-01-31 Thread Fujii Masao




On 2020/01/31 22:40, Alvaro Herrera wrote:
> On 2020-Jan-31, Fujii Masao wrote:
> 
> > You're thinking to apply this change to the back branches? Sorry
> > if my understanding is not right. But I don't think that back-patching
> > is OK because it changes the documented existing behavior
> > of pg_last_xact_replay_timestamp(). So it looks like a behavior
> > change, not a bug fix.
> 
> Yeah, I am thinking of backpatching it.  The documented behavior is
> already not what the code does.

Maybe you thought this because getRecordTimestamp() extracts the
timestamp even from the WAL record of a restore point? That is, you're
concerned that pg_last_xact_replay_timestamp() returns the timestamp
of not only commit/abort records but also restore point ones. Right?

As far as I read the code, this problem doesn't occur because
SetLatestXTime() is called only for commit/abort records, in
recoveryStopsAfter(). No?

> Do you have a situation where this
> change would break something?  If so, can you please explain what it is?

For example, use the return value of pg_last_xact_replay_timestamp()
(and also the timestamp in the log message output at the end of
recovery) as a HINT when setting recovery_target_time later.
Or use it to compare with the timestamp retrieved from the master server,
in order to monitor the replication delay.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-31 Thread Hamid Akhtar
On Thu, Jan 30, 2020 at 8:28 AM Craig Ringer  wrote:

> On Thu, 30 Jan 2020 at 02:04, Hamid Akhtar  wrote:
> >
> > So, having seen the feedback on this thread, and I tend to agree with
> > most of what has been said here, I also agree that the server core isn't
> > really the ideal place to handle orphaned prepared transactions.
> >
> > Ideally, these should be handled by a transaction manager. However, I do
> > believe that we cannot let the database suffer for the failings of
> > external software, and we made a similar change through the introduction
> > of the idle-in-transaction timeout behavior.
>
> The difference, IMO, is that idle-in-transaction aborts don't affect
> anything we've promised to be durable.
>
> Once you PREPARE TRANSACTION the DB has made a promise that that txn
> is durable. We don't have any consistent feedback channel to back to
> applications and say "Hey, if you're not going to finish this up we
> need to get rid of it soon, ok?". If a distributed transaction manager
> gets consensus for commit and goes to COMMIT PREPARED a previously
> prepared txn only to find that it has vanished, that's a major
> problem, and one that may bring the entire DTM to a halt until the
> admin can intervene.
>
> This isn't like idle-in-transaction aborts. It's closer to something
> like uncommitting a previously committed transaction.
>
> I do think it'd make sense to ensure that the documentation clearly
> highlights the impact of abandoned prepared xacts on server resource
> retention and performance, preferably with pointers to appropriate
> views. I haven't reviewed the docs to see how clear that is already.
>

Having seen the documentation, IMHO the document does contain enough
information for users to understand what issues can be caused by these
orphaned prepared transactions.


>
> I can also see an argument for a periodic log message (maybe from
> vacuum?) warning when old prepared xacts hold xmin down. Including one
> sent to the client application when an explicit VACUUM is executed.
> (In fact, it'd make sense to generalise that for all xmin-retention).
>

I think that opens up the debate on what we really mean by "old", and
whether that requires a syntax change when creating prepared
transactions, as Thomas Kellerer suggested earlier.

I agree that vacuum should periodically throw warnings for any prepared
xacts that are holding xmin down.

Generalising it for all xmin-retention is a fair idea IMHO, though that
does increase the overall scope here. A vacuum process should (ideally)
periodically emit warnings for anything that is preventing it (including
orphaned prepared transactions) from doing its routine work, so that
somebody can take the necessary actions.


> But I'm really not a fan of aborting such txns. If you operate with
> some kind of broken global transaction manager that can forget or
> abandon prepared xacts, then fix it, or adopt site-local periodic
> cleanup tasks that understand your site's needs.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  2ndQuadrant - PostgreSQL Solutions for the Enterprise
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: Missing break in RelationFindReplTupleSeq

2020-01-31 Thread Alvaro Herrera
On 2020-Jan-31, Konstantin Knizhnik wrote:

> We eventually found out that logical replication in the current version of
> Postgres works significantly more slowly on a table with replica identity
> full than the old pglogical implementation did.
> 
> The comment to RelationFindReplTupleSeq says:
> 
>     Note that this stops on the first matching tuple.
> 
> But actually this function continues the traversal until the end of the
> table even if the tuple was found.
> I wonder if a break should be added at the end of the for loop.

Wow, you're right, and the "break" is missing there.  I propose it like
this.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 582b0cb017..30cba89da7 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -327,6 +327,9 @@ retry:
 			XactLockTableWait(xwait, NULL, NULL, XLTW_None);
 			goto retry;
 		}
+
+		/* Found our tuple and it's not locked */
+		break;
 	}
 
 	/* Found tuple, try to lock it in the lockmode. */


Missing break in RelationFindReplTupleSeq

2020-01-31 Thread Konstantin Knizhnik
Eventually we found out that logical replication in the current version
of Postgres works significantly slower on a table with replica identity
full than the old pglogical implementation.


The comment to RelationFindReplTupleSeq says:

    Note that this stops on the first matching tuple.

But actually this function continues the traversal until the end of the
table even if the tuple was found.

I wonder if a break; should be added at the end of the for loop.
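
For context, this slow path is taken on the subscriber for every UPDATE or
DELETE arriving for a table configured like this (a minimal sketch):

CREATE TABLE t (a int, b text);
ALTER TABLE t REPLICA IDENTITY FULL;  -- no key, so tuples are located by scanning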

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: standby apply lag on inactive servers

2020-01-31 Thread Alvaro Herrera
On 2020-Jan-31, Fujii Masao wrote:

> Are you thinking of applying this change to the back branches? Sorry
> if my understanding is not right, but I don't think that back-patching
> is OK, because it changes the documented existing behavior
> of pg_last_xact_replay_timestamp(). So it looks like a behavior
> change, not a bug fix.

Yeah, I am thinking of backpatching it.  The documented behavior is
already not what the code does.  Do you have a situation where this
change would break something?  If so, can you please explain what it is?

I think (and I said it upthread) a 100% complete fix involves tracking
two timestamps rather than one.  I was thinking that that would be too
invasive because it changes XLogCtlData shmem struct ... but that struct
is private to xlog.c, so I think it's fine to change the struct.  The
problem though is that the user-visible change that I want to achieve is
in pg_last_xact_replay_timestamp(), and it would be obviously wrong to use
the new XLogCtlData field rather than the existing one, as that would be
a behavior change in the same sense that you're now complaining about.
So I would achieve nothing.
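
For reference, the user-visible behavior under discussion is what a
monitoring query like this reports on a standby (a sketch):

SELECT pg_last_xact_replay_timestamp(),
       now() - pg_last_xact_replay_timestamp() AS replay_delay;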

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-31 Thread Juan José Santamaría Flecha
On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-01-28 16:47, Juan José Santamaría Flecha wrote:
> > This patch targets to do something symmetrical to to_char(), which will
> > just return a single value.
>
> I didn't fully realize while reading this thread that to_char() already
> supports localized output and this patch indeed just wants to do the
> opposite.
>
> So I'm withdrawing my concerns with respect to this patch.  As long as
> it can do a roundtrip conversion with to_char(), it's fine.
>
>
We can avoid issues with languages whose case conversions are not
injective by doing a double conversion, so that both strings in the
comparison end up in the same state.

I propose an upper/lower conversion as in the attached patch.
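
To illustrate the intended roundtrip (a sketch; the locale name is only an
example and must be installed on the system):

SET lc_time = 'es_ES.UTF-8';
SELECT to_char(date '2020-01-31', 'TMMonth');           -- Enero
SELECT to_date('ENERO 31, 2020', 'TMMonth DD, YYYY');   -- accepted with this patch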

Regards,

Juan José Santamaría Flecha
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ceda48e..b1951e5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');


 TM prefix
-translation mode (print localized day and month names based on
+translation mode (use localized day and month names based on
  )
 TMMonth

@@ -6000,8 +6000,13 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
TM does not include trailing blanks.
+  
+ 
+
+ 
+  
to_timestamp and to_date ignore
-   the TM modifier.
+   the case when receiving names as an input.
   
  
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f58331d..e5b4eb5 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1059,9 +1059,11 @@ static int	from_char_parse_int_len(int *dest, const char **src, const int len,
 	FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, const char **src, FormatNode *node,
 bool *have_error);
-static int	seq_search(const char *name, const char *const *array, int *len);
+static int	seq_search_ascii(const char *name, const char *const *array, int *len);
+static int	seq_search_localized(const char *name, char **array, int *len);
 static int	from_char_seq_search(int *dest, const char **src,
  const char *const *array,
+ char **localized_array,
  FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
 			struct pg_tm *tm, fsec_t *fsec, int *fprec,
@@ -2459,7 +2461,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
 	unsigned char firstc;
 	const char *const *a;
@@ -2505,8 +2507,74 @@ seq_search(const char *name, const char *const *array, int *len)
 }
 
 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * downcasing transformation to achieve case-insensitivity.
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by pg_locale.c aren't const.
+ */
+static int
+seq_search_localized(const char *name, char **array, int *len)
+{
+	char	  **a;
+	char	   *lower_name;
+	char	   *upper_name;
+
+	*len = 0;
+
+	/* empty string can't match anything */
+	if (!*name)
+		return -1;
+
+	/*
+	 * We do an upper/lower conversion to avoid problems with languages
+	 * in which case conversions are not injective.
+	 */
+	upper_name = str_toupper(unconstify(char *, name), strlen(name),
+			 DEFAULT_COLLATION_OID);
+	lower_name = str_tolower(upper_name, strlen(upper_name),
+			 DEFAULT_COLLATION_OID);
+	pfree(upper_name);
+
+	for (a = array; *a != NULL; a++)
+	{
+		char	   *lower_element;
+		char	   *upper_element;
+		int			element_len;
+
+		/* Upper/lower-case array element, assuming it is normalized */
+		upper_element = str_toupper(*a, strlen(*a), DEFAULT_COLLATION_OID);
+		lower_element = str_tolower(upper_element, strlen(upper_element),
+	DEFAULT_COLLATION_OID);
+		pfree(upper_element);
+		element_len = strlen(lower_element);
+
+		/* Match? */
+		if (strncmp(lower_name, lower_element, element_len) == 0)
+		{
+			*len = element_len;
+			pfree(lower_element);
+			pfree(lower_name);
+			return a - array;
+		}
+		pfree(lower_element);
+	}
+
+	pfree(lower_name);
+	return -1;
+}
+
+/*
+ * Perform a sequential search in 'array' (or 'localized_array', if that's
+ * not NULL) for an entry matching the first character(s) of the 'src'
+ * string case-insensitively.
+ *
+ 

Re: pg_restore crash when there is a failure before all child process is created

2020-01-31 Thread ahsan hadi
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   tested, passed
Documentation:not tested

I have applied and tested both patches separately and ran the regression
suite with both. No new test cases are failing with either patch.

The issue is fixed by both patches; however, the fix from Tom
(0001-fix-worker-status.patch) looks more elegant. I haven't done a detailed
code review.

[PATCH] Erase the distinctClause if the result is unique by definition

2020-01-31 Thread Andy Fan
Hi:

I wrote a patch to erase the distinctClause if the result is unique by
definition.  I noticed this because a user switched code from Oracle
to PG and found the performance was bad due to this, so I adapted PG
for this as well.

This patch doesn't help a well-written query, but such a drawback in a
query may not be very obvious, and since the cost of the check is pretty
low, I think it would be OK to add it.
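
A minimal example of the target case (a sketch: since "a" is the primary
key, every row of the result is already unique and the Unique/aggregation
step for DISTINCT is wasted work):

CREATE TABLE t (a int PRIMARY KEY, b int);
SELECT DISTINCT a, b FROM t;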

Please see the patch for details.

Thank you.


0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch
Description: Binary data


Re: table partitioning and access privileges

2020-01-31 Thread Fujii Masao




On 2020/01/31 13:38, Amit Langote wrote:

On Fri, Jan 31, 2020 at 1:28 AM Fujii Masao  wrote:

On 2020/01/31 1:02, Tom Lane wrote:

Fujii Masao  writes:

Thanks for updating the patch! Barring any objection,
I will commit this fix and backport it to all supported versions.


Sorry for not having paid closer attention to this thread, but ...
is back-patching this behavioral change really a good idea?

It's not that hard to imagine that somebody is expecting the old
behavior and will complain that we broke their application's security.
So I'd have thought it better to fix only in HEAD, with a
compatibility warning in the v13 release notes.

I'm afraid it's much more likely that people will complain about
making such a change in a minor release than that they will be
happy about it.  It's particularly risky to be making it in what
will be the last 9.4.x release, because we will not have any
opportunity to undo it in that branch if there is pushback.


Fair enough. I finally did back-patch it because the behavior is clearly
documented and I heard no opinions objecting to the back-patch.
But I should have solicited and discussed such risks more.

I'm OK with reverting all those back-patches. Instead, the documentation
should probably be updated in the old branches.


I could find only this paragraph in the section on inheritance that
talks about how access permissions work:

9.4:

"Note how table access permissions are handled. Querying a parent
table can automatically access data in child tables without further
access privilege checking. This preserves the appearance that the data
is (also) in the parent table. Accessing the child tables directly is,
however, not automatically allowed and would require further
privileges to be granted."

9.5-12:

"Inherited queries perform access permission checks on the parent
table only. Thus, for example, granting UPDATE permission on the
cities table implies permission to update rows in the capitals table
as well, when they are accessed through cities. This preserves the
appearance that the data is (also) in the parent table. But the
capitals table could not be updated directly without an additional
grant. In a similar way, the parent table's row security policies (see
Section 5.7) are applied to rows coming from child tables during an
inherited query. A child table's policies, if any, are applied only
when it is the table explicitly named in the query; and in that case,
any policies attached to its parent(s) are ignored."

Do you mean that the TRUNCATE exception should be noted here?


Yes, that's what I was thinking.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: standby apply lag on inactive servers

2020-01-31 Thread Fujii Masao




On 2020/01/31 5:45, Alvaro Herrera wrote:

On 2020-Jan-30, Kyotaro Horiguchi wrote:


Agreed about backbranches. I'd like to preserve the word "transaction",
as it is more familiar to users. How about something like the following?

"transactions are completed up to log time %s"


That's a good point.  I used the phrase "transaction activity", which
seems sufficiently explicit to me.

So, the attached is the one for master; in back branches I would use the
same (plus minor conflict fixes), except that I would drop the message
wording changes.


Are you thinking of applying this change to the back branches? Sorry
if my understanding is not right, but I don't think that back-patching
is OK, because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like a behavior
change, not a bug fix.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




PATCH: Fix wrong size argument to pg_strncasecmp

2020-01-31 Thread Dominik Czarnota
Hello,

Please find a one-liner patch in the attachment.

This patch fixes a size parameter of `pg_strncasecmp` which compared a
"string" literal with a variable by passing a size of 5 while the "string"
literal has 6 bytes.

This issue can be observed with the following query (where 'X' is any
character other than 'g' and null byte):

select json_to_tsvector('"abc"'::json, '"strinX"')

Before this patch this query returns the `'abc':1` result instead of
failing with the following error:

wrong flag in flag array: "strinX"
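
With the fix in place, only the exact flag spellings are accepted (a quick
check; the valid flags are "all", "string", "numeric", "boolean" and "key"):

select json_to_tsvector('"abc"'::json, '"string"');   -- 'abc':1
select json_to_tsvector('"abc"'::json, '"strinX"');   -- now raises the error above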

By the way, the `strncasecmp` usages around the fixed line could use
`strcasecmp` which doesn't accept the `size_t n` argument.

---
Regards,
Disconnect3d


0001-Fix-wrong-size-argument-to-pg_strncasecmp.patch
Description: Binary data


Re: Data race in interfaces/libpq/fe-exec.c

2020-01-31 Thread Mark Charsley
According to folks significantly cleverer than me, this can be a problem:
See section 2.4 in
https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

tl;dr in a self-fulfilling prophecy kind of way, there are no benign
data-races. So the compiler can assume no-one would write a data race.
Therefore it can make aggressive optimisations that render what would
otherwise have been a benign race actively dangerous.

Granted the danger here is mainly theoretical, and the main problem for me
is that turning off ThreadSanitizer because of this issue means that other
more dangerous issues in my code (rather than the postgres client code)
won't be found. But the above is the reason why ThreadSanitizer folks don't
want to put in any "you can ignore this race, it's benign" functionality,
and told me that the right thing to do was to contact you folks and get a
fix in upstream...

Mark

On Thu, Jan 30, 2020 at 4:46 PM Tom Lane  wrote:

> Mark Charsley  writes:
> > This line
> >
> https://github.com/postgres/postgres/blob/30012a04a6c8127397a8ab71e160d9c7e7fbe874/src/interfaces/libpq/fe-exec.c#L1073
> > triggers data race errors when run under ThreadSanitizer (*)
>
> > As far as I can tell, the static variable in question is a hack to allow
> a
> > couple of deprecated functions that are already unsafe to use
> > (PQescapeString and PQescapeBytea) to be fractionally less unsafe to use.
>
> Yup.
>
> > Would there be any interest in a patch changing the type of
> > static_client_coding
> > and static_std_strings
> > <
> https://github.com/postgres/postgres/blob/30012a04a6c8127397a8ab71e160d9c7e7fbe874/src/interfaces/libpq/fe-exec.c#L49
> >
> > to
> > some atomic equivalent, so the data race goes away?
>
> I don't see that making those be some other datatype would improve anything
> usefully.  (1) On just about every platform known to man, int and bool are
> going to be atomic anyway.  (2) The *actual* hazards here, as opposed to
> theoretical ones, are that you're using more than one connection with
> different settings for these values, whereupon it's not clear whether
> those deprecated functions will see the appropriate settings when they're
> used.  A different data type won't help that.
>
> In short: this warning you're getting from ThreadSanitizer is entirely
> off-point, so contorting the code to suppress it seems useless.
>
> regards, tom lane
>


Re: allow online change primary_conninfo

2020-01-31 Thread Sergei Kornilov
Hello
Small rebase due to merge conflict of the tests. No functional changes since v7.
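
For anyone trying the patch, the behavior it enables can be exercised like
this (a sketch; the connection string is a placeholder):

ALTER SYSTEM SET primary_conninfo = 'host=new-primary user=replicator';
SELECT pg_reload_conf();  -- walreceiver is restarted with the new setting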

PS: also it is the end of the current CF, so I will mark the patch entry as
moved to the next CF.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2e2af9e96e..7887391bbb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4005,9 +4005,15 @@ ANY num_sync ( 
@@ -4022,9 +4028,13 @@ ANY num_sync ( ).
-  This parameter can only be set at server start.
+  This parameter can only be set in the postgresql.conf
+  file or on the server command line.
   This setting has no effect if primary_conninfo is not
-  set.
+  set or the server is not in standby mode.
+ 
+ 
+  The WAL receiver is restarted after an update of primary_slot_name.
  
 

@@ -4142,7 +4152,11 @@ ANY num_sync ( ).
 The default is on.  The only reason to turn this off would be if the
 remote instance is currently out of available replication slots.  This
-parameter can only be set at server start.
+parameter can only be set in the postgresql.conf
+file or on the server command line.
+   
+   
+The WAL receiver is restarted after an update of wal_receiver_create_temp_slot.

   
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 408b9b489a..c24b93332e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -803,6 +803,12 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool lastSourceFailed = false;
 
+/*
+ * Indicates that the running WAL receiver should be restarted due to a
+ * configuration change.  Only used with the XLOG_FROM_STREAM source.
+ */
+static bool pendingWalRcvRestart = false;
+
 typedef struct XLogPageReadPrivate
 {
 	int			emode;
@@ -11831,6 +11837,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	for (;;)
 	{
 		int			oldSource = currentSource;
+		bool		startWalReceiver = false;
 
 		/*
 		 * First check if we failed to read from the current source, and
@@ -11864,54 +11871,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	if (!StandbyMode)
 		return false;
 
-	/*
-	 * If primary_conninfo is set, launch walreceiver to try
-	 * to stream the missing WAL.
-	 *
-	 * If fetching_ckpt is true, RecPtr points to the initial
-	 * checkpoint location. In that case, we use RedoStartLSN
-	 * as the streaming start position instead of RecPtr, so
-	 * that when we later jump backwards to start redo at
-	 * RedoStartLSN, we will have the logs streamed already.
-	 */
-	if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
-	{
-		XLogRecPtr	ptr;
-		TimeLineID	tli;
-
-		if (fetching_ckpt)
-		{
-			ptr = RedoStartLSN;
-			tli = ControlFile->checkPointCopy.ThisTimeLineID;
-		}
-		else
-		{
-			ptr = RecPtr;
-
-			/*
-			 * Use the record begin position to determine the
-			 * TLI, rather than the position we're reading.
-			 */
-			tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
-			if (curFileTLI > 0 && tli < curFileTLI)
-elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
-	 (uint32) (tliRecPtr >> 32),
-	 (uint32) tliRecPtr,
-	 tli, curFileTLI);
-		}
-		curFileTLI = tli;
-		RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-			 PrimarySlotName);
-		receivedUpto = 0;
-	}
-
 	/*
 	 * Move to XLOG_FROM_STREAM state in either case. We'll
 	 * get immediate failure if we didn't launch walreceiver,
 	 * and move on to the next state.
 	 */
 	currentSource = XLOG_FROM_STREAM;
+	startWalReceiver = true;
 	break;
 
 case XLOG_FROM_STREAM:
@@ -12057,7 +12023,69 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	Assert(StandbyMode);
 
 	/*
-	 * Check if WAL receiver is still active.
+	 * Shut down the WAL receiver if a restart is requested.
+	 */
+	if (!startWalReceiver && pendingWalRcvRestart)
+	{
+		if (WalRcvRunning())
+			ShutdownWalRcv();
+
+		/*
+		 * Re-scan for possible new timelines if we were
+		 * requested to recover to the latest timeline.
+		 */
+		if (recoveryTargetTimeLineGoal ==
+			RECOVERY_TARGET_TIMELINE_LATEST)
+			rescanLatestTimeLine();
+
+		startWalReceiver = true;
+	}
+	pendingWalRcvRestart = false;
+
+	/*
+	 * Launch walreceiver if needed.
+	 *
+	 * If fetching_ckpt is true, RecPtr points to the initial
+	 * checkpoint location. In that case, we use RedoStartLSN
+	 * as the streaming start position instead of RecPtr, so
+	 * that when we later jump backwards to start redo 

Re: Duplicated LSN in ReorderBuffer

2020-01-31 Thread Amit Kapila
On Fri, Jan 31, 2020 at 12:35 AM Alvaro Herrera
 wrote:
>
> On 2019-Jul-26, Andres Freund wrote:
>
> > Petr, Simon, see the potential issue related to fast forward at the
> > bottom.
>
> I think we neglected this bit.  I looked at the patch Simon submitted
> downthread, and while I vaguely understand that we need to process
> NEW_CID records during fast-forwarding,
>

Right, IIUC, that is mainly to mark the txn as having catalog changes
(ReorderBufferXidSetCatalogChanges) so that later such a transaction
can be used to build historic snapshots (during SnapBuildCommitTxn).
Now, if that is true, then won't we need a similar change in
DecodeHeapOp for the XLOG_HEAP_INPLACE case as well?  Also, I am not
sure if SnapBuildProcessNewCid needs to create the Cid map in such a case.


> I don't quite understand why we
> still can skip XLOG_INVALIDATION messages.  I *think* we should process
> those too.
>

I also think so.  If you are allowing invalidations to be executed
irrespective of fast_forward in DecodeStandbyOp, then why not do the
same in DecodeCommit, where we call ReorderBufferAddInvalidations?

>   Here's a patch that also contains that change; I also
> reworded Simon's proposed comment.  I appreciate reviews.
>

Even though, in theory, the changes look to be in the right direction,
it would be better if we could create a test case to reproduce the
problem.  I am not sure, but I think we need to generate a few DDLs
in the transaction over which we want to fast forward; then, after
moving forward, the DMLs dependent on that WAL should run into some
problem, as we have skipped executing the invalidations and skipped
adding such a transaction to the historic snapshot.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Physical replication slot advance is not persistent

2020-01-31 Thread Alexey Kondratov

On 30.01.2020 05:19, Michael Paquier wrote:

On Wed, Jan 29, 2020 at 05:10:20PM +0900, Kyotaro Horiguchi wrote:

Looks perfect.

Thanks Horiguchi-san and others.  Applied and back-patched down to
11.


Great! Thanks for getting this done.

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: pg_restore crash when there is a failure before all child process is created

2020-01-31 Thread vignesh C
On Fri, Jan 31, 2020 at 1:09 AM Tom Lane  wrote:
>
> vignesh C  writes:
> > On Wed, Jan 29, 2020 at 6:54 PM Ahsan Hadi  wrote:
> >> Can you share a test case or steps that you are using to reproduce this 
> >> issue? Are you reproducing this using a debugger?
>
> > I could reproduce with the following steps:
> > Make cluster setup.
> > Create few tables.
> > Take a dump in directory format using pg_dump.
> > Restore the dump generated above using pg_restore with very high
> > number for --jobs options around 600.
>
> I agree this is quite broken.  Another way to observe the crash is
> to make the fork() call randomly fail, as per booby-trap-fork.patch
> below (not intended for commit, obviously).
>
> I don't especially like the proposed patch, though, as it introduces
> a great deal of confusion into what ParallelState.numWorkers means.
> I think it's better to leave that as being the allocated array size,
> and instead clean up all the fuzzy thinking about whether workers
> are actually running or not.  Like 0001-fix-worker-status.patch below.
>

The patch looks fine to me. The test is also getting fixed by the patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-01-31 Thread Bernd Helmle
Am Freitag, den 31.01.2020, 13:53 +0900 schrieb Michael Paquier:
> Indeed, with a bad timing and a crash in the middle of
> write_relcache_init_file(), it could be possible to have such a file
> left around in the data folder.  Having a past tablespace version
> left
> around after an upgrade is a pilot error in my opinion because
> pg_upgrade generates a script to cleanup past tablespaces, no?

I'm surprised: why should that be a problem in copy mode? For me this is
a fair use case for testing upgrades, e.g. for development purposes, if
someone wants to still run application tests against the current old
version, for fallback and whatever. And people might not want to treat
such upgrades as a "fire-and-forget" task. We even have the --clone
feature now, making this even faster.

If our project policy is to never ever touch a pg_upgrade'd PostgreSQL
instance again in copy mode, I wasn't aware of it.

And to be honest, PostgreSQL itself allows you to reuse tablespace
locations for multiple instances as well, so the described problem
can exist not only in upgraded clusters.


Bernd






Re: unsupportable composite type partition keys

2020-01-31 Thread Julien Rouhaud
On Sun, Dec 22, 2019 at 10:51 PM Tom Lane  wrote:
>
> I wrote:
> > Now as far as point 1 goes, I think it's not really that awful to use
> > CheckAttributeType() with a dummy attribute name.  The attached
> > incomplete patch uses "partition key" which causes it to emit errors
> > like
> > regression=# create table fool (a int, b int) partition by list ((row(a, 
> > b)));
> > ERROR:  column "partition key" has pseudo-type record
> > I don't think that that's unacceptable.  But if we wanted to improve it,
> > we could imagine adding another flag, say CHKATYPE_IS_PARTITION_KEY,
> > that doesn't affect CheckAttributeType's semantics, just the wording of
> > the error messages it throws.
>
> Here's a fleshed-out patch that does it like that.
>
> While poking at this, I also started to wonder why CheckAttributeType
> wasn't recursing into ranges, since those are our other kind of
> container type.  And the answer is that it must, because we allow
> creation of ranges over composite types:
>
> regression=# create table foo (f1 int, f2 int);
> CREATE TABLE
> regression=# create type foorange as range (subtype = foo);
> CREATE TYPE
> regression=# alter table foo add column r foorange;
> ALTER TABLE
>
> Simple things still work on table foo, but surely this is exactly
> what CheckAttributeType is supposed to be preventing.  With the
> second attached patch you get
>
> regression=# alter table foo add column r foorange;
> ERROR:  composite type foo cannot be made a member of itself
>
> The second patch needs to go back all the way, the first one
> only as far as we have partitions.

While working on regression tests for index collation versioning [1],
I noticed that the 2nd patch apparently broke the ability to create a
table using a range over collatable datatype attribute, which we
apparently don't test anywhere.  Simple example to reproduce:

CREATE TYPE myrange_text AS range (subtype = text);
CREATE TABLE test_text(
meh myrange_text
);
ERROR:  42P16: no collation was derived for column "meh" with
collatable type text
HINT:  Use the COLLATE clause to set the collation explicitly.

AFAICT, this is only a thinko in CheckAttributeType(), where the range
collation should be provided rather than the original tuple desc one,
as per attached.  I also added a create/drop table in an existing
regression test that was already creating range over collatable type.

[1] 
https://www.postgresql.org/message-id/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com


fix_collatable_range-v1.diff
Description: Binary data


Re: pg_restore crash when there is a failure before all child process is created

2020-01-31 Thread Ahsan Hadi
I have applied and tested both patches separately and ran the regression
suite with both. No new test cases are failing with either patch.

The issue is fixed by both patches; however, the fix from Tom looks more
elegant. I haven't done a detailed code review.

On Fri, Jan 31, 2020 at 12:39 AM Tom Lane  wrote:

> vignesh C  writes:
> > On Wed, Jan 29, 2020 at 6:54 PM Ahsan Hadi  wrote:
> >> Can you share a test case or steps that you are using to reproduce this
> issue? Are you reproducing this using a debugger?
>
> > I could reproduce with the following steps:
> > Make cluster setup.
> > Create few tables.
> > Take a dump in directory format using pg_dump.
> > Restore the dump generated above using pg_restore with very high
> > number for --jobs options around 600.
>
> I agree this is quite broken.  Another way to observe the crash is
> to make the fork() call randomly fail, as per booby-trap-fork.patch
> below (not intended for commit, obviously).
>
> I don't especially like the proposed patch, though, as it introduces
> a great deal of confusion into what ParallelState.numWorkers means.
> I think it's better to leave that as being the allocated array size,
> and instead clean up all the fuzzy thinking about whether workers
> are actually running or not.  Like 0001-fix-worker-status.patch below.
>
> regards, tom lane
>
>

-- 
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: mailto: ahsan.h...@highgo.ca


Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-01-31 Thread Michael Banck
Hi,

Am Freitag, den 31.01.2020, 13:53 +0900 schrieb Michael Paquier:
> On Thu, Jan 30, 2020 at 06:11:22PM +0100, Michael Banck wrote:
> Having a past tablespace version left
> around after an upgrade is a pilot error in my opinion because
> pg_upgrade generates a script to cleanup past tablespaces, no?  So
> your patch does not look like a good idea to me. 

Not sure I agree with it, but if that (i.e. after pg_upgrade in copy
mode, you have no business using the old cluster as well as the new
one) is project policy, fair enough.

However, Postgres does not disallow creating tablespaces in the same
location from two different versions, so you don't need the pg_upgrade
scenario to get into this (pg_checksums checking the wrong cluster's
data) problem:

postgres@kohn:~$ psql -p 5437 -c "CREATE TABLESPACE bar LOCATION 
'/var/lib/postgresql/bar'"
CREATE TABLESPACE
postgres@kohn:~$ psql -p 5444 -c "CREATE TABLESPACE bar LOCATION 
'/var/lib/postgresql/bar'"
CREATE TABLESPACE
postgres@kohn:~$ ls bar
PG_11_201809051  PG_12_201909212
postgres@kohn:~$ touch bar/PG_11_201809051/pg_internal.init.123
postgres@kohn:~$ pg_ctlcluster 12 main stop
  sudo systemctl stop postgresql@12-main
postgres@kohn:~$ LANG=C /usr/lib/postgresql/12/bin/pg_checksums -D 
/var/lib/postgresql/12/main
pg_checksums: error: invalid segment number 0 in file name
"/var/lib/postgresql/12/main/pg_tblspc/16396/PG_11_201809051/pg_internal
.init.123"

I believe this is in order to allow pg_upgrade to run in the first
place. But is this pilot error as well? In any case, it is a situation
we allow to happen, so IMO we should fix pg_checksums to skip the
foreign tablespaces.

As an aside, I would advocate just skipping files which fail the
segment-number determination step (and maybe logging a warning) rather
than bailing out.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz





Re: Marking some contrib modules as trusted extensions

2020-01-31 Thread Dean Rasheed
On Wed, 29 Jan 2020 at 21:39, Tom Lane  wrote:
>
> >>> pg_stat_statements
>
> Mmm, I'm not convinced --- the ability to see what statements are being
> executed in other sessions (even other databases) is something that
> paranoid installations might not be so happy about.  Our previous
> discussions about what privilege level is needed to look at
> pg_stat_statements info were all made against a background assumption
> that you needed some extra privilege to set up the view in the first
> place.  I think that would need another look or two before being
> comfortable that we're not shifting the goal posts too far.
>
> The bigger picture here is that I don't want to get push-back that
> we've broken somebody's security posture by marking too many extensions
> trusted.  So for anything where there's any question about security
> implications, we should err in the conservative direction of leaving
> it untrusted.
>

+1

I wonder if the same could be said about pgrowlocks.

Regards,
Dean




Re: Hash join not finding which collation to use for string hashing

2020-01-31 Thread Amit Langote
On Fri, Jan 31, 2020 at 6:15 AM Mark Dilger
 wrote:
> > On Jan 30, 2020, at 12:29 PM, Tom Lane  wrote:
> > Mark Dilger  writes:
> >> Would it make sense to catch a qual with unassigned collation
> >> somewhere in the planner, where the qual's operator family is
> >> estatblished, by checking if the operator family behavior is sensitive
> >> to collations?
> >
> >> I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for 
> >> that.  Can you recommend how I would proceed there?
> >
> > There's no such information attached to opfamilies, which is more or less
> > forced by the fact that individual operators don't expose it either.
> > There's not much hope of making that better without incompatible changes
> > in the requirements for extensions to define operators and/or operator
> > families.
>
> Thanks, Tom, for confirming this.

Just for the record, I will explain why I brought up doing anything
with operator families at all.  What I was really imagining is putting
a hard-coded check somewhere in the middle of equivalence processing
to see if a given qual's operator would be sensitive to collation,
based *only* on whether it belongs to a text operator family such as
TEXT_BTREE_FAM_OID, while the qual expression's inputcollid is 0
(meaning the parser failed to resolve a collation conflict among its
arguments), and erroring out if so.  If we do that, maybe we won't need
the check_collation_set() that's used in various text operators.  Also,
erroring out sooner might allow us to produce a more specific error
message, which, as I understand it, would help with one of Mark's
complaints: that the error message is too ambiguous because it is
emitted at such a low layer.
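
A minimal way to reach that low-level error is a join whose argument
collations conflict (a sketch):

CREATE TABLE t (a text COLLATE "C", b text COLLATE "POSIX");
SELECT * FROM t t1 JOIN t t2 ON t1.a = t2.b;
-- ERROR:  could not determine which collation to use for string hashing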

I thought of the idea after running into a recent commit relevant to
non-deterministic collations:

commit 2810396312664bdb941e549df7dfa75218d73a1c
Author: Tom Lane 
Date:   Sat Sep 21 16:29:17 2019 -0400

Fix up handling of nondeterministic collations with pattern_ops opclasses.

text_pattern_ops and its siblings can't be used with nondeterministic
collations, because they use the text_eq operator which will not behave
as bitwise equality if applied with a nondeterministic collation.  The
initial implementation of that restriction was to insert a run-time test
in the related comparison functions, but that is inefficient, may throw
misleading errors, and will throw errors in some cases that would work.
It seems sufficient to just prevent the combination during CREATE INDEX,
so do that instead.

Lacking any better way to identify the opclasses involved, we need to
hard-wire tests for them, which requires hand-assigned values for their
OIDs, which forces a catversion bump because they previously had OIDs
that would be assigned automatically.  That's slightly annoying in the
v12 branch, but fortunately we're not at rc1 yet, so just do it.

Discussion: https://postgr.es/m/22566.1568675...@sss.pgh.pa.us

IIUC, the above commit removes a check_collation_set() call from an
operator class comparison function in favor of ensuring that an index
using that operator class can only be defined with a deterministic
collation in the first place.  But as the above commit is purportedly
only a stop-gap solution, owing to the lack of operator class
infrastructure for considering collation in operator semantics, maybe
we shouldn't spread such a hack to other places.

Thanks,
Amit




Re: SimpleLruTruncate() mutual exclusion

2020-01-31 Thread Noah Misch
On Thu, Jan 30, 2020 at 04:34:33PM +0100, Dmitry Dolgov wrote:
> > On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
> > > I'm probably missing something, so just wanted to clarify. Do I
> > > understand correctly, that thread [1] and this one are independent, and
> > > it is assumed in the scenario above that we're at the end of XID space,
> > > but not necessarily having rounding issues? I'm a bit confused, since
> > > the reproduce script does something about cutoffPage, and I'm not sure
> > > if it's important or not.
> >
> > I think the repro recipe contained an early fix for the other thread's bug.
> > While they're independent in principle, I likely never reproduced this bug
> > without having a fix in place for the other thread's bug.  The bug in the
> > other thread was easier to hit.
> 
> Just to clarify, since the CF item for this patch was withdrawn
> recently: does that mean that eventually the thread [1] covers this
> one too?
> 
> [1]: 
> https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com

I withdrew $SUBJECT because, if someone reviews one of my patches, I want it
to be the one you cite in [1].  I plan not to commit [1] without a Ready for
Committer, and I plan not to commit $SUBJECT before committing [1].  I would
be willing to commit $SUBJECT without getting a review, however.




Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-01-31 Thread Kyotaro Horiguchi
At Fri, 31 Jan 2020 17:30:43 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I don't think that is a problem right away, of course. It looks good
> to me except for the possible excessive exclusion.  So, I don't object
> to it if we don't mind that.

That's a bit wrong.  All the discussion so far has been only about
excludeFiles.  I think we should refrain from letting more files match
noChecksumFiles.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-01-31 Thread Kyotaro Horiguchi
At Fri, 31 Jan 2020 13:53:52 +0900, Michael Paquier  wrote 
in 
> On Thu, Jan 30, 2020 at 06:11:22PM +0100, Michael Banck wrote:
> > The other question is whether it is possible to end up with a
> > pg_internal.init.$PID file in a running cluster. E.g. if an instance
> > crashes and gets started up again - are those cleaned up during crash
> > recovery, or should pg_checksums ignore them? Right now pg_checksums
> > only checks against a list of filenames and only skips on exact matches
> > not prefixes so that might take a bit of work.
> 
> Indeed, with a bad timing and a crash in the middle of
> write_relcache_init_file(), it could be possible to have such a file
> left around in the data folder.  Having a past tablespace version left
> around after an upgrade is a pilot error in my opinion because
> pg_upgrade generates a script to cleanup past tablespaces, no?  So
> your patch does not look like a good idea to me.  And now that I look
> at it, if we happen to leave behind a temporary file for
> pg_internal.init, backups fail with the following error:
> 2020-01-31 13:26:18.345 JST [102076] 010_pg_basebackup.pl ERROR:
> invalid segment number 0 in file "pg_internal.init.123"

Agreed.

> So, I think that it would be better to change basebackup.c,
> pg_checksums and pg_rewind so that files are excluded if there is a
> prefix match with the exclude lists.  Please see the attached.

Agreed that the tools should ignore such files.  Looking at
excludeFiles, it seems to me that basebackup makes sure to exclude only
files that would do harm.  I'm not sure whether it's intentional, but
tablespace_map.old and backup_label.old are not excluded.

The patch excludes harmless files such as "backup_label.20200131"
and the two files above.

I don't think that is a problem right away, of course. It looks good
to me except for the possible excessive exclusion.  So, I don't object
to it if we don't mind that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center