date:20200618

Re: Global snapshots

2020-06-18 Thread Amit Kapila

On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
 wrote:
>
>
> On 09.06.2020 11:41, Fujii Masao wrote:
> >
> >
> > The patches seem not to be registered in CommitFest yet.
> > Are you planning to do that?
> Not now. It is a sharding-related feature. I'm not sure that this
> approach is fully consistent with the sharding way now.
>

Can you please explain in detail, why you think so?  There is no
commit message explaining what each patch does so it is difficult to
understand why you said so?  Also, can you let us know if this
supports 2PC in some way and if so how is it different from what the
other thread on the same topic [1] is trying to achieve?  Also, I
would like to know if the patch related to CSN based snapshot [2] is a
precursor for this, if not, then is it any way related to this patch
because I see the latest reply on that thread [2] which says it is an
infrastructure of sharding feature but I don't understand completely
whether these patches are related?

Basically, there seem to be three threads, first, this one and then
[1] and [2] which seems to be doing the work for sharding feature but
there is no clear explanation anywhere if these are anyway related or
whether combining all these three we are aiming for a solution for
atomic commit and atomic visibility.

I am not sure if you know answers to all these questions so I added
the people who seem to be working on the other two patches.  I am also
afraid that if there is any duplicate or conflicting work going on in
these threads so we should try to find that as well.

[1] - 
https://www.postgresql.org/message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/2020061911294657960322%40highgo.ca

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Internal key management system

2020-06-18 Thread Fabien COELHO



Hello Masahiko-san,


What I referred to "only one key" is KEK.


Ok, sorry, I misunderstood.


Yeah, it depends on KMS, meaning we need different extensions for
different KMS. A KMS might support an interface that creates key if not
exist during GET but some KMS might support CREATE and GET separately.


I disagree that it is necessary, but this is debatable. The KMS-side
interface code could take care of that, eg:

   if command is "get X"
 if (X does not exist in KMS)
   create a new key stored in KMS, return it;
 else
   return KMS-stored key;
   ...

So you can still have a "GET" only interface which adapts to the "final"
KMS. Basically, the glue code which implements the interface for the KMS
can include some logic to adapt to the KMS point of view.


Is the above code is for the extension side, right?


Such a code could be in the command with which pg communicates (eg through 
its stdin/stdout, or whatever) to get keys.


pg talks to the command, the command can do anything, such as storing keys 
or communicating with an external service to retrieve them, anything 
really, that is the point.


I'm advocating defining the pg/command protocol, something along "GET xxx" 
as you wrote, and possibly provide a possible/reasonable command 
implementation, which would be part of the code you put in your patch, 
only it would be in the command instead of postgres.


For example, if users want to use a cloud KMS, say AWS KMS, to store 
DEKs and KEK they need an extension that is loaded to postgres and can 
communicate with AWS KMS. I imagine that such extension needs to be 
written in C,


Why? I could write it in bash, probably. Ok, maybe not so good for suid, 
but in principle it could be anything. I'd probably write it in C, though.


the communication between the extension uses AWS KMS API, and the 
communication between postgres core and the extension uses text 
protocol.


I'm not sure of the word "extension" above. For me the postgres side could 
be an extension as in "CREATE EXTENSION". The command itself could be 
provided in the extension code, but would not be in the "CREATE 
EXTENSION", it would be something run independently.


When postgres core needs a DEK identified by KEY-A, it asks 
for the extension to get the DEK by passing something like “GET KEY-A” 
message, and then the extension asks the existence of that key to AWK 
KMS, creates if not exist and returns it to the postgres core. Is my 
understanding right?


Yes. The command in the use-case you outline would just be an 
intermediary, but for another use-case it would do the storing. The point 
of aiming at extensibility if that from pg point of view the external 
commands provide keys, but what these commands really do to do this can be 
anything.



When we have TDE feature in the future, we would also need to change
frontend tools such as pg_waldump and pg_rewind that read database
files so that they can read encrypted files. It means that these
frond-end tools also somehow need to communicate with the external
place to get DEKs in order to decrypt encrypted database files. In
your idea, what do you think about how we can support it?


Hmmm. My idea was that the natural interface would be to get things 
through postgres. For a debug tool such as pg_waldump, probably it needs 
to be adapted if it needs to decrypt data to operate.


Now I'm not sure I understood, because of the DEK are managed by postgres 
in your patch, waldump and other external commands would have no access to 
the decrypted data anyway, so the issue would be the same?


With file-level encryption, obviously all commands which have to read and 
understand the files have to be adapted if they are to still work, which 
is another argument to have some interface rather than integrated 
server-side stuff, because these external commands would need to be able 
to get keys and use them as well.


Or I misunderstood something.


I'd like an extensible design to have anything in core. As I said in an
other mail, if you want to handle a somehow restricted use case, just
provide an external extension and do nothing in core, please. Put in core
something that people with a slightly different use case or auditor can
build on as well. The current patch makes a dozen hard-coded decisions
which it should not, IMHO.


It might have confused you that I included key manager and encryption
SQL functions to the patches but this key manager has been designed
dedicated to only TDE.


Hmmm. This is NOT AT ALL what the patch does. The documentation in your 
patch talks about "column level encryption", which is an application 
thing. Now you seem to say that it does not matter and can be removed 
because the use case is elsewhere.



It might be better to remove both SQL interface
and SQL key we discussed from the patch set as they are actually not
necessary for TDE purposes.


The documentation part of the patch, at no point, talks about TDE 
(transparent data encryption

Re: POC and rebased patch for CSN based snapshots

2020-06-18 Thread Fujii Masao





On 2020/06/19 13:36, movead...@highgo.ca wrote:


 >You mean that the last generated CSN needs to be WAL-logged because any 
smaller

CSN than the last one should not be reused after crash recovery. Right?

Yes that's it.


If right, that WAL-logging seems not necessary because CSN mechanism assumes
CSN is increased monotonically. IOW, even without that WAL-logging, CSN afer
crash recovery must be larger than that before. No?


CSN collected based on time of  system in this patch, but time is not reliable 
all the
time. And it designed for Global CSN(for sharding) where it may rely on CSN from
other node , which generated from other machine.

So monotonically is not reliable and it need to keep it's largest CSN in wal in 
case
of crash.


Thanks for the explanaion! Understood.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Fast DSM segments

2020-06-18 Thread Thomas Munro

On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro  wrote:
> Here's a version that adds some documentation.

I jumped on a dual socket machine with 36 cores/72 threads and 144GB
of RAM (Azure F72s_v2) running Linux, configured with 50GB of huge
pages available, and I ran a very simple test: select count(*) from t
t1 join t t2 using (i), where the table was created with create table
t as select generate_series(1, 4)::int i, and then prewarmed
into 20GB of shared_buffers.  I compared the default behaviour to
preallocate_dynamic_shared_memory=20GB, with work_mem set sky high so
that there would be no batching (you get a hash table of around 16GB),
and I set things up so that I could test with a range of worker
processes, and computed the speedup compared to a serial hash join.

Here's what I got:

Processes   Default   Preallocated
1   627.6s
9   101.3s = 6.1x 68.1s = 9.2x
18   56.1s = 11.1x34.9s = 17.9x
27   42.5s = 14.7x23.5s = 26.7x
36   36.0s = 17.4x18.2s = 34.4x
45   33.5s = 18.7x15.5s = 40.5x
54   35.6s = 17.6x13.6s = 46.1x
63   35.4s = 17.7x12.2s = 51.4x
72   33.8s = 18.5x11.3s = 55.5x

It scaled nearly perfectly up to somewhere just under 36 threads, and
then the slope tapered off a bit so that each extra process was
supplying somewhere a bit over half of its potential.  I can improve
the slope after the halfway point a bit by cranking HASH_CHUNK_SIZE up
to 128KB (and it doesn't get much better after that):

Processes   Default   Preallocated
1   627.6s
9   102.7s = 6.1x 67.7s = 9.2x
18   56.8s = 11.1x34.8s = 18.0x
27   41.0s = 15.3x23.4s = 26.8x
36   33.9s = 18.5x18.2s = 34.4x
45   30.1s = 20.8x15.4s = 40.7x
54   27.2s = 23.0x13.3s = 47.1x
63   25.1s = 25.0x11.9s = 52.7x
72   23.8s = 26.3x10.8s = 58.1x

I don't claim that this is representative of any particular workload
or server configuration, but it's a good way to show that bottleneck,
and it's pretty cool to be able to run a query that previously took
over 10 minutes in 10 seconds.  (I can shave a further 10% off these
times with my experimental hash join prefetching patch, but I'll
probably write about that separately when I've figured out why it's
not doing better than that...).

Re: min_safe_lsn column in pg_replication_slots view

2020-06-18 Thread Kyotaro Horiguchi

At Fri, 19 Jun 2020 08:59:48 +0530, Amit Kapila  wrote 
in 
> > > > Mmm. pg_walfile_name seems too specialize to
> > > > pg_stop_backup(). (pg_walfile_name_offset() returns wrong result for
> > > > segment boundaries.)  I'm not willing to do that only to follow such
> > > > suspicious(?) specification, but surely it would practically be better
> > > > doing that. Please find the attached first patch.
> > > >
> > >
> > > It is a little unclear to me how this or any proposed patch will solve
> > > the original problem reported by Fujii-San?  Basically, the problem
> > > arises because we don't have an interlock between when the checkpoint
> > > removes the WAL segment and the view tries to acquire the same.  Am, I
> > > missing something?
> >
> > I'm not sure, but I don't get the point of blocking WAL segment
> > removal until the view is completed.
> >
> 
> I am not suggesting to do that.
> 
> > The said columns of the view are
> > just for monitoring, which needs an information snapshot seemingly
> > taken at a certain time. And InvalidateObsoleteReplicationSlots kills
> > walsenders using lastRemovedSegNo of a different time.  The two are
> > independent each other.
> >
> > Also the patch changes min_safe_lsn to show an LSN at segment boundary
> > + 1.
> >
> 
> But aren't we doing last_removed_seg+1 even without the patch?  See code below
> 
> - {
> - XLogRecPtr min_safe_lsn;
> -
> - XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 0,
> - wal_segment_size, min_safe_lsn);

It is at the beginning byte of the *next* segment. Fujii-san told that
it should be the next byte of it, namely
"XLogSegNoOffsetToRecPtr(last_removed_seg + 1, *1*,", and the patch
calculates as that. It adds the follows instead.

+   if (max_slot_wal_keep_size_mb >= 0 && last_removed_seg != 0)
+   XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 1,
+   
wal_segment_size, min_safe_lsn);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: min_safe_lsn column in pg_replication_slots view

2020-06-18 Thread Kyotaro Horiguchi

At Fri, 19 Jun 2020 09:09:03 +0530, Amit Kapila  wrote 
in 
> On Fri, Jun 19, 2020 at 8:44 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Fri, 19 Jun 2020 10:39:58 +0900, Michael Paquier  
> > wrote in
> > > Honestly, I find a bit silly the design to compute and use the same
> > > minimum LSN value for all the tuples returned by
> > > pg_get_replication_slots, and you can actually get a pretty good
> >
> > I see it as silly.  I think I said upthread that it was the distance
> > to the point where the slot loses a segment, and it was rejected but
> > just removing it makes us unable to estimate the distance so it is
> > there.
> >
> 
> IIUC, the value of min_safe_lsn will lesser than restart_lsn, so one
> can compute the difference of those to see how much ahead the
> replication slot's restart_lsn is from min_safe_lsn but still it is
> not clear how user will make any use of it.  Can you please explain
> how the distance you are talking about is useful to users or anyone?

When max_slot_wal_keep_size is set, the slot may retain up to as many
as that amount of old WAL segments then suddenly loses the oldest
segments.  *I* thought that I would use it in an HA cluster tool to
inform users about the remaining time (not literally, of course) a
disconnected standy is allowed diconnected.  Of course even if some
segments have been lost, they could be copied from the primary's
archive so that's not critical in theory.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Cleanup - Removal of unused function parameter from CopyReadBinaryAttribute

2020-06-18 Thread Bharath Rupireddy

> > > I don't see any problem in removing this extra parameter.
> > >
> > > However another thought, can it be used to report a bit meaningful
> > > error for field size < 0 check?
> >
> > column_no was used for that purpose in the past, but commit 0e319c7ad7
> > changed that. If we want to use column_no in the log message again,
> > it's better to check why commit 0e319c7ad7 got rid of column_no from
> > the message.
>
> I noticed that displaying of column information is present and it is
> done in a different way. Basically cstate->cur_attname is set with the
> column name before calling CopyReadBinaryAttribute function. If there
> is any error in CopyReadBinaryAttribute function,
> CopyFromErrorCallback will be called. CopyFromErrorCallback function
> takes care of displaying the column name by using cstate->cur_attname.
> I tried simulating this and it displays the column name neatly in the
> error message.:
> postgres=# copy t1 from '/home/db/copydata/t1_copy.bin' with (format 
> 'binary');
> ERROR:  invalid field size
> CONTEXT:  COPY t1, line 1, column c1
> I feel we can safely remove the parameter as in the patch.
>

thanks for this information.

+1 for this patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-18 Thread David Rowley

On Fri, 19 Jun 2020 at 15:28, Jeff Davis  wrote:
>
> On Thu, 2020-06-18 at 15:37 +1200, David Rowley wrote:
> > The attached patch fixes both the missing parallel worker information
> > and puts the properties on a single line for format=text.
>
> Thank you.
>
> I noticed some strange results in one case where one worker had a lot
> more batches than another. After I investigated, I think everything is
> fine -- it seems that one worker was getting more tuples from the
> underlying parallel scan. So everything looks good to me.

Thanks for having a look at this. I just pushed it.

David

missing support for Microsoft VSS Writer

2020-06-18 Thread Pavel Stehule

Hi

some czech user reported a broken database when he used a repair point on
Microsoft Windows.

The reply from Microsoft was, so it is not a Microsoft issue, but Postgres
issue, because Postgres doesn't handle VSS process correctly, and then
restore point can restore some parts of Postgres's data too.

https://docs.microsoft.com/en-us/windows-server/storage/file-server/volume-shadow-copy-service

Regards

Pavel

Re: Cleanup - Removal of unused function parameter from CopyReadBinaryAttribute

2020-06-18 Thread vignesh C

On Thu, Jun 18, 2020 at 10:05 PM Fujii Masao
 wrote:
>
>
>
> On 2020/06/18 23:09, Bharath Rupireddy wrote:
> > On Thu, Jun 18, 2020 at 7:01 PM vignesh C  wrote:
> >>
> >> Hi,
> >>
> >> While checking copy from code I found that the function parameter
> >> column_no is not used in CopyReadBinaryAttribute. I felt this could be
> >> removed.
> >> Attached patch contains the changes for the same.
> >> Thoughts?
> >>
> >
> > I don't see any problem in removing this extra parameter.
> >
> > However another thought, can it be used to report a bit meaningful
> > error for field size < 0 check?
>
> column_no was used for that purpose in the past, but commit 0e319c7ad7
> changed that. If we want to use column_no in the log message again,
> it's better to check why commit 0e319c7ad7 got rid of column_no from
> the message.

I noticed that displaying of column information is present and it is
done in a different way. Basically cstate->cur_attname is set with the
column name before calling CopyReadBinaryAttribute function. If there
is any error in CopyReadBinaryAttribute function,
CopyFromErrorCallback will be called. CopyFromErrorCallback function
takes care of displaying the column name by using cstate->cur_attname.
I tried simulating this and it displays the column name neatly in the
error message.:
postgres=# copy t1 from '/home/db/copydata/t1_copy.bin' with (format 'binary');
ERROR:  invalid field size
CONTEXT:  COPY t1, line 1, column c1
I feel we can safely remove the parameter as in the patch.

Regards,
Vignesh

Re: POC and rebased patch for CSN based snapshots

2020-06-18 Thread movead...@highgo.ca


>You mean that the last generated CSN needs to be WAL-logged because any smaller
>CSN than the last one should not be reused after crash recovery. Right?
Yes that's it. 

>If right, that WAL-logging seems not necessary because CSN mechanism assumes
>CSN is increased monotonically. IOW, even without that WAL-logging, CSN afer
>crash recovery must be larger than that before. No?

CSN collected based on time of  system in this patch, but time is not reliable 
all the
time. And it designed for Global CSN(for sharding) where it may rely on CSN from
other node , which generated from other machine. 

So monotonically is not reliable and it need to keep it's largest CSN in wal in 
case
of crash.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-18 Thread David Rowley

On Fri, 19 Jun 2020 at 16:06, Justin Pryzby  wrote:
>
> On Fri, Jun 19, 2020 at 03:03:41PM +1200, David Rowley wrote:
> > On Fri, 19 Jun 2020 at 14:20, Justin Pryzby  wrote:
> > > Please be sure to use two spaces between each field !
> > >
> > > See earlier discussions (and commits referenced by the Opened Items page).
> > > https://www.postgresql.org/message-id/20200402054120.gc14...@telsasoft.com
> > > https://www.postgresql.org/message-id/20200407042521.GH2228%40telsasoft.com
> >
> > Thanks. I wasn't aware of that conversion.
>
> To be clear, there wasn't a "conversion".

Sorry, I meant to write "conversation".

>
> Some minor nitpicks now that we've dealt with the important issue of
> whitespace:
>
> +   boolgotone = false;
>
> => Would you consider calling that "found" ?

I'll consider it.  I didn't really invent the name. There's plenty of
other places that use that name for the same thing.  I think of
"found" as more often used when we're looking for something and need
to record if we found it or not.  That's not really happening here. I
just want to record if we've added a property yet.

> +   int n;
> +
> +   for (n = 0; n < aggstate->shared_info->num_workers; n++)
>
> => Maybe use a C99 declaration ?

Maybe.  It'll certainly save a couple of lines.

> +   if (hash_batches_used > 0)
> +   {
> +   ExplainPropertyInteger("Disk Usage", 
> "kB", hash_disk_used,
> + 
>  es);
> +   ExplainPropertyInteger("HashAgg 
> Batches", NULL,
> + 
>  hash_batches_used, es);
> +   }
>
> => Shouldn't those *always* be shown in nontext format ?  I realize that's not
> a change new to your patch, and maybe should be a separate commit.

Perhaps a separate commit. I don't want to overload the debate with
too many things. I'd rather push forward with just the originally
proposed change since nobody has shown any objection for it.

> +   size = offsetof(SharedAggInfo, sinstrument)
> +   + pcxt->nworkers * sizeof(AggregateInstrumentation);
>
> => There's a couple places where I'd prefer to see "+" at the end of the
> preceding line (but YMMV).

I pretty much just copied the whole of that code from nodeSort.c. I'm
more inclined to just keep it as similar to that as possible. However,
if pgindent decides otherwise, then I'll go with that. I imagine it
won't move it though as that code has already been through indentation
a few times before.

Thanks for the review.

David

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-18 Thread Justin Pryzby

On Fri, Jun 19, 2020 at 03:03:41PM +1200, David Rowley wrote:
> On Fri, 19 Jun 2020 at 14:20, Justin Pryzby  wrote:
> > Please be sure to use two spaces between each field !
> >
> > See earlier discussions (and commits referenced by the Opened Items page).
> > https://www.postgresql.org/message-id/20200402054120.gc14...@telsasoft.com
> > https://www.postgresql.org/message-id/20200407042521.GH2228%40telsasoft.com
> 
> Thanks. I wasn't aware of that conversion.

To be clear, there wasn't a "conversion".  There were (and are still) different
formats (which everyone agrees isn't ideal) used by "explain(BUFFERS)" vs Sort
and Hash.  The new incr sort changed from output that looked like Buffers (one
space, and equals) to output that looked like Sort/Hashjoin (two spaces and
colons).  And the new explain(WAL) originally used a hybrid (which, on my
request, used two spaces), but it was changed (back) to use one space, for
consistency with explain(BUFFERS).

Some minor nitpicks now that we've dealt with the important issue of
whitespace:

+   boolgotone = false;

=> Would you consider calling that "found" ?

I couldn't put my finger on it at first why this felt so important to ask, but
realized that my head was tripping over a variable whose name starts with
"goto", and spending 0.2 seconds trying to figure out what you might have meant
by "goto ne".

+   int n;
+
+   for (n = 0; n < aggstate->shared_info->num_workers; n++)

=> Maybe use a C99 declaration ?

+   if (hash_batches_used > 0)
+   {
+   ExplainPropertyInteger("Disk Usage", 
"kB", hash_disk_used,
+   
   es);
+   ExplainPropertyInteger("HashAgg 
Batches", NULL,
+   
   hash_batches_used, es);
+   }

=> Shouldn't those *always* be shown in nontext format ?  I realize that's not
a change new to your patch, and maybe should be a separate commit.

+   size = offsetof(SharedAggInfo, sinstrument)
+   + pcxt->nworkers * sizeof(AggregateInstrumentation);

=> There's a couple places where I'd prefer to see "+" at the end of the
preceding line (but YMMV).

-- 
Justin

Re: Cleanup - Removal of unused function parameter from CopyReadBinaryAttribute

2020-06-18 Thread Amit Kapila

On Fri, Jun 19, 2020 at 9:18 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Thu, Jun 18, 2020 at 10:05 PM Fujii Masao
> >  wrote:
> >> column_no was used for that purpose in the past, but commit 0e319c7ad7
> >> changed that.
>
> > Yeah, but not sure why?  By looking at the commit message and change
> > it is difficult to say why it has been removed?  Tom has made that
> > change but I don't think he would remember it, in any case, adding him
> > in the email to see if he remembers anything related to it.
>
> Hm, no, that commit is nearly old enough to vote :-(
>
> However, I dug around in the archives, and I found what seems to be
> the relevant pghackers thread:
>
> https://www.postgresql.org/message-id/flat/28188.1064615075%40sss.pgh.pa.us#8e0c07452bb7e729829d456cfb0ec485
>
> Looking at that, I think I concluded that these error cases are not useful
> indications of problems within the specific column's data, but most likely
> indicate corruption at the level of the overall COPY line format; ergo the
> line-level context display is sufficient.  You could quibble with that
> conclusion of course, but if you agree with it, then the column_no
> parameter is useless here.
>

I don't see any problem with your conclusion and the fact that we
haven't came across any case which requires column_no in such messages
favors your conclusion.

>  I probably just failed to notice at the time
> that the parameter was otherwise unused, else I would have removed it.
>

No issues, I can take care of this (probably in HEAD only).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-06-18 Thread Justin Pryzby

On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote:
> On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby  wrote:
> > On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote:
> > > I've pushed the fist part of this patch series - I've reorganized it a
> >
> > I scanned through this again post-commit.  Find attached some suggestions.
> >
> > Shouldn't non-text explain output always show both disk *and* mem, including
> > zeros ?
>
> Could you give more context on this? Is there a standard to follow?
> Regular sort nodes only ever report one type, so there's not a good
> parallel there.

The change I proposed was like:

@@ -2829,7 +2829,6 @@ show_incremental_sort_group_info(IncrementalSortGroupInfo 
*groupInfo,

ExplainPropertyList("Sort Methods Used", methodNames, es);

-   if (groupInfo->maxMemorySpaceUsed > 0)
{
longavgSpace = 
groupInfo->totalMemorySpaceUsed / groupInfo->groupCount;
const char *spaceTypeName;
...
-   if (groupInfo->maxDiskSpaceUsed > 0)
{
...

To show in non-text format *both* disk and memory space used, even if zero.

I still think that's what's desirable.

If it's important to show *whether* a sort space was used, then I think there
should be a boolean, or an int 0/1.  But I don't think it's actually needed,
since someone parsing the explain output could just check
|if _dict['Peak Sort Space Used'] > 0: ...
the same as you're doing, without having to write some variation on:
|if 'Peak Sort Space Used' in _dict and _dict['Peak Sort Space Used'] > 0: ...

-- 
Justin

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread amul sul

On Thu, Jun 18, 2020 at 8:23 PM Robert Haas  wrote:
>
> On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila  wrote:
> > For buffer replacement, many-a-times we have to also perform
> > XLogFlush, what do we do for that?  We can't proceed without doing
> > that and erroring out from there means stopping read-only query from
> > the user perspective.
>
> I think we should stop WAL writes, then XLogFlush() once, then declare
> the system R/O. After that there might be more XLogFlush() calls but
> there won't be any new WAL, so they won't do anything.
>
Yeah, the proposed v1 patch does the same.

Regards,
Amul

Re: POC and rebased patch for CSN based snapshots

2020-06-18 Thread Fujii Masao





On 2020/06/19 12:12, movead...@highgo.ca wrote:


Thanks for reply.

 >Probably it's not time to do the code review yet, but when I glanced the 
patch,

I came up with one question.
0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
(and inserts it into WAL buffers). Which means that new WAL record is generated
whenever CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
really necessary for CSN?

This is designed for crash recovery, here we record our most new lsn in wal so 
it
will not use a history lsn after a restart. It will not write into wal every 
time, but with
a gap which you can see CSNAddByNanosec() function.


You mean that the last generated CSN needs to be WAL-logged because any smaller
CSN than the last one should not be reused after crash recovery. Right?

If right, that WAL-logging seems not necessary because CSN mechanism assumes
CSN is increased monotonically. IOW, even without that WAL-logging, CSN afer
crash recovery must be larger than that before. No?



BTW, GenerateCSN() is called while holding ProcArrayLock. Also it inserts new
WAL record in WriteXidCsnXlogRec() while holding spinlock. Firstly this is not
acceptable because spinlocks are intended for *very* short-term locks.
Secondly, I don't think that WAL generation during ProcArrayLock is good
design because ProcArrayLock is likely to be bottleneck and its term should
be short for performance gain.

Thanks for point out which may help me deeply, I will reconsider that.


Thanks for working on this!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Cleanup - Removal of unused function parameter from CopyReadBinaryAttribute

2020-06-18 Thread Tom Lane

Amit Kapila  writes:
> On Thu, Jun 18, 2020 at 10:05 PM Fujii Masao
>  wrote:
>> column_no was used for that purpose in the past, but commit 0e319c7ad7
>> changed that.

> Yeah, but not sure why?  By looking at the commit message and change
> it is difficult to say why it has been removed?  Tom has made that
> change but I don't think he would remember it, in any case, adding him
> in the email to see if he remembers anything related to it.

Hm, no, that commit is nearly old enough to vote :-(

However, I dug around in the archives, and I found what seems to be
the relevant pghackers thread:

https://www.postgresql.org/message-id/flat/28188.1064615075%40sss.pgh.pa.us#8e0c07452bb7e729829d456cfb0ec485

Looking at that, I think I concluded that these error cases are not useful
indications of problems within the specific column's data, but most likely
indicate corruption at the level of the overall COPY line format; ergo the
line-level context display is sufficient.  You could quibble with that
conclusion of course, but if you agree with it, then the column_no
parameter is useless here.  I probably just failed to notice at the time
that the parameter was otherwise unused, else I would have removed it.

regards, tom lane

Re: min_safe_lsn column in pg_replication_slots view

2020-06-18 Thread Amit Kapila

On Fri, Jun 19, 2020 at 8:44 AM Kyotaro Horiguchi
 wrote:
>
> At Fri, 19 Jun 2020 10:39:58 +0900, Michael Paquier  
> wrote in
> > Honestly, I find a bit silly the design to compute and use the same
> > minimum LSN value for all the tuples returned by
> > pg_get_replication_slots, and you can actually get a pretty good
>
> I see it as silly.  I think I said upthread that it was the distance
> to the point where the slot loses a segment, and it was rejected but
> just removing it makes us unable to estimate the distance so it is
> there.
>

IIUC, the value of min_safe_lsn will lesser than restart_lsn, so one
can compute the difference of those to see how much ahead the
replication slot's restart_lsn is from min_safe_lsn but still it is
not clear how user will make any use of it.  Can you please explain
how the distance you are talking about is useful to users or anyone?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: POC and rebased patch for CSN based snapshots

2020-06-18 Thread movead...@highgo.ca


Thanks for reply.

>AFAIU, this patch is to improve scalability and also will be helpful
>for Global Snapshots stuff, is that right?  If so, how much
>performance/scalability benefit this patch will have after Andres's
>recent work on scalability [1]?
The patch focus on to be an infrastructure of sharding feature, according
to my test almost has the same performance with and without the patch.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Re: min_safe_lsn column in pg_replication_slots view

2020-06-18 Thread Amit Kapila

On Fri, Jun 19, 2020 at 6:32 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 18 Jun 2020 18:18:37 +0530, Amit Kapila  
> wrote in
> > On Thu, Jun 18, 2020 at 11:52 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Wed, 17 Jun 2020 21:37:55 +0900, Fujii Masao 
> > >  wrote in
> > > > On 2020/06/15 16:35, Kyotaro Horiguchi wrote:
> > > > Isn't it better to use 1 as the second argument of the above,
> > > > in order to address the issue that I reported upthread?
> > > > Otherwise, the WAL file name that pg_walfile_name(min_safe_lsn)
> > > > returns
> > > > would be confusing.
> > >
> > > Mmm. pg_walfile_name seems too specialize to
> > > pg_stop_backup(). (pg_walfile_name_offset() returns wrong result for
> > > segment boundaries.)  I'm not willing to do that only to follow such
> > > suspicious(?) specification, but surely it would practically be better
> > > doing that. Please find the attached first patch.
> > >
> >
> > It is a little unclear to me how this or any proposed patch will solve
> > the original problem reported by Fujii-San?  Basically, the problem
> > arises because we don't have an interlock between when the checkpoint
> > removes the WAL segment and the view tries to acquire the same.  Am, I
> > missing something?
>
> I'm not sure, but I don't get the point of blocking WAL segment
> removal until the view is completed.
>

I am not suggesting to do that.

> The said columns of the view are
> just for monitoring, which needs an information snapshot seemingly
> taken at a certain time. And InvalidateObsoleteReplicationSlots kills
> walsenders using lastRemovedSegNo of a different time.  The two are
> independent each other.
>
> Also the patch changes min_safe_lsn to show an LSN at segment boundary
> + 1.
>

But aren't we doing last_removed_seg+1 even without the patch?  See code below

- {
- XLogRecPtr min_safe_lsn;
-
- XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 0,
- wal_segment_size, min_safe_lsn);



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-18 Thread Jeff Davis

On Thu, 2020-06-18 at 15:37 +1200, David Rowley wrote:
> The attached patch fixes both the missing parallel worker information
> and puts the properties on a single line for format=text.

Thank you. 

I noticed some strange results in one case where one worker had a lot
more batches than another. After I investigated, I think everything is
fine -- it seems that one worker was getting more tuples from the
underlying parallel scan. So everything looks good to me.

Regards,
Jeff Davis

Re: min_safe_lsn column in pg_replication_slots view

2020-06-18 Thread Kyotaro Horiguchi

At Fri, 19 Jun 2020 10:39:58 +0900, Michael Paquier  wrote 
in 
> On Fri, Jun 19, 2020 at 10:02:54AM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 18 Jun 2020 18:18:37 +0530, Amit Kapila  
> > wrote in 
> >> It is a little unclear to me how this or any proposed patch will solve
> >> the original problem reported by Fujii-San?  Basically, the problem
> >> arises because we don't have an interlock between when the checkpoint
> >> removes the WAL segment and the view tries to acquire the same.  Am, I
> >> missing something?
> 
> The proposed patch fetches the computation of the minimum LSN across
> all slots before taking ReplicationSlotControlLock so its value gets
> more lossy, and potentially older than what the slots actually
> include.  So it is an attempt to take the safest spot possible.

Minimum LSN (lastRemovedSegNo) is not protected by the lock. That
makes no defference.

> Honestly, I find a bit silly the design to compute and use the same
> minimum LSN value for all the tuples returned by
> pg_get_replication_slots, and you can actually get a pretty good

I see it as silly.  I think I said upthread that it was the distance
to the point where the slot loses a segment, and it was rejected but
just removing it makes us unable to estimate the distance so it is
there.

> estimate of that by emulating ReplicationSlotsComputeRequiredLSN()
> directly with what pg_replication_slot provides as we have a min()
> aggregate for pg_lsn.

min(lastRemovedSegNo) is the earliest value. It is enough to read it
at the first then use it in all slots.

> For these reasons, I think that we should remove for now this
> information from the view, and reconsider this part more carefully for
> 14~ with a clear definition of how much lossiness we are ready to
> accept for the information provided here, if necessary.  We could for
> example just have a separate SQL function that just grabs this value
> (or a more global SQL view for XLogCtl data that includes this data).

I think, we need at least one of the "distance" above or min_safe_lsn
in anywhere reachable from users.

> > I'm not sure, but I don't get the point of blocking WAL segment
> > removal until the view is completed.
> 
> We should really not do that anyway for a monitoring view.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: POC and rebased patch for CSN based snapshots

2020-06-18 Thread movead...@highgo.ca


Thanks for reply.

>Probably it's not time to do the code review yet, but when I glanced the patch,
>I came up with one question.
>0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
>(and inserts it into WAL buffers). Which means that new WAL record is generated
>whenever CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
>really necessary for CSN?
This is designed for crash recovery, here we record our most new lsn in wal so 
it
will not use a history lsn after a restart. It will not write into wal every 
time, but with
a gap which you can see CSNAddByNanosec() function.

>BTW, GenerateCSN() is called while holding ProcArrayLock. Also it inserts new
>WAL record in WriteXidCsnXlogRec() while holding spinlock. Firstly this is not
>acceptable because spinlocks are intended for *very* short-term locks.
>Secondly, I don't think that WAL generation during ProcArrayLock is good
>design because ProcArrayLock is likely to be bottleneck and its term should
>be short for performance gain.
Thanks for point out which may help me deeply, I will reconsider that.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Re: Asynchronous Append on postgres_fdw nodes.

2020-06-18 Thread Kyotaro Horiguchi

At Wed, 17 Jun 2020 15:01:08 +0500, "Andrey V. Lepikhov" 
 wrote in 
> On 6/16/20 1:30 PM, Kyotaro Horiguchi wrote:
> > They return 25056 rows, which is far more than 9741 rows. So remote
> > join won.
> > Of course the number of returning rows is not the only factor of the
> > cost change but is the most significant factor in this case.
> > 
> Thanks for the attention.
> I see one slight flaw of this approach to asynchronous append:
> AsyncAppend works only for ForeignScan subplans. if we have
> PartialAggregate, Join or another more complicated subplan, we can't
> use asynchronous machinery.

Yes, the asynchronous append works only when it has at least one
async-capable immediate subnode. Currently there's only one
async-capable node, ForeignScan.

> I imagine an Append node, that can switch current subplan from time to
> time and all ForeignScan nodes of the overall plan are added to one
> queue. The scan buffer can be larger than a cursor fetch size and each
> IterateForeignScan() call can induce asynchronous scan of another
> ForeignScan node if buffer is not full.
> But these are only thoughts, not an proposal. I have no questions to
> your patch right now.

A major property of async-capable nodes is yieldability(?), that is,
it ought to be able to give way for other nodes when it is not ready
to return a tuple. That means such nodes are state machine rather than
function.  Fortunately ForeignScan is natively a kind of state machine
in a sense so it is easily turned into async-capable node. Append is
also a state machine in the same sense but currently no other nodes
can use it as async-capable node.

For example, an Agg or Sort node generally needs two or more tuples
from its subnode to generate a tuple to be returned to parent.  Some
working memory is needed while generating a returning tuple.  If the
node takes in a tuple from a subnode but not generated a result tuple,
the node must yield CPU time to other nodes. These nodes are not state
machines at all and it is somewhat hard to make it so.  It gets quite
complex in WindowAgg since it calls subnodes at arbitrary call level
of component functions.

Further issue is leaf scan nodes, SeqScan, IndexScan, etc. also need
to be asynchronous.

Finally the executor will turn into push-up style from the current
volcano (pull-style).

I tried all of that (perhaps except scan nodes) a couple of years ago
but the result was a kind of crap^^;

After all, I returned to the current shape.  It doesn't seem bad as
Thomas proposed the same thing.

*1: async-aware is defined (here) as a node that can have
async-capable subnodes.

> It may lead to a situation than small difference in a filter constant
> can cause a big difference in execution time.

It is what we usually see?  We could get a big win for certain
condition without a loss even otherwise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-18 Thread David Rowley

On Fri, 19 Jun 2020 at 14:20, Justin Pryzby  wrote:
> Please be sure to use two spaces between each field !
>
> See earlier discussions (and commits referenced by the Opened Items page).
> https://www.postgresql.org/message-id/20200402054120.gc14...@telsasoft.com
> https://www.postgresql.org/message-id/20200407042521.GH2228%40telsasoft.com

Thanks. I wasn't aware of that conversion.

v2 attached.

David


hashagg_explain_fix_v2.patch
Description: Binary data

Re: Cleanup - Removal of unused function parameter from CopyReadBinaryAttribute

2020-06-18 Thread Amit Kapila

On Thu, Jun 18, 2020 at 10:05 PM Fujii Masao
 wrote:
>
>
> On 2020/06/18 23:09, Bharath Rupireddy wrote:
> > On Thu, Jun 18, 2020 at 7:01 PM vignesh C  wrote:
> >>
> >> Hi,
> >>
> >> While checking copy from code I found that the function parameter
> >> column_no is not used in CopyReadBinaryAttribute. I felt this could be
> >> removed.
> >> Attached patch contains the changes for the same.
> >> Thoughts?
> >>
> >
> > I don't see any problem in removing this extra parameter.
> >
> > However another thought, can it be used to report a bit meaningful
> > error for field size < 0 check?
>
> column_no was used for that purpose in the past, but commit 0e319c7ad7
> changed that.
>

Yeah, but not sure why?  By looking at the commit message and change
it is difficult to say why it has been removed?  Tom has made that
change but I don't think he would remember it, in any case, adding him
in the email to see if he remembers anything related to it.

commit 0e319c7ad7665673103f0b10752700fd2f33acd3
Author: Tom Lane 
Date:   Mon Sep 29 22:06:40 2003 +

Improve context display for failures during COPY IN, as recently
discussed on pghackers.
..
..
@@ -1917,7 +2019,7 @@ CopyReadBinaryAttribute(int column_no, FmgrInfo
*flinfo, Oid typelem,
  if (fld_size < 0)
  ereport(ERROR,
  (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
- errmsg("invalid size for field %d", column_no)));
+ errmsg("invalid field size")));

  /* reset attribute_buf to empty, and load raw data in it */
  attribute_buf.len = 0;
@@ -1944,8 +2046,7 @@ CopyReadBinaryAttribute(int column_no, FmgrInfo
*flinfo, Oid typelem,
  if (attribute_buf.cursor != attribute_buf.len)
  ereport(ERROR,
  (errcode(ERRCODE_INVALID_BINARY_REPRESENTATION),
- errmsg("incorrect binary data format in field %d",
- column_no)));
+ errmsg("incorrect binary data format")));

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-18 Thread Justin Pryzby

On Fri, Jun 19, 2020 at 02:02:29PM +1200, David Rowley wrote:
> On Fri, 19 Jun 2020 at 01:45, Justin Pryzby  wrote:
> > Note that "incremental sort" is also new, and splits things up more than 
> > sort.
> >
> > See in particular 6a918c3ac8a6b1d8b53cead6fcb7cbd84eee5750, which splits 
> > things
> > up even more.
> >
> > ->  Incremental Sort (actual rows=70 loops=1)
> >   Sort Key: t.a, t.b
> >   Presorted Key: t.a
> > - Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB 
> > peak=NNkB Presorted Groups: 5 Sort Methods: top-N heapsort, quicksort 
> > Memory: avg=NNkB peak=NNkB
> > + Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB 
> >  Peak Memory: NNkB
> > + Pre-sorted Groups: 5  Sort Methods: top-N heapsort, quicksort  
> > Average Memory: NNkB  Peak Memory: NNkB
> >
> > That's not really a "precedent" and I don't think that necessarily 
> > invalidates
> > your change.
> 
> I imagine you moved "Per-sorted Groups" to a new line due to the lines
> becoming too long? Or was there something else special about that
> property to warrant putting it on a new line?

https://www.postgresql.org/message-id/20200419023625.gp26...@telsasoft.com
|I still think Pre-sorted groups should be on a separate line, as in 0002.
|In addition to looking better (to me), and being easier to read, another reason
|is that there are essentially key=>values here, but the keys are repeated (Sort
|Method, etc).

> postgres=# explain analyze select a,sum(b) from ab group by a;
>  HashAggregate  (cost=175425.12..194985.79 rows=988 width=12) (actual 
> time=1551.920..5281.670 rows=1000 loops=1)
>Group Key: a
>Peak Memory Usage: 97 kB Disk Usage: 139760 kB HashAgg Batches: 832

Please be sure to use two spaces between each field !

See earlier discussions (and commits referenced by the Opened Items page).
https://www.postgresql.org/message-id/20200402054120.gc14...@telsasoft.com
https://www.postgresql.org/message-id/20200407042521.GH2228%40telsasoft.com

+   appendStringInfoChar(es->str, ' ');
+
+   appendStringInfo(es->str, "Peak Memory Usage: " INT64_FORMAT " 
kB", memPeakKb);
+
+   if (aggstate->hash_batches_used > 0)
+   appendStringInfo(es->str, " Disk Usage: " UINT64_FORMAT 
" kB HashAgg Batches: %d",

-- 
Justin

Re: Parallel Seq Scan vs kernel read ahead

2020-06-18 Thread David Rowley

On Fri, 19 Jun 2020 at 11:34, David Rowley  wrote:
>
> On Fri, 19 Jun 2020 at 03:26, Robert Haas  wrote:
> >
> > On Thu, Jun 18, 2020 at 6:15 AM David Rowley  wrote:
> > > With a 32TB relation, the code will make the chunk size 16GB.  Perhaps
> > > I should change the code to cap that at 1GB.
> >
> > It seems pretty hard to believe there's any significant advantage to a
> > chunk size >1GB, so I would be in favor of that change.
>
> I could certainly make that change.  With the standard page size, 1GB
> is 131072 pages and a power of 2. That would change for non-standard
> page sizes, so we'd need to decide if we want to keep the chunk size a
> power of 2, or just cap it exactly at whatever number of pages 1GB is.
>
> I'm not sure how much of a difference it'll make, but I also just want
> to note that synchronous scans can mean we'll start the scan anywhere
> within the table, so capping to 1GB does not mean we read an entire
> extent. It's more likely to span 2 extents.

Here's a patch which caps the maximum chunk size to 131072.  If
someone doubles the page size then that'll be 2GB instead of 1GB. I'm
not personally worried about that.

I tested the performance on a Windows 10 laptop using the test case from [1]

Master:

workers=0: Time: 141175.935 ms (02:21.176)
workers=1: Time: 316854.538 ms (05:16.855)
workers=2: Time: 323471.791 ms (05:23.472)
workers=3: Time: 321637.945 ms (05:21.638)
workers=4: Time: 308689.599 ms (05:08.690)
workers=5: Time: 289014.709 ms (04:49.015)
workers=6: Time: 267785.270 ms (04:27.785)
workers=7: Time: 248735.817 ms (04:08.736)

Patched:

workers=0: Time: 155985.204 ms (02:35.985)
workers=1: Time: 112238.741 ms (01:52.239)
workers=2: Time: 105861.813 ms (01:45.862)
workers=3: Time: 91874.311 ms (01:31.874)
workers=4: Time: 92538.646 ms (01:32.539)
workers=5: Time: 93012.902 ms (01:33.013)
workers=6: Time: 94269.076 ms (01:34.269)
workers=7: Time: 90858.458 ms (01:30.858)

David

[1] 
https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn%2BSa7NTKEVWtg%40mail.gmail.com
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 537913d1bb..5792040bc1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -520,12 +520,14 @@ heapgettup(HeapScanDesc scan,
 			{
 ParallelBlockTableScanDesc pbscan =
 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ParallelBlockTableScanWork pbscanwork =
+(ParallelBlockTableScanWork) scan->rs_base.rs_parallel_work;
 
 table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
-		 pbscan);
+		 pbscanwork, pbscan);
 
 page = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
-		 pbscan);
+		 pbscanwork, pbscan);
 
 /* Other processes might have already finished the scan. */
 if (page == InvalidBlockNumber)
@@ -720,9 +722,11 @@ heapgettup(HeapScanDesc scan,
 		{
 			ParallelBlockTableScanDesc pbscan =
 			(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+			ParallelBlockTableScanWork pbscanwork =
+			(ParallelBlockTableScanWork) scan->rs_base.rs_parallel_work;
 
 			page = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
-	 pbscan);
+	 pbscanwork, pbscan);
 			finished = (page == InvalidBlockNumber);
 		}
 		else
@@ -834,12 +838,14 @@ heapgettup_pagemode(HeapScanDesc scan,
 			{
 ParallelBlockTableScanDesc pbscan =
 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+ParallelBlockTableScanWork pbscanwork =
+(ParallelBlockTableScanWork) scan->rs_base.rs_parallel_work;
 
 table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
-		 pbscan);
+		 pbscanwork, pbscan);
 
 page = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
-		 pbscan);
+		 pbscanwork, pbscan);
 
 /* Other processes might have already finished the scan. */
 if (page == InvalidBlockNumber)
@@ -1019,9 +1025,11 @@ heapgettup_pagemode(HeapScanDesc scan,
 		{
 			ParallelBlockTableScanDesc pbscan =
 			(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+			ParallelBlockTableScanWork pbscanwork =
+			(ParallelBlockTableScanWork) scan->rs_base.rs_parallel_work;
 
 			page = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
-	 pbscan);
+	 pbscanwork, pbscan);
 			finished = (page == InvalidBlockNumber);
 		}
 		else
@@ -1155,6 +1163,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_nkeys = nkeys;
 	scan->rs_base.rs_flags = flags;
 	scan->rs_base.rs_parallel = parallel_scan;
+	scan->rs_base.rs_parallel_work =
+		palloc0(sizeof(ParallelBlockTableScanWorkData));
 	scan->rs_strategy = NULL;	/* set in initscan */
 
 	/*
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..c2ee42e795 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -25,10 +2

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-18 Thread David Rowley

On Fri, 19 Jun 2020 at 01:45, Justin Pryzby  wrote:
> Note that "incremental sort" is also new, and splits things up more than sort.
>
> See in particular 6a918c3ac8a6b1d8b53cead6fcb7cbd84eee5750, which splits 
> things
> up even more.
>
> ->  Incremental Sort (actual rows=70 loops=1)
>   Sort Key: t.a, t.b
>   Presorted Key: t.a
> - Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB 
> peak=NNkB Presorted Groups: 5 Sort Methods: top-N heapsort, quicksort Memory: 
> avg=NNkB peak=NNkB
> + Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB  
> Peak Memory: NNkB
> + Pre-sorted Groups: 5  Sort Methods: top-N heapsort, quicksort  
> Average Memory: NNkB  Peak Memory: NNkB
>
> That's not really a "precedent" and I don't think that necessarily invalidates
> your change.

I imagine you moved "Per-sorted Groups" to a new line due to the lines
becoming too long? Or was there something else special about that
property to warrant putting it on a new line?

If it's due to the length of the line, then I don't think there are
quite enough properties for HashAgg to warrant wrapping them to
another line.

Perhaps there's some merit having something else decide when we should
wrap to a new line. e.g once we've put 4 properties on a single line
with the text format. However, it seems like we're pretty inconsistent
with the normal form of properties. Some have multiple values per
property, e.g:

if (es->format == EXPLAIN_FORMAT_TEXT)
{
ExplainIndentText(es);
appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
sortMethod, spaceType, spaceUsed);
}
else
{
ExplainPropertyText("Sort Method", sortMethod, es);
ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
ExplainPropertyText("Sort Space Type", spaceType, es);
}

So spaceType is a "Sort Method" in the text format, but it's "Sort
Space Type" in other formats.  It might not be easy to remove all the
special casing for the text format out of explain.c without changing
the output.

As for this patch, I don't think it's unreasonable to have the 3
possible HashAgg properties on a single line Other people might
disagree, so here's an example of what the patch changes it to:

postgres=# explain analyze select a,sum(b) from ab group by a;
 QUERY PLAN

 HashAggregate  (cost=175425.12..194985.79 rows=988 width=12) (actual
time=1551.920..5281.670 rows=1000 loops=1)
   Group Key: a
   Peak Memory Usage: 97 kB Disk Usage: 139760 kB HashAgg Batches: 832
   ->  Seq Scan on ab  (cost=0.00..72197.00 rows=5005000 width=8)
(actual time=0.237..451.228 rows=5005000 loops=1)

Master currently does:

QUERY PLAN
-
 HashAggregate (actual time=31.724..87.638 rows=1000 loops=1)
   Group Key: a
   Peak Memory Usage: 97 kB
   Disk Usage: 3928 kB
   HashAgg Batches: 798
   ->  Seq Scan on ab (actual time=0.006..9.243 rows=10 loops=1)

David

Re: Add support for INDEX_CLEANUP and TRUNCATE to vacuumdb

2020-06-18 Thread Michael Paquier

On Thu, Jun 18, 2020 at 09:26:50PM +, Bossart, Nathan wrote:
> It looks like I missed a couple of tags in the documentation changes.
> That should be fixed in v3.

Thanks.  This flavor looks good to me in terms of code, and the test
coverage is what's needed for all the code paths added.  This version
is using my suggestion of upthread for the option names: --no-truncate
and --no-index-cleanup.  Are people fine with this choice?
--
Michael


signature.asc
Description: PGP signature

Re: min_safe_lsn column in pg_replication_slots view

2020-06-18 Thread Michael Paquier

On Fri, Jun 19, 2020 at 10:02:54AM +0900, Kyotaro Horiguchi wrote:
> At Thu, 18 Jun 2020 18:18:37 +0530, Amit Kapila  
> wrote in 
>> It is a little unclear to me how this or any proposed patch will solve
>> the original problem reported by Fujii-San?  Basically, the problem
>> arises because we don't have an interlock between when the checkpoint
>> removes the WAL segment and the view tries to acquire the same.  Am, I
>> missing something?

The proposed patch fetches the computation of the minimum LSN across
all slots before taking ReplicationSlotControlLock so its value gets
more lossy, and potentially older than what the slots actually
include.  So it is an attempt to take the safest spot possible.

Honestly, I find a bit silly the design to compute and use the same
minimum LSN value for all the tuples returned by
pg_get_replication_slots, and you can actually get a pretty good
estimate of that by emulating ReplicationSlotsComputeRequiredLSN()
directly with what pg_replication_slot provides as we have a min()
aggregate for pg_lsn.

For these reasons, I think that we should remove for now this
information from the view, and reconsider this part more carefully for
14~ with a clear definition of how much lossiness we are ready to
accept for the information provided here, if necessary.  We could for
example just have a separate SQL function that just grabs this value
(or a more global SQL view for XLogCtl data that includes this data).

> I'm not sure, but I don't get the point of blocking WAL segment
> removal until the view is completed.

We should really not do that anyway for a monitoring view.
--
Michael

signature.asc
Description: PGP signature

Re: min_safe_lsn column in pg_replication_slots view

2020-06-18 Thread Kyotaro Horiguchi

At Thu, 18 Jun 2020 18:18:37 +0530, Amit Kapila  wrote 
in 
> On Thu, Jun 18, 2020 at 11:52 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 17 Jun 2020 21:37:55 +0900, Fujii Masao 
> >  wrote in
> > > On 2020/06/15 16:35, Kyotaro Horiguchi wrote:
> > > Isn't it better to use 1 as the second argument of the above,
> > > in order to address the issue that I reported upthread?
> > > Otherwise, the WAL file name that pg_walfile_name(min_safe_lsn)
> > > returns
> > > would be confusing.
> >
> > Mmm. pg_walfile_name seems too specialize to
> > pg_stop_backup(). (pg_walfile_name_offset() returns wrong result for
> > segment boundaries.)  I'm not willing to do that only to follow such
> > suspicious(?) specification, but surely it would practically be better
> > doing that. Please find the attached first patch.
> >
> 
> It is a little unclear to me how this or any proposed patch will solve
> the original problem reported by Fujii-San?  Basically, the problem
> arises because we don't have an interlock between when the checkpoint
> removes the WAL segment and the view tries to acquire the same.  Am, I
> missing something?

I'm not sure, but I don't get the point of blocking WAL segment
removal until the view is completed. The said columns of the view are
just for monitoring, which needs an information snapshot seemingly
taken at a certain time. And InvalidateObsoleteReplicationSlots kills
walsenders using lastRemovedSegNo of a different time.  The two are
independent each other.

Also the patch changes min_safe_lsn to show an LSN at segment boundary
+ 1.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()

2020-06-18 Thread Michael Paquier

On Thu, Jun 18, 2020 at 03:39:09PM +0300, Vyacheslav Makarov wrote:
> If the WAL segment for the specified restart_lsn (STOP_LSN of the backup)
> exists, then the function will create a physical replication slot and will
> keep all the WAL segments required by the replica to catch up with the
> primary. Otherwise, it returns error, which means that the required WAL
> segments have been already utilised, so we do need to take a new backup.
> Without passing this newly added parameter
> pg_create_physical_replication_slot() works as before.
> 
> What do you think about this?

I think that this was discussed in the past (perhaps one of the
threads related to WAL advancing actually?), and this stuff is full of
holes when it comes to think about error handling with checkpoints
running in parallel, potentially doing recycling of segments you would
expect to be around based on your input value for restart_lsn *while*
pg_create_physical_replication_slot() is still running and
manipulating the on-disk slot information.  I suspect that this also
breaks a couple of assumptions behind concurrent calls of the minimum
LSN calculated across slots when a caller sees fit to recompute the
thresholds (WAL senders mainly here, depending on the replication
activity).
--
Michael


signature.asc
Description: PGP signature

Re: Extracting only the columns needed for a query

2020-06-18 Thread Melanie Plageman

On Fri, Mar 13, 2020 at 12:09 PM Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> In general implemented functionality looks good. I've checked how it
> works on the existing tests, almost everywhere required columns were not
> missing in scanCols (which is probably the most important part).
> Sometimes exressions were checked multiple times, which could
> potentially introduce some overhead, but I believe this effect is
> negligible. Just to mention some counterintuitive examples, for this
> kind of query
>
> SELECT min(q1) FROM INT8_TBL;
>
> the same column was checked 5 times in my tests, since it's present also
> in baserestrictinfo, and build_minmax_path does one extra round of
> planning and invoking make_one_rel.

Thanks so much for the very thorough review, Dmitry.

These extra calls to extract_scan_cols() should be okay in this case.
As you mentioned, for min/max queries, planner attempts an optimization
with an indexscan and, to do this, it modifies the querytree and then
calls query_planner() on it.
It tries it with NULLs first and then NULLs last -- each of which
invokes query_planner(), so that is two out of three calls. The
third is the normal invocation. I'm not sure how you would get five,
though.

> Another interesting
> example is Values Scan (e.g. in an insert statements with multiple
> records), can an abstract table AM user leverage information about
> columns in it?
>
> One case, where I believe columns were missing, is statements with
> returning:
>
> INSERT INTO foo (col1)
>   VALUES ('col1'), ('col2'), ('col3')
>   RETURNING *;
>
> Looks like in this situation there is only expression in reltarget is
> for col1, but returning result contains all columns.
>

This relates to both of your above points:
For this RETURNING query, it is a ValuesScan, so no columns have to be
scanned. We actually do add the reference to col1 to the scanCols
bitmap, though. I'm not sure we want to do so, since we don't want to
scan col1 in this case. I wonder what cases we would miss if we special
cased RTEKind RTE_VALUES when doing extract_scan_cols().

-- 
Melanie

Re: Add A Glossary

2020-06-18 Thread Alvaro Herrera

On 2020-Jun-16, Justin Pryzby wrote:
> On Tue, Jun 16, 2020 at 08:09:26PM -0400, Alvaro Herrera wrote:

Thanks for the review.  I merged all your suggestions.  This one:

> >Most local objects belong to a specific
> > +  schema in their
> > +  containing database, such as
> > +  all types of 
> > relations,
> > +  all types of 
> > functions,
> 
> Maybe say: >Relations< (all types), and >Functions< (all types)

led me down not one but two rabbit holes; first I realized that
"functions" is an insufficient term since procedures should also be
included but weren't, so I had to add the more generic term "routine"
and then modify the definitions of all routine types to mix in well.  I
think overall the quality of these definitions is improved as a result.

I also felt the need to revise the definition of "relations", so I did
that too; this made me change the definition of resultset too.

On 2020-Jun-17, Jürgen Purtz wrote:

> +1, with two formal changes:
> 
> -  Rearrangement of term "Data page" to meet alphabetical order.

To forestall these ordering issues (look, another rabbit hole), I
grepped the file for all glossterms and sorted that under en_US rules,
then reordered the terms to match that.  Turns out there were several
other ordering mistakes.

git grep ''  | sed -e 's/<[^>]*>\([^<]*\)<[^>]*>/\1/' > orig
LC_COLLATE=en_US.UTF-8 sort orig > sorted

(Eliminating the tags is important, otherwise the sort uses the tags
themselves to disambiguate)

> One last question: The definition of "Data directory" reads "... A cluster's
> storage space comprises the data directory plus ..." and 'cluster' links to
> '"glossary-instance". Shouldn't it link to "glossary-db-cluster"?

Yes, an oversight, thanks.

I also added TPS, because I had already written it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 25b03f3b37..5274feabba 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -23,7 +23,7 @@
   
 
   
-   Aggregate function
+   Aggregate function (routine)

 
  A function that
@@ -39,6 +39,11 @@

   
 
+  
+   Analytic function
+   
+  
+
   
Analyze (operation)

@@ -57,11 +62,6 @@

   
 
-  
-   Analytic function
-   
-  
-
   
Atomic

@@ -389,40 +389,33 @@

   
 
-  
-   Data directory
+  
+   Database

 
- The base directory on the filesystem of a
- server that contains all
- data files and subdirectories associated with an
- instance (with the
- exception of tablespaces).
- The environment variable PGDATA is commonly used to
- refer to the
- data directory.
-
-
- An instance's storage
- space comprises the data directory plus any additional tablespaces.
+ A named collection of
+ local SQL objects.
 
 
  For more information, see
- .
+ .
 

   
 
-  
-   Database
+  
+   Database cluster

 
- A named collection of
- SQL objects.
+ A collection of databases and global SQL objects,
+ and their common static and dynamic metadata.
+ Sometimes referred to as a
+ cluster.
 
 
- For more information, see
- .
+ In PostgreSQL, the term
+ cluster is also sometimes used to refer to an instance.
+ (Don't confuse this term with the SQL command CLUSTER.)
 

   
@@ -432,6 +425,31 @@

   
 
+  
+   Data directory
+   
+
+ The base directory on the filesystem of a
+ server that contains all
+ data files and subdirectories associated with a
+ database cluster
+ (with the exception of
+ tablespaces,
+ and optionally WAL).
+ The environment variable PGDATA is commonly used to
+ refer to the data directory.
+
+
+ A cluster's storage
+ space comprises the data directory plus any additional tablespaces.
+
+
+ For more information, see
+ .
+
+   
+  
+
   
Data page

@@ -578,7 +596,7 @@
   
 
   
-   Foreign table
+   Foreign table (relation)

 
  A relation which appears to have
@@ -631,12 +649,20 @@
   
 
   
-   Function
+   Function (routine)

 
- Any defined transformation of data. Many functions are already defined
- within PostgreSQL itself, but user-defined
- ones can also be added.
+ A type of routine that receives zero or more arguments, returns zero or more
+ output values, and is constrained to run within one transaction.
+ Functions are invoked as part of a query, for example via
+ SELECT.
+ Certain functions can return
+ sets; those are
+ called set-returning functions.
+
+
+ Functions can also be used for
+ triggers to invoke.
 
 
  For more information, see
@@ -689,13 +715,12 @@
   
 
   
-   Index
+   Index (relation)

 
  A relation

Re: Parallel Seq Scan vs kernel read ahead

2020-06-18 Thread David Rowley

On Fri, 19 Jun 2020 at 03:26, Robert Haas  wrote:
>
> On Thu, Jun 18, 2020 at 6:15 AM David Rowley  wrote:
> > With a 32TB relation, the code will make the chunk size 16GB.  Perhaps
> > I should change the code to cap that at 1GB.
>
> It seems pretty hard to believe there's any significant advantage to a
> chunk size >1GB, so I would be in favor of that change.

I could certainly make that change.  With the standard page size, 1GB
is 131072 pages and a power of 2. That would change for non-standard
page sizes, so we'd need to decide if we want to keep the chunk size a
power of 2, or just cap it exactly at whatever number of pages 1GB is.

I'm not sure how much of a difference it'll make, but I also just want
to note that synchronous scans can mean we'll start the scan anywhere
within the table, so capping to 1GB does not mean we read an entire
extent. It's more likely to span 2 extents.

David

Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-18 Thread Tom Lane

I wrote:
> ... We should expect that, starting probably this fall, there
> will be installations with no posixrules file.

> The minimum thing that we have to do, I'd say, is to change the
> documentation to explain what happens if there's no posixrules file.
> However, in view of the fact that the posixrules feature doesn't work
> past 2037 and isn't going to be fixed, maybe we should just nuke it
> now rather than waiting for our hand to be forced.  I'm not sure that
> I've ever heard of anyone replacing the posixrules file anyway.
> (The fallback case is actually better in that it works for dates past
> 2037; it's worse only in that you can't configure it.)

I experimented with removing the posixrules support, and was quite glad
I did, because guess what: our regression tests fall over.  If we do
nothing we can expect that they'll start failing on various random systems
come this fall.

The cause of the failure is that we set the timezone for all regression
tests to just 'PST8PDT', which is exactly the underspecified POSIX syntax
that is affected by the posixrules feature.  So, with the fallback
rule of "M3.2.0,M11.1.0" (which corresponds to US law since 2007)
we get the wrong answers for some old test cases involving dates in 2005.

I'm inclined to think that the simplest fix is to replace 'PST8PDT' with
'America/Los_Angeles' as the standard zone setting for the regression
tests.  We definitely should be testing behavior with time-varying DST
laws, and we can no longer count on POSIX-style zone names to do that.

Another point, which I've not looked into yet, is that I'd always
supposed that PST8PDT and the other legacy US zone names would result
in loading the zone files of those names, ie /usr/share/zoneinfo/PST8PDT
and friends.  This seems not to be happening though.  Should we try
to make it happen?  It would probably result in fewer surprises once
posixrules goes away, because our regression tests are likely not the
only users of these zone names.

(I'd still be inclined to do the first thing though; it seems to me
that the historical behavior of 'America/Los_Angeles' is way more
likely to hold still than that of 'PST8PDT'.  The IANA crew might
nuke the latter zone entirely at some point, especially if the
repeated proposals to get rid of DST in the US ever get anywhere.)

regards, tom lane

Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-18 Thread Andres Freund

Hi,

On 2020-06-16 15:20:11 -0400, Alvaro Herrera wrote:
> On 2020-Jun-15, Andres Freund wrote:
> 
> > > Also, I wonder if someone would be willing to set up a BF animal for this.
> > 
> > FWIW, I've requested a buildfarm animal id for this a few days ago, but
> > haven't received a response yet...
> 
> I did send it out, with name rorqual -- didn't you get that?  Will send
> the secret separately.

That animal is now live. Will take a bit for all branches to report in
though.

Need to get faster storage for my buildfarm animal host...

Greetings,

Andres Freund

Re: pg_dump, gzwrite, and errno

2020-06-18 Thread Tom Lane

Alvaro Herrera  writes:
> On 2020-Jun-11, Justin Pryzby wrote:
>> --- a/src/bin/pg_dump/pg_backup_directory.c
>> +++ b/src/bin/pg_dump/pg_backup_directory.c
>> @@ -347,8 +347,12 @@ _WriteData(ArchiveHandle *AH, const void *data, size_t 
>> dLen)
>> lclContext *ctx = (lclContext *) AH->formatData;
>>  
>> if (dLen > 0 && cfwrite(data, dLen, ctx->dataFH) != dLen)
>> +   {
>> +   if (errno == 0)
>> +   errno = ENOSPC;
>> fatal("could not write to output file: %s",
>>   get_cfp_error(ctx->dataFH));
>> +   }
>>  }

> This seems correct to me.

Surely it's insufficient as-is, because there is no reason to suppose
that errno is zero at entry.  You'd need to set errno = 0 first.

Also it's fairly customary in our sources to include a comment about
this machination; so the full ritual is usually more like

errno = 0;
if (pg_pwrite(fd, data, len, xlrec->offset) != len)
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
errno = ENOSPC;
ereport ...

> (I spent a long time looking at zlib sources
> to convince myself that it does work with compressed files too)

Yeah, it's not obvious that gzwrite has the same behavior w.r.t. errno
as a plain write.  But there's not much we can do to improve matters
if it does not, so we might as well assume it does.

regards, tom lane

Re: pg_dump, gzwrite, and errno

2020-06-18 Thread Alvaro Herrera

On 2020-Jun-11, Justin Pryzby wrote:

> --- a/src/bin/pg_dump/pg_backup_directory.c
> +++ b/src/bin/pg_dump/pg_backup_directory.c
> @@ -347,8 +347,12 @@ _WriteData(ArchiveHandle *AH, const void *data, size_t 
> dLen)
> lclContext *ctx = (lclContext *) AH->formatData;
>  
> if (dLen > 0 && cfwrite(data, dLen, ctx->dataFH) != dLen)
> +   {
> +   if (errno == 0)
> +   errno = ENOSPC;
> fatal("could not write to output file: %s",
>   get_cfp_error(ctx->dataFH));
> +   }
>  }

This seems correct to me.  (I spent a long time looking at zlib sources
to convince myself that it does work with compressed files too).  There
are more calls to cfwrite in pg_backup_directory.c though -- we should
patch them all.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: jsonpath versus NaN

2020-06-18 Thread Andrew Dunstan



On 6/18/20 12:35 PM, Tom Lane wrote:
> Robert Haas  writes:
>> On Thu, Jun 18, 2020 at 11:51 AM Oleg Bartunov  
>> wrote:
>>> The problem is that we tried to find a trade-off  between standard and 
>>> postgres
>>> implementation, for example, in postgres CAST  allows NaN and Inf, and SQL 
>>> Standard
>>> requires .double should works as CAST.
>> It seems like the right thing is to implement the standard, not to
>> implement whatever PostgreSQL happens to do in other cases. I can't
>> help feeling like re-using the numeric data type for other things has
>> led to this confusion. I think that fails in other cases, too: like
>> what if you have a super-long integer that can't be represented as a
>> numeric? I bet jsonb will fail, or maybe it will convert it to a
>> string, but I don't see how it can do anything else.
> Actually, the JSON spec explicitly says that any number that doesn't fit
> in an IEEE double isn't portable [1].  So we're already very far above and
> beyond the spec's requirements by using numeric.  We don't need to improve
> on that.  But I concur with your point that just because PG does X in
> some other cases doesn't mean that we must do X in json or jsonpath.
>
>   regards, tom lane
>
> [1] https://tools.ietf.org/html/rfc7159#page-6
>
>This specification allows implementations to set limits on the range
>and precision of numbers accepted.  Since software that implements
>IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
>generally available and widely used, good interoperability can be
>achieved by implementations that expect no more precision or range
>than these provide, in the sense that implementations will
>approximate JSON numbers within the expected precision.  A JSON
>number such as 1E400 or 3.141592653589793238462643383279 may indicate
>potential interoperability problems, since it suggests that the
>software that created it expects receiving software to have greater
>capabilities for numeric magnitude and precision than is widely
>available.
>
>Note that when such software is used, numbers that are integers and
>are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
>sense that implementations will agree exactly on their numeric
>values.
>


Just to complete the historical record, that standard wasn't published
at the time we created the JSON type, and the then existing standard
(rfc4627) contains no such statement. We felt it was important to be
able to represent any Postgres data value in as natural a manner as
possible given the constraints of JSON. rfc7159 was published just as we
were finalizing 9.4 with JSONB, although I'm not sure it made a heavy
impact on our consciousness. If it had I would still not have wanted to
impose any additional limitation on numerics. If you want portable
numbers cast the numeric to double before producing the JSON.

ISTR having a conversation about the extended use of jsonb in jsonpath a
while back, although I don't remember if that was on or off list. I know
it troubled me some.


cheers


andrew



-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-18 Thread Tom Lane

Robert Haas  writes:
> You could consider something along the lines of:

> This form specifies a transition that always happens during the same
> month and on the same day of the week. m identifies the month, from 1
> to 12. n specifies the n'th occurrence of the day number identified by
> d. n is a value between 1 and 4, or 5 meaning the last occurrence of
> that weekday in the month (which could be the fourth or the fifth). d
> is a value between 0 and 6, with 0 indicating Sunday.

Adopted with some minor tweaks.  Thanks for the suggestion!

regards, tom lane

re: BUG #14053: postgres crashes in plpgsql under load of concurrent transactions

2020-06-18 Thread Ranier Vilela

Hi Maxim,

Coverity show that trap (1), still persists, even in HEAD.
CID 1425439 (#1 of 1): Explicit null dereferenced (FORWARD_NULL)
15. var_deref_model: Passing null pointer expr->expr_simple_state to
ExecEvalExpr, which dereferences it. [show details
]


Is really, it is very difficult to provide a minimum reproducible test case?
I tried, without success.

regards,
Ranier Vilela

1.
https://www.postgresql.org/message-id/20160330070414.8944.52106%40wrigleys.postgresql.org

Re: Physical replication slot advance is not persistent

2020-06-18 Thread Alexey Kondratov


On 2020-06-16 10:27, Michael Paquier wrote:

On Wed, Jun 10, 2020 at 08:57:17PM +0300, Alexey Kondratov wrote:
New test reproduces this issue well. Left it running for a couple of 
hours

in repeat and it seems to be stable.


Thanks for testing.  I have been thinking about the minimum xmin and
LSN computations on advancing, and actually I have switched the
recomputing to be called at the end of pg_replication_slot_advance().
This may be a waste if no advancing is done, but it could also be an
advantage to enforce a recalculation of the thresholds for each
function call.  And that's more consistent with the slot copy, drop
and creation.



Sorry for a bit late response, but I see a couple of issues with this 
modified version of the patch in addition to the waste call if no 
advancing is done, mentioned by you:


1. Both ReplicationSlotsComputeRequiredXmin() and 
ReplicationSlotsComputeRequiredLSN() may have already been done in the 
LogicalConfirmReceivedLocation() if it was a logical slot. It may be 
fine and almost costless to do it twice, but it looks untidy for me.


2. It seems that we do not need ReplicationSlotsComputeRequiredXmin() at 
all if it was a physical slot, since we do not modify xmin in the 
pg_physical_replication_slot_advance(), doesn't it?


That's why I wanted (somewhere around v5 of the patch in this thread) to 
move all dirtying and recomputing calls to the places, where xmin / lsn 
slot modifications are actually done — 
pg_physical_replication_slot_advance() and 
LogicalConfirmReceivedLocation(). LogicalConfirmReceivedLocation() 
already does this, so we only needed to teach 
pg_physical_replication_slot_advance() to do the same.


However, just noted that LogicalConfirmReceivedLocation() only does 
ReplicationSlotsComputeRequiredLSN() if updated_xmin flag was set, which 
looks wrong from my perspective, since updated_xmin and updated_restart 
flags are set separately.


That way, I would solve this all as per attached, which works well for 
me, but definitely worth of a better testing.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Companydiff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..2ceb078418 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1081,8 +1081,14 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			SpinLockRelease(&MyReplicationSlot->mutex);
 
 			ReplicationSlotsComputeRequiredXmin(false);
-			ReplicationSlotsComputeRequiredLSN();
 		}
+
+		/*
+		 * Now recompute the minimum required LSN across all slots if
+		 * restart_lsn of the slot has been updated.
+		 */
+		if (updated_restart)
+			ReplicationSlotsComputeRequiredLSN();
 	}
 	else
 	{
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 06e4955de7..d9c19c07e5 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -419,6 +419,12 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 		 * crash, but this makes the data consistent after a clean shutdown.
 		 */
 		ReplicationSlotMarkDirty();
+
+		/*
+		 * Recompute the minimum required LSN across all slots to adjust with
+		 * the advancing done.
+		 */
+		ReplicationSlotsComputeRequiredLSN();
 	}
 
 	return retlsn;
@@ -621,13 +627,6 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
 
-	/*
-	 * Recompute the minimum LSN and xmin across all slots to adjust with the
-	 * advancing potentially done.
-	 */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
-
 	ReplicationSlotRelease();
 
 	/* Return the reached position. */

Re: [patch] demote

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 1:24 PM Tom Lane  wrote:
> It seems like this is the core decision that needs to be taken.  If
> we're willing to have these state transitions include a server restart,
> then many things get simpler.  If we're not, it's gonna cost us in
> code complexity and hence bugs.  Maybe the usability gain is worth it,
> or maybe not.
>
> I think it would probably be worth the trouble to pursue both designs in
> parallel for awhile, so we can get a better handle on exactly how much
> complexity we're buying into with the more ambitious definition.

I wouldn't vote to reject a patch that performed a demotion by doing a
full server restart, because it's a useful incremental step, but I
wouldn't be excited about spending a lot of time on it, either,
because it's basically crippleware by design. Either you have to
checkpoint before restarting, or you have to run recovery after
restarting, and either of those can be extremely slow. You also break
all the connections, which can produce application errors unless the
applications are pretty robustly designed, and you lose the entire
contents of shared_buffers, which makes things run very slowly even
after the restart is completed, which can cause a lengthy slow period
even after the system is nominally back up. All of those things are
really bad, and AFAICT the first one is the worst by a considerable
margin. It can take 20 minutes to checkpoint and even longer to run
recovery, so doing this once per year already puts you outside of
five-nines territory, and we release critical updates that cannot be
applied without a server restart about four times per year. That means
that if you perform updates by using a switchover -- a common practice
-- and if you apply all of your updates in a timely fashion --
unfortunately, not such a common practice, but one we'd surely like to
encourage -- you can't even achieve four nines if a switchover
requires either a checkpoint or running recovery. And that's without
accounting for any switchovers that you may need to perform for
reasons unrelated to upgrades, and without accounting for any
unplanned downtime. Not many people in 2020 are interested in running
a system with three nines of availability, so I think it is clear that
we need to do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-18 Thread Andres Freund

Hi,

On 2020-06-18 12:29:40 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Thu, Jun 18, 2020 at 11:59 AM Tom Lane  wrote:
> >> Sure, but wouldn't making the SpinLockAcquire layer into static inlines be
> >> sufficient to address that point, with no need to touch s_lock.h at all?
> 
> > I mean, wouldn't you then end up with a bunch of 1-line functions
> > where you can step into the function but not through whatever
> > individual things it does?
> 
> Not following your point.  The s_lock.h implementations tend to be either
> simple C statements ("*lock = 0") or asm blocks; if you feel a need to
> step through them you're going to be resorting to "si" anyway.

I agree on that.

I do think it'd be better to not have the S_LOCK macro though (but have
TAS/TAS_SPIN as we do now). And instead have SpinLockAcquire() call
s_lock() (best renamed to something a bit more meaningful). Makes the
code a bit easier to understand (no S_LOCK vs s_lock) and yields simpler
macros.

There's currently no meaningful ifdefs for S_LOCK (in contrast to
e.g. S_UNLOCK), so I don't see it making s_lock.h more complicated.

I think part of the issue here is that the naming of the s_lock exposed
macros is confusing. We have S_INIT_LOCK, TAS, SPIN_DELAY, TAS,
TAS_SPIN, S_UNLOCK, S_LOCK_FREE that are essentially hardware
dependent. But then there's S_LOCK which basically isn't.

It may have made some sense when the file was originally written, if one
imagines that S_ is the only external API, and the rest is
implementation. But given that s_lock() uses TAS() directly (and says
"platform-independent portion of waiting for a spinlock"), and that we
only have one definition of S_UNLOCK that doesn't seem quite right.

> I think the main usefulness of doing anything here would be (a) separating
> the spinlock infrastructure from callers and (b) ensuring that we have a
> declared argument type, and single-evaluation semantics, for the spinlock
> function parameters.  Both of those are adequately addressed by fixing
> spin.h, IMO anyway.

The abstraction point made me grep for includes of s_lock.h. Seems we
have some unnecessary includes of s_lock.h.

lwlock.h doesn't need to include spinlock related things anymore, and
hasn't for years, the spinlocks are replaced with atomics. That seems
pretty obvious. I'm gonna fix that in master, unless somebody thinks we
should do that more widely?

But we also have s_lock.h includes in condition_variable.h and
main.c. Seems the former should instead include spin.h and the latter
include should just be removed?

All of this however makes me wonder whether it's worth polishing
spinlocks instead of just ripping them out. Obviously we'd still need a
fallback for atomics, but it not be hard to just replace the
spinlock use with either "smaller" atomics or semaphores.

Greetings,

Andres Freund

Re: Internal key management system

2020-06-18 Thread Jose Luis Tallon


On 18/6/20 19:41, Cary Huang wrote:

Hi all

Having read through the discussion, I have some comments and 
suggestions that I would like to share.


I think it is still quite early to even talk about external key 
management system even if it is running on the same host as PG. This 
is most likely achieved as an extension that can provide communication 
to external key server and it would be a separate project/discussion. 
I think the focus now is to finalize on the internal KMS design, and 
we can discuss about how to migrate internally managed keys to the 
external when the time is right.


As long as there exists a clean interface, and the "default" (internal) 
backend is a provider of said functionality, it'll be fine.


Given that having different KMS within a single instance (e.g. per 
database) is quite unlikely, I suggest just exposing hook-like 
function-pointer variables and be done with it. Requiring a preloaded 
library for this purpose doesn't seem too restrictive ---at least at 
this stage--- and can be very easily evolved in the future --- 
super-simple API which receives a struct made of function pointers, plus 
another function to reset it to "internal defaults" and that's it.




Key management system is generally built to manage the life cycle of 
cryptographic keys, so our KMS in my opinion needs to be built with 
key life cycle in mind such as:


* Key generation
* key protection
* key storage
* key rotation
* key rewrap
* key disable/enable
* key destroy


Add the support functions for your suggested "key information" 
functionality, and that's a very rough first draft of the API ...


KMS should not perform the above life cycle management by itself 
automatically or hardcoded, instead it should expose some interfaces 
to the end user or even a backend process like TDE to trigger the above.
The only key KMS should manage by itself is the KEK, which is derived 
from cluster passphrase value. This is fine in my opinion. This KEK 
should exist only within KMS to perform key protection (by wrapping) 
and key storage (save as file).


Asking for the "cluster password" is something better left optional / 
made easily overrideable ... or we risk thousands of clusters suddenly 
not working after a reboot :S



Just my .02€


Thanks,

    J.L.

Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 1:05 PM Tom Lane  wrote:
> Robert Haas  writes:
> > It's a little confusing, though, that you documented it as Mm.n.d but
> > then in the text the order of explanation is d then m then n. Maybe
> > switch the text around so the order matches, or even use something
> > like Mmonth.occurrence.day.
>
> Yeah, I struggled with that text for a bit.  It doesn't seem to make sense
> to explain that n means the n'th occurrence of a particular d value before
> we've explained what d is, so explaining the fields in their syntactic
> order seems like a loser.  But we could describe m first without that
> problem.

You could consider something along the lines of:

This form specifies a transition that always happens during the same
month and on the same day of the week. m identifies the month, from 1
to 12. n specifies the n'th occurrence of the day number identified by
d. n is a value between 1 and 4, or 5 meaning the last occurrence of
that weekday in the month (which could be the fourth or the fifth). d
is a value between 0 and 6, with 0 indicating Sunday.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [patch] demote

2020-06-18 Thread Andres Freund

Hi,

On 2020-06-18 13:24:38 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > ... To go back to recovery rather than just to a read-only
> > state, I think you'd need to grapple with some additional issues that
> > patch doesn't touch, like some of the snapshot-taking stuff, but I
> > think you still need to solve all of the problems that it does deal
> > with, unless you're OK with killing every session.
> 
> It seems like this is the core decision that needs to be taken.  If
> we're willing to have these state transitions include a server restart,
> then many things get simpler.  If we're not, it's gonna cost us in
> code complexity and hence bugs.  Maybe the usability gain is worth it,
> or maybe not.
> 
> I think it would probably be worth the trouble to pursue both designs in
> parallel for awhile, so we can get a better handle on exactly how much
> complexity we're buying into with the more ambitious definition.

What I like about ALTER SYSTEM READ ONLY is that it basically would
likely be a part of both a restart and a non-restart based
implementation.

I don't really get why the demote in this thread is mentioned as an
alternative - it pretty obviously has to include a large portion of
ALTER SYSTEM READ ONLY.

The only part that could really be skipped by going straight to demote
is a way to make ASRO invocable directly. You can simplify a bit more by
killing all user sessions, but at that point there's not that much
upshot for having no-restart version of demote in the first place.

The demote patch in this thread doesn't even start to attack much of the
real complexity around turning a primary into a standby.

To me the complexity of a restartless demotion are likely worth it. But
it doesn't seem feasible to get there in one large step. So adding
individually usable sub-steps like ASRO makes sense imo.

Greetings,

Andres Freund

Re: Internal key management system

2020-06-18 Thread Cary Huang

Hi all

Having read through the discussion, I have some comments and suggestions that I 
would like to share. 

I think it is still quite early to even talk about external key management 
system even if it is running on the same host as PG. This is most likely 
achieved as an extension that can provide communication to external key server 
and it would be a separate project/discussion. I think the focus now is to 
finalize on the internal KMS design, and we can discuss about how to migrate 
internally managed keys to the external when the time is right.

Key management system is generally built to manage the life cycle of 
cryptographic keys, so our KMS in my opinion needs to be built with key life 
cycle in mind such as:

* Key generation
* key protection
* key storage
* key rotation

* key rewrap
* key disable/enable
* key destroy

KMS should not perform the above life cycle management by itself automatically 
or hardcoded, instead it should expose some interfaces to the end user or even 
a backend process like TDE to trigger the above. 

The only key KMS should manage by itself is the KEK, which is derived from 
cluster passphrase value. This is fine in my opinion. This KEK should exist 
only within KMS to perform key protection (by wrapping) and key storage (save 
as file).

The other life cycle stages should be just left as interfaces, waiting for 
somebody to request KMS to perform. Somebody could be end user or back end 
process like TDE.

Using TDE as example, when TDE initializes, it calls KMS's key_generation 
interface to get however many keys that it needs, KMS should not return the 
keys in clear text of hex, it can return something like a key ID.

Using regular user as example, each user can also call KMS's key_generation 
interface to create a cryptographic key for their own purpose, KMS should also 
return just an key ID and this key should be bound to the user and we can limit 
that each user can have only one key managed, and regular user can only manage 
his/her own key with KMS, rotate, destroy, disable..etc; he/she cannot manage 
others' keys

super user (or key admin), however, can do all kinds of management to all keys, 
(generated by TDE or by other users). He or she can do key rotation, key 
rewrap, disable or destroy. Here we will need to think about how to prevent 
this user from misusing the key management functions. 

Super user should also be able to view the status of all the keys managed, 
information such as: 

* date of generation

* key ID

* owner

* status

* key length

* suggested date of rotation..etc etc

* expiry date??

to actually perform the encryption with keys managed by internal KMS, I suggest 
adding a set of wrapper functions within KMS using the Key ID as input 
argument. So for example, TDE wants to encrypt some data, it will call KMS's 
wrapper encryption function with Key ID supplied, KMS looked up the key with  
key ID ,verify caller's permission and translate these parameters and feed to 
pgcrypto for example. This may be a little slower due to the lookup, but we can 
have a variation of the function where KMS can look up the key with supplied 
key ID and convert it to encryption context and return it back to TDE. Then TDE 
can use this context to call another wrapper function for encryption without 
lookup all the time. If an end user wants to encrypt something, it will also 
call KMS's wrapper function and supply the key ID in the same way. 

I know that there is a discussion on moving the cryptographic functions as 
extension. In an already running PG, it is fine, but when TDE and XLOG 
bootstraps during initdb, it requires cryptographic function to encrypt the 
initial WAL file and during initdb i dont think it has access to any 
cryptographic function provided by the extension. we may need to include 
limited cryptographic function within KMS and TDE so it is enough to finish 
initdb with intial WAl encrypted.

This is just my thought how this KMS and TDE should look like. 

Best

Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca

 On Tue, 16 Jun 2020 23:52:03 -0700 Masahiko Sawada 
 wrote 

On Sun, 14 Jun 2020 at 19:39, Fabien COELHO  wrote: 
> 
> 
> Hello Masahiko-san, 
> 
> >>> * The external place needs to manage more encryption keys than the 
> >>> current patch does. 
> >> 
> >> Why? If the external place is just a separate process on the same host, 
> >> probably it would manage the very same amount as what your patch. 
> > 
> > In the current patch, the external place needs to manage only one key 
> > whereas postgres needs to manages multiple DEKs. But with your idea, 
> > the external place needs to manage both KEK and DEKs. 
> 
> Hmmm. I do not see a good use case for a "management system" which would 
> only have to manage a singleton. ISTM that one point of using one KEK is 
> to allows several DEKs under it. Ma

Re: [patch] demote

2020-06-18 Thread Tom Lane

Robert Haas  writes:
> ... To go back to recovery rather than just to a read-only
> state, I think you'd need to grapple with some additional issues that
> patch doesn't touch, like some of the snapshot-taking stuff, but I
> think you still need to solve all of the problems that it does deal
> with, unless you're OK with killing every session.

It seems like this is the core decision that needs to be taken.  If
we're willing to have these state transitions include a server restart,
then many things get simpler.  If we're not, it's gonna cost us in
code complexity and hence bugs.  Maybe the usability gain is worth it,
or maybe not.

I think it would probably be worth the trouble to pursue both designs in
parallel for awhile, so we can get a better handle on exactly how much
complexity we're buying into with the more ambitious definition.

regards, tom lane

Re: [patch] demote

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 12:55 PM Fujii Masao
 wrote:
> Even with ASRO, the server restart is necessary and RO sessions are
> kicked out when demoting RO primary to a standby, i.e., during a clean
> switchover?

The ASRO patch doesn't provide a way to put a running server to be put
back into recovery, so yes, that is required, unless some other patch
fixes it so that it isn't. It wouldn't be better to find a way where
we never need to kill of R/O sessions at all, and I think that would
require all the machinery from the ASRO patch plus some more. If you
want to allow sessions to survive a state transition like this -
whether it's to a WAL-read-only state or all the way back to recovery
- you need a way to prevent further WAL writes in those sessions. Most
of the stuff that the ASRO patch does is concerned with that. So it
doesn't seem likely to me that we can just throw all that code away,
unless by chance somebody else has got a better version of the same
thing already. To go back to recovery rather than just to a read-only
state, I think you'd need to grapple with some additional issues that
patch doesn't touch, like some of the snapshot-taking stuff, but I
think you still need to solve all of the problems that it does deal
with, unless you're OK with killing every session.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [patch] demote

2020-06-18 Thread Andres Freund

Hi,

On 2020-06-18 21:41:45 +0900, Fujii Masao wrote:
> ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
> What about the following procedure?
> 
> 1. Demote the primary to a standby. Then this demoted standby is read-only.

As far as I can tell this step includes ALTER SYSTEM READ ONLY. Sure you
can choose not to expose ASRO to the user separately from demote, but
that's a miniscule part of the complexity.

Greetings,

Andres Freund

Re: [patch] demote

2020-06-18 Thread Andres Freund

Hi,

On 2020-06-18 12:16:27 +0200, Jehan-Guillaume de Rorthais wrote:
> On Wed, 17 Jun 2020 11:14:47 -0700
> > I don't think there's a fundamental issue, but I think it needs to deal
> > with a lot more things than it does right now. StartupXLOG doesn't
> > currently deal correctly with subsystems that are already
> > initialized. And your patch doesn't make it so as far as I can tell.
> 
> If you are talking about bgwriter, checkpointer, etc, as far as I understand
> the current state machine, my patch actually deal with them.

I'm talking about subsystems like subtrans multixact etc not being ok
with suddenly being initialized a second time. You cannot call
StartupXLOG twice without making modifications to it.

Greetings,

Andres Freund

Re: Operator class parameters and sgml docs

2020-06-18 Thread Alexander Korotkov

On Wed, Jun 17, 2020 at 2:00 PM Alexander Korotkov
 wrote:
> On Wed, Jun 17, 2020 at 2:50 AM Peter Geoghegan  wrote:
> > On Tue, Jun 16, 2020 at 4:24 AM Alexander Korotkov
> >  wrote:
> > > Thank you for patience.  The documentation patch is attached.  I think
> > > it requires review by native english speaker.
> >
> > * "...paramaters that controls" should be "...paramaters that control".
> >
> > * "with set of operator class specific option" should be "with a set
> > of operator class specific options".
> >
> > * "The options could be accessible from each support function" should
> > be "The options can be accessed from other support functions"
>
> Fixed, thanks!
>
> > It's very hard to write documentation like this, even for native
> > English speakers. I think that it's important to have something in
> > place, though. The GiST example helps a lot.
>
> I've added a complete example for defining a set of parameters and
> accessing them from another support function to the GiST
> documentation.

I'm going to push this patch if there are no objections.  I'm almost
sure that documentation of opclass options will require further
adjustments.  However, I think the current patch makes it better, not
worse.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-18 Thread Tom Lane

Robert Haas  writes:
> It's a little confusing, though, that you documented it as Mm.n.d but
> then in the text the order of explanation is d then m then n. Maybe
> switch the text around so the order matches, or even use something
> like Mmonth.occurrence.day.

Yeah, I struggled with that text for a bit.  It doesn't seem to make sense
to explain that n means the n'th occurrence of a particular d value before
we've explained what d is, so explaining the fields in their syntactic
order seems like a loser.  But we could describe m first without that
problem.

Not sure about replacing the m/n/d notation --- that's straight out of
POSIX, so inventing our own terminology might just confuse people who
do know the spec.

regards, tom lane

Re: jsonpath versus NaN

2020-06-18 Thread Alexander Korotkov

On Thu, Jun 18, 2020 at 7:45 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > Thank you for your answer. I'm trying to understand your point.
> > Standard claims that .double() method should behave the same way as
> > CAST to double.  However, standard references the standard behavior of
> > CAST here, not behavior of your implementation of CAST.  So, if we
> > extend the functionality of standard CAST in our implementation, that
> > doesn't automatically mean we should extend the .double() jsonpath
> > method in the same way.  Is it correct?
>
> Right.  We could, if we chose, extend jsonpath to allow Inf/NaN, but
> I don't believe there's an argument that the spec requires us to.
>
> Also the larger point is that it doesn't make sense to extend jsonpath
> that way when we haven't extended json(b) that way.  This code wart
> wouldn't exist were it not for that inconsistency.  Also, I find it hard
> to see why anyone would have a use for NaN in a jsonpath test when they
> can't write NaN in the input json data, nor have it be correctly reflected
> into output json data either.

Ok, I got the point.  I have nothing against removing support of NaN
in jsonpath as far as it doesn't violates the standard.  I'm going to
write the patch for this.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [patch] demote

2020-06-18 Thread Fujii Masao





On 2020/06/19 0:22, Robert Haas wrote:

On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao  wrote:

ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.
2. The orignal standby automatically establishes the cascading replication
 connection with the demoted standby.
3. Wait for all the WAL records available in the demoted standby to be streamed
 to the orignal standby.
4. Promote the original standby to new primary.
5. Change primary_conninfo in the demoted standby so that it establishes
 the replication connection with new primary.

So it seems enough to implement "demote" feature for a clean switchover.


There's something to that idea. I think it somewhat depends on how
invasive the various operations are. For example, I'm not really sure
how feasible it is to demote without a full server restart that kicks
out all sessions. If that is required, it's a significant disadvantage
compared to ASRO.


Even with ASRO, the server restart is necessary and RO sessions are
kicked out when demoting RO primary to a standby, i.e., during a clean
switchover?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Mark btree_gist functions as PARALLEL SAFE

2020-06-18 Thread Winfield, Steven

Done - thanks again.

Steven.

From: Tom Lane 
Sent: 18 June 2020 16:28
To: Winfield, Steven 
Cc: pgsql-hackers@lists.postgresql.org 
Subject: Re: Mark btree_gist functions as PARALLEL SAFE

"Winfield, Steven"  writes:
> On the back of this thread[1] over at pgsql-general, I've attached a patch 
> that marks the functions in btree_gist as PARALLEL SAFE.

Cool, please add this patch to the commitfest queue to make sure we
don't lose track of it:

https://urldefense.proofpoint.com/v2/url?u=https-3A__commitfest.postgresql.org_28_&d=DwIFAg&c=55O-mPK0zNHOMgGDdj4__Q&r=JVb896wEUUvWD7VB-jCoEHWXWQRzU6xqZ3aOcIepVzQ&m=oE70Ndze7YukhfkcIi70pjHsKB2zAgJuNkcyEXLkSso&s=G4QLL-xXQrC_txQXOnZGH8bD5kq6sOPh5BY-DyWA9fw&e=

(You'll need to have a community-website login if you don't already)

regards, tom lane

** Cantab Capital Partners LLP is now named GAM Systematic LLP. Please note 
that our email addresses have changed from @cantabcapital.com to @gam.com.**

This email was sent by and on behalf of GAM Investments. GAM Investments is the 
corporate brand for GAM Holding AG and its direct and indirect subsidiaries. 
These companies may be referred to as 'GAM' or 'GAM Investments'. In the United 
Kingdom, the business of GAM Investments is conducted by GAM (U.K.) Limited 
(No. 01664573) or one or more entities under the control of GAM (U.K.) Limited, 
including the following entities authorised and regulated by the Financial 
Conduct Authority: GAM International Management Limited (No. 01802911), GAM 
London Limited (No. 00874802), GAM Sterling Management Limited (No. 01750352), 
GAM Unit Trust Management Company Limited (No. 2873560) and GAM Systematic LLP 
(No. OC317557). GAM (U.K.) Limited and its regulated entities are registered in 
England and Wales. The registered office and principal place of business of GAM 
(U.K.) Limited and its regulated entities is at 8 Finsbury Circus, London, 
England, EC2M 7GB. The registered office of GAM Systematic LLP is at City 
House, Hills Road, Cambridge, CB2 1RE. This email, and any attachments, is 
confidential and may be privileged or otherwise protected from disclosure. It 
is intended solely for the stated addressee(s) and access to it by any other 
person is unauthorised. If you are not the intended recipient, you must not 
disclose, copy, circulate or in any other way use or rely on the information 
contained herein. If you have received this email in error, please inform us 
immediately and delete all copies of it. See - 
https://www.gam.com/en/legal/email-disclosures-eu/ for further information on 
confidentiality, the risks of non-secure electronic communication, and certain 
disclosures which we are required to make in accordance with applicable 
legislation and regulations. If you cannot access this link, please notify us 
by reply message and we will send the contents to you. GAM Investments will 
collect and use information about you in the course of your interactions with 
us. Full details about the data types we collect and what we use this for and 
your related rights is set out in our online privacy policy at 
https://www.gam.com/en/legal/privacy-policy. Please familiarise yourself with 
this policy and check it from time to time for updates as it supplements this 
notice.

Re: jsonpath versus NaN

2020-06-18 Thread Tom Lane

Alexander Korotkov  writes:
> Thank you for your answer. I'm trying to understand your point.
> Standard claims that .double() method should behave the same way as
> CAST to double.  However, standard references the standard behavior of
> CAST here, not behavior of your implementation of CAST.  So, if we
> extend the functionality of standard CAST in our implementation, that
> doesn't automatically mean we should extend the .double() jsonpath
> method in the same way.  Is it correct?

Right.  We could, if we chose, extend jsonpath to allow Inf/NaN, but
I don't believe there's an argument that the spec requires us to.

Also the larger point is that it doesn't make sense to extend jsonpath
that way when we haven't extended json(b) that way.  This code wart
wouldn't exist were it not for that inconsistency.  Also, I find it hard
to see why anyone would have a use for NaN in a jsonpath test when they
can't write NaN in the input json data, nor have it be correctly reflected
into output json data either.

Maybe there's a case for extending json(b) that way; it wouldn't be so
different from the work I'm doing nearby to extend type numeric for
infinities.  But we'd have to have a conversation about whether
interoperability with other JSON implementations is worth sacrificing
to improve consistency with our float and numeric datatypes.  In the
meantime, though, we aren't allowing Inf/NaN in json(b) so I don't think
jsonpath should accept them either.

regards, tom lane

Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 12:26 AM Tom Lane  wrote:
> Here's a proposed patch to do that.  To explain this, we more or less
> have to fully document the POSIX timezone string format (otherwise
> nobody's gonna understand what "M3.2.0,M11.1.0" means).  That's something
> we've glossed over for many years, and I still feel like it's not
> something to explain in-line in section 8.5.3, so I shoved all the gory
> details into a new section in Appendix B.  To be clear, nothing here is
> new behavior, it was just undocumented before.

I'm glad you are proposing to document this, because the set of people
who had no idea what "M3.2.0,M11.1.0" means definitely included me.
It's a little confusing, though, that you documented it as Mm.n.d but
then in the text the order of explanation is d then m then n. Maybe
switch the text around so the order matches, or even use something
like Mmonth.occurrence.day.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: jsonpath versus NaN

2020-06-18 Thread Alexander Korotkov

On Thu, Jun 18, 2020 at 7:34 PM Alexander Korotkov
 wrote:
> Thank you for your answer. I'm trying to understand your point.
> Standard claims that .double() method should behave the same way as
> CAST to double.  However, standard references the standard behavior of
> CAST here, not behavior of your implementation of CAST.

Typo here: please read "our implementation of CAST" here.

> So, if we
> extend the functionality of standard CAST in our implementation, that
> doesn't automatically mean we should extend the .double() jsonpath
> method in the same way.  Is it correct?


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: jsonpath versus NaN

2020-06-18 Thread Tom Lane

Robert Haas  writes:
> On Thu, Jun 18, 2020 at 11:51 AM Oleg Bartunov  
> wrote:
>> The problem is that we tried to find a trade-off  between standard and 
>> postgres
>> implementation, for example, in postgres CAST  allows NaN and Inf, and SQL 
>> Standard
>> requires .double should works as CAST.

> It seems like the right thing is to implement the standard, not to
> implement whatever PostgreSQL happens to do in other cases. I can't
> help feeling like re-using the numeric data type for other things has
> led to this confusion. I think that fails in other cases, too: like
> what if you have a super-long integer that can't be represented as a
> numeric? I bet jsonb will fail, or maybe it will convert it to a
> string, but I don't see how it can do anything else.

Actually, the JSON spec explicitly says that any number that doesn't fit
in an IEEE double isn't portable [1].  So we're already very far above and
beyond the spec's requirements by using numeric.  We don't need to improve
on that.  But I concur with your point that just because PG does X in
some other cases doesn't mean that we must do X in json or jsonpath.

regards, tom lane

[1] https://tools.ietf.org/html/rfc7159#page-6

   This specification allows implementations to set limits on the range
   and precision of numbers accepted.  Since software that implements
   IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
   generally available and widely used, good interoperability can be
   achieved by implementations that expect no more precision or range
   than these provide, in the sense that implementations will
   approximate JSON numbers within the expected precision.  A JSON
   number such as 1E400 or 3.141592653589793238462643383279 may indicate
   potential interoperability problems, since it suggests that the
   software that created it expects receiving software to have greater
   capabilities for numeric magnitude and precision than is widely
   available.

   Note that when such software is used, numbers that are integers and
   are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
   sense that implementations will agree exactly on their numeric
   values.

Re: Cleanup - Removal of unused function parameter from CopyReadBinaryAttribute

2020-06-18 Thread Fujii Masao





On 2020/06/18 23:09, Bharath Rupireddy wrote:

On Thu, Jun 18, 2020 at 7:01 PM vignesh C  wrote:


Hi,

While checking copy from code I found that the function parameter
column_no is not used in CopyReadBinaryAttribute. I felt this could be
removed.
Attached patch contains the changes for the same.
Thoughts?



I don't see any problem in removing this extra parameter.

However another thought, can it be used to report a bit meaningful
error for field size < 0 check?


column_no was used for that purpose in the past, but commit 0e319c7ad7
changed that. If we want to use column_no in the log message again,
it's better to check why commit 0e319c7ad7 got rid of column_no from
the message.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: jsonpath versus NaN

2020-06-18 Thread Alexander Korotkov

Tom,

On Thu, Jun 18, 2020 at 7:07 PM Tom Lane  wrote:
> Oleg Bartunov  writes:
> > The problem is that we tried to find a trade-off between standard and
> > postgres implementation, for example, in postgres CAST allows NaN and
> > Inf, and SQL Standard requires .double should works as CAST.
>
> As I said, I think this is a fundamental misreading of the standard.
> The way I read it is that it requires the set of values that are legal
> according to the standard to be processed the same way as CAST would.

Thank you for your answer. I'm trying to understand your point.
Standard claims that .double() method should behave the same way as
CAST to double.  However, standard references the standard behavior of
CAST here, not behavior of your implementation of CAST.  So, if we
extend the functionality of standard CAST in our implementation, that
doesn't automatically mean we should extend the .double() jsonpath
method in the same way.  Is it correct?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [patch] demote

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 11:56 AM Jehan-Guillaume de Rorthais
 wrote:
> Considering the current demote patch improvement. I was considering to digg in
> the following direction:
>
> * add a new state in the state machine where all backends are idle
> * this new state forbid any new writes, the same fashion we do on standby 
> nodes
> * this state could either wait for end of xact, or cancel/kill
>   RW backends, in the same fashion current smart/fast stop do
> * from this state, we might then rollback pending prepared xact, stop other
>   sub-process etc (as the current patch does), and demote safely to
>   PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).
>
> Is it something worth considering?
> Maybe the code will be so close from ASRO, it would just be kind of a fusion 
> of
> both patch? I don't know, I didn't look at the ASRO patch yet.

I don't think that the postmaster state machine is the interesting
part of this problem. The tricky parts have to do with updating shared
memory state, and with updating per-backend private state. For
example, snapshots are taken in a different way during recovery than
they are in normal operation, hence SnapshotData's takenDuringRecovery
member. And I think that we allocate extra shared memory space for
storing the data that those snapshots use if, and only if, the server
starts up in recovery. So if the server goes backward from normal
running into recovery, we might not have the space that we need in
shared memory to store the extra data, and even if we had the space it
might not be populated correctly, and the code that takes snapshots
might not be written properly to handle multiple transitions between
recovery and normal running, or even a single backward transition.

In general, there's code scattered all throughout the system that
assumes the recovery -> normal running transition is one-way. If we go
back into recovery by killing off all backends and reinitializing
shared memory, then we don't have to worry about that stuff. If we do
anything less than that, we have to find all the code that relies on
never reentering recovery and fix it all. Now it's also true that we
have to do some other things, like restarting the startup process, and
stopping things like autovacuum, and the postmaster may need to be
involved in some of that. There's clearly some engineering work there,
but I think it's substantially less than the amount of engineering
work involved in fixing problems with shared memory contents and
backend-local state.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-18 Thread Tom Lane

Robert Haas  writes:
> On Thu, Jun 18, 2020 at 11:59 AM Tom Lane  wrote:
>> Sure, but wouldn't making the SpinLockAcquire layer into static inlines be
>> sufficient to address that point, with no need to touch s_lock.h at all?

> I mean, wouldn't you then end up with a bunch of 1-line functions
> where you can step into the function but not through whatever
> individual things it does?

Not following your point.  The s_lock.h implementations tend to be either
simple C statements ("*lock = 0") or asm blocks; if you feel a need to
step through them you're going to be resorting to "si" anyway.

I think the main usefulness of doing anything here would be (a) separating
the spinlock infrastructure from callers and (b) ensuring that we have a
declared argument type, and single-evaluation semantics, for the spinlock
function parameters.  Both of those are adequately addressed by fixing
spin.h, IMO anyway.

regards, tom lane

Re: jsonpath versus NaN

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 11:51 AM Oleg Bartunov  wrote:
> The problem is that we tried to find a trade-off  between standard and 
> postgres
> implementation, for example, in postgres CAST  allows NaN and Inf, and SQL 
> Standard
> requires .double should works as CAST.

It seems like the right thing is to implement the standard, not to
implement whatever PostgreSQL happens to do in other cases. I can't
help feeling like re-using the numeric data type for other things has
led to this confusion. I think that fails in other cases, too: like
what if you have a super-long integer that can't be represented as a
numeric? I bet jsonb will fail, or maybe it will convert it to a
string, but I don't see how it can do anything else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 11:59 AM Tom Lane  wrote:
> Sure, but wouldn't making the SpinLockAcquire layer into static inlines be
> sufficient to address that point, with no need to touch s_lock.h at all?

I mean, wouldn't you then end up with a bunch of 1-line functions
where you can step into the function but not through whatever
individual things it does?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: jsonpath versus NaN

2020-06-18 Thread Tom Lane

Oleg Bartunov  writes:
> The problem is that we tried to find a trade-off between standard and
> postgres implementation, for example, in postgres CAST allows NaN and
> Inf, and SQL Standard requires .double should works as CAST.

As I said, I think this is a fundamental misreading of the standard.
The way I read it is that it requires the set of values that are legal
according to the standard to be processed the same way as CAST would.

While we certainly *could* choose to extend jsonpath, and/or jsonb
itself, to allow NaN/Inf, I do not think that it's sane to argue that
the standard requires us to do that; the wording in the opposite
direction is pretty clear.  Also, I do not find it convincing to
extend jsonpath that way when we haven't extended jsonb.  Quite aside
from the ensuing code warts, what in the world is the use-case?

regards, tom lane

Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-18 Thread Tom Lane

Robert Haas  writes:
> On Wed, Jun 17, 2020 at 5:29 PM Tom Lane  wrote:
>> I wouldn't object to making the outer-layer macros in spin.h into static
>> inlines; as mentioned, that might have some debugging benefits.  But I
>> think messing with s_lock.h for marginal cosmetic reasons is a foolish
>> idea.  For one thing, there's no way whoever does it can verify all the
>> architecture-specific stanzas.  (I don't think we even have all of them
>> covered in the buildfarm.)

> It would be a pretty mechanical change to use a separate preprocessor
> symbol for the conditional and just define the static inline functions
> on the spot. There might be one or two goofs, but if those platforms
> are not in the buildfarm, they're either dead and they don't matter,
> or someone will tell us what we did wrong. I don't know. I don't have
> a huge desire to spend time cleaning up s_lock.h and I do think it's
> better not to churn stuff around just for the heck of it, but I'm also
> sympathetic to Andres's point that using macros everywhere is
> debugger-unfriendly.

Sure, but wouldn't making the SpinLockAcquire layer into static inlines be
sufficient to address that point, with no need to touch s_lock.h at all?

regards, tom lane

Re: [patch] demote

2020-06-18 Thread Jehan-Guillaume de Rorthais

On Thu, 18 Jun 2020 11:22:47 -0400
Robert Haas  wrote:

> On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao 
> wrote:
> > ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
> > What about the following procedure?
> >
> > 1. Demote the primary to a standby. Then this demoted standby is read-only.
> > 2. The orignal standby automatically establishes the cascading replication
> > connection with the demoted standby.
> > 3. Wait for all the WAL records available in the demoted standby to be
> > streamed to the orignal standby.
> > 4. Promote the original standby to new primary.
> > 5. Change primary_conninfo in the demoted standby so that it establishes
> > the replication connection with new primary.
> >
> > So it seems enough to implement "demote" feature for a clean switchover.  
> 
> There's something to that idea. I think it somewhat depends on how
> invasive the various operations are. For example, I'm not really sure
> how feasible it is to demote without a full server restart that kicks
> out all sessions. If that is required, it's a significant disadvantage
> compared to ASRO. On the other hand, if a machine can be demoted just
> by kicking out R/W sessions, as ASRO currently does, then maybe
> there's not that much difference. Or maybe both designs are subject to
> improvement and we can do something even less invasive...

Considering the current demote patch improvement. I was considering to digg in
the following direction:

* add a new state in the state machine where all backends are idle
* this new state forbid any new writes, the same fashion we do on standby nodes
* this state could either wait for end of xact, or cancel/kill
  RW backends, in the same fashion current smart/fast stop do
* from this state, we might then rollback pending prepared xact, stop other
  sub-process etc (as the current patch does), and demote safely to
  PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).

Is it something worth considering?
Maybe the code will be so close from ASRO, it would just be kind of a fusion of
both patch? I don't know, I didn't look at the ASRO patch yet.

> One thing I think people are going to want to do is have the master go
> read-only if it loses communication to the rest of the network, to
> avoid or at least mitigate split-brain. However, such network
> interruptions are often transient, so it might not be uncommon to
> briefly go read-only due to a network blip, but then recover quickly
> and return to a read-write state. It doesn't seem to matter much
> whether that read-only state is a new kind of normal operation (like
> what ASRO would do) or whether we've actually returned to a recovery
> state (as demote would do) but the collateral effects of the state
> change do matter.

Well, triggering such actions (demote or read only) often occurs external
decision, hopefully relying on at least some quorum and being able to escalade
to watchdog or fencing is required.

Most tools around will need to demote or fence. It seems dangerous to flip
between read only/read write on a bunch of cluster nodes. It might be quickly
messy, especially since a former primary with non replicated data could
automatically replicate from a new primary without screaming...

Regards,

Re: jsonpath versus NaN

2020-06-18 Thread Oleg Bartunov

On Wed, Jun 17, 2020 at 6:33 PM Tom Lane  wrote:

> Alexander Korotkov  writes:
> > On Thu, Jun 11, 2020 at 10:00 PM Tom Lane  wrote:
> >> I don't think this is very relevant.  The SQL standard has not got the
> >> concepts of Inf or NaN either (see 4.4.2 Characteristics of numbers),
> >> therefore their definition is only envisioning that a string
> representing
> >> a normal finite number should be castable to DOUBLE PRECISION.  Thus,
> >> both of the relevant standards think that "numbers" are just finite
> >> numbers.
>
> > Yes, I see.  No standard insists we should support NaN.  However,
> > standard claims .double() should behave the same as CAST to double.
> > So, I think if CAST supports NaN, but .double() doesn't, it's still a
> > violation.
>
> No, I think you are completely misunderstanding the standard.  They
> are saying that strings that look like legal numbers according to SQL
> should be castable into numbers.  But NaN and Inf are not legal
> numbers according to SQL, so there is nothing in that text that
> justifies accepting "NaN".  Nor does the JSON standard provide any
> support for that position.  So I think it is fine to leave NaN/Inf
> out of the world of what you can write in jsonpath.
>

rfc and sql json forbid Nan and Inf ( Technical Report is freely available,
https://standards.iso.org/ittf/PubliclyAvailableStandards/c067367_ISO_IEC_TR_19075-6_2017.zip
)

Page 10 JSON terminology.
“A sequence comprising an integer part, optionally followed by a fractional
part and/or
an exponent part (non-numeric values, such as infinity and NaN are not
permitted)”


>
> I'd be more willing to let the code do this if it didn't require such
> a horrid, dangerous kluge to do so.  But it does, and I don't see any
> easy way around that, so I think we should just take out the kluge.
> And do so sooner not later, before some misguided user starts to
> depend on it.
>

The problem is that we tried to find a trade-off  between standard and
postgres
implementation, for example, in postgres CAST  allows NaN and Inf, and SQL
Standard
requires .double should works as CAST.

 SELECT 'nan'::real, 'inf'::real;
 float4 |  float4
+--
NaN | Infinity
(1 row)



>
> regards, tom lane
>
>
>

-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-18 Thread Robert Haas

On Wed, Jun 17, 2020 at 5:29 PM Tom Lane  wrote:
> See the "Default Definitions", down near the end.

Oh, yeah. I didn't realize you meant just inside this file itself.
That is slightly awkward. I initially thought there was no problem
because there seem to be no remaining non-default definitions of
S_LOCK, but I now see that the other macros still do have some
non-default definitions. Hmm.

> > Really? Multiple layers of macros seem like they pretty clearly make
> > the source code harder to understand. There are plenty of places where
> > such devices are necessary for one reason or another, but it doesn't
> > seem like something we ought to keep around for no reason.
>
> I wouldn't object to making the outer-layer macros in spin.h into static
> inlines; as mentioned, that might have some debugging benefits.  But I
> think messing with s_lock.h for marginal cosmetic reasons is a foolish
> idea.  For one thing, there's no way whoever does it can verify all the
> architecture-specific stanzas.  (I don't think we even have all of them
> covered in the buildfarm.)

It would be a pretty mechanical change to use a separate preprocessor
symbol for the conditional and just define the static inline functions
on the spot. There might be one or two goofs, but if those platforms
are not in the buildfarm, they're either dead and they don't matter,
or someone will tell us what we did wrong. I don't know. I don't have
a huge desire to spend time cleaning up s_lock.h and I do think it's
better not to churn stuff around just for the heck of it, but I'm also
sympathetic to Andres's point that using macros everywhere is
debugger-unfriendly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-18 Thread Dilip Kumar

On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila  wrote:
>
> On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila  wrote:
> >
> > I think one of the usages we still need is in ReorderBufferForget
> > because it can be called when we skip processing the txn.  See the
> > comments in DecodeCommit where we call this function.  If I am
> > correct, we need to probably collect all invalidations in
> > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> >
>
> One more point related to this is that after this patch series, we
> need to consider executing all invalidation during transaction abort.
> Because it is possible that due to memory overflow, we have processed
> some of the messages which also contain a few XACT_INVALIDATION
> messages, so to avoid cache pollution, we need to execute all of them
> in abort.  We also do the similar thing in Rollback/Rollback To
> Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.

Yes, we need to do that,  So now we are collecting all the
invalidation under txn->invalidation so they are getting executed on
abort.

>
> Few other comments on
> 0002-Issue-individual-invalidations-with-wal_level-lo.patch
> ---
> 1.
> + if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
> + {
> + ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
> + MakeSharedInvalidMessagesArray);
> + invalMessages = SharedInvalidMessagesArray;
> + nmsgs  = numSharedInvalidMessagesArray;
> + SharedInvalidMessagesArray = NULL;
> + numSharedInvalidMessagesArray = 0;
>
> a. Immediately after ProcessInvalidationMessagesMulti, isn't it better
> to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0
> && SharedInvalidMessagesArray == NULL));?

Done

> b. Why check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)" is
> required?  If you see xactGetCommittedInvalidationMessages where we do
> something similar, we only check for valid value of transInvalInfo and
> here we check the same in the caller of LogLogicalInvalidations, isn't
> that sufficient?  If that is sufficient, we can either have the same
> check here or have an Assert for the same.

I have put the same check here.

>
> 2.
> @@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
>   if (transInvalInfo == NULL)
>   return;
>
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
> +
>   ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
>   LocalExecuteInvalidationMessage);
> Generally, we WAL log the action after performing it but here you are
> writing WAL first.  Is there any specific reason?  If so, can we write
> a comment about the same?

Yeah, there is no reason for the same so moved it down.

>
> 3.
> + * When wal_level=logical, write invalidations into WAL at each command end 
> to
> + * support the decoding of the in-progress transaction.  As of now it was
> + * enough to log invalidation only at commit because we are only decoding the
> + * transaction at the commit time.   We only need to log the catalog cache 
> and
> + * relcache invalidation.  There can not be any active MVCC scan in logical
> + * decoding so we don't need to log the snapshot invalidation.
>
> I think this comment doesn't hold good after we have changed the patch
> to LOG invalidations at the time of CCI.

Right, modified.

>
> 4.
> +
> +/*
> + * Emit WAL for invalidations.
> + */
> +static void
> +LogLogicalInvalidations()
>
> Add the function name atop of this function in comments to match the
> style with other nearby functions.  How about modifying it to
> something like: "Emit WAL for invalidations.  This is currently only
> used for logging invalidations at the command end."

Done

>
> 5.
> + *
> + * XXX Do we need to care about relcacheInitFileInval and
> + * the other fields added to ReorderBufferChange, or just
> + * about the message itself?
> + */
>
> I don't think we need to do anything about relcacheInitFileInval.
> This is used to remove the stale files (RELCACHE_INIT_FILENAME) that
> have obsolete information about relcache.  The walsender process that
> is doing decoding doesn't require us to do anything about this.  Also,
> if you see before this patch, we don't do anything about relcache
> files during decoding of invalidation messages.  In short, I think we
> can remove this comment unless you see some use of it.

Now, we have removed the Invalidation change itself so this comment is gone.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Mark btree_gist functions as PARALLEL SAFE

2020-06-18 Thread Tom Lane

"Winfield, Steven"  writes:
> On the back of this thread[1] over at pgsql-general, I've attached a patch 
> that marks the functions in btree_gist as PARALLEL SAFE.

Cool, please add this patch to the commitfest queue to make sure we
don't lose track of it:

https://commitfest.postgresql.org/28/

(You'll need to have a community-website login if you don't already)

regards, tom lane

Re: [patch] demote

2020-06-18 Thread Jehan-Guillaume de Rorthais

On Thu, 18 Jun 2020 11:15:02 -0400
Robert Haas  wrote:

> On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais
>  wrote:
> > At some expense, Admin can already set the system as readonly from the
> > application point of view, using:
> >
> >   alter system set default_transaction_read_only TO on;
> >   select pg_reload_conf();
> >
> > Current RW xact will finish, but no other will be allowed.  
> 
> That doesn't block all WAL generation, though:
> 
> rhaas=# alter system set default_transaction_read_only TO on;
> ALTER SYSTEM
> rhaas=# select pg_reload_conf();
>  pg_reload_conf
> 
>  t
> (1 row)
> rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
> rhaas=#

Yes, this, and the fact that any user can switch transaction_read_only back to
on easily. This was a terrible example.

My point was that ALTER SYSTEM READ ONLY as described here doesn't feel like a
required user feature, outside of the demote scope. It might be useful for the
demote process, but only from the core point of view, without user interaction.
It seems there's no other purpose from the admin standpoint.

> There's a bunch of other things it also doesn't block, too. If you're
> trying to switch to a new primary, you really want to stop WAL
> generation completely on the old one. Otherwise, you can't guarantee
> that the machine you're going to promote is completely caught up,
> which means you might lose some changes, and you might have to
> pg_rewind the old master.

Yes, of course. I wasn't explaining transaction_read_only was useful in a 
switchover procedure, sorry for the confusion and misleading comment.

Regards,

Re: Parallel Seq Scan vs kernel read ahead

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 6:15 AM David Rowley  wrote:
> With a 32TB relation, the code will make the chunk size 16GB.  Perhaps
> I should change the code to cap that at 1GB.

It seems pretty hard to believe there's any significant advantage to a
chunk size >1GB, so I would be in favor of that change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 11:08 AM Jehan-Guillaume de Rorthais
 wrote:
> Thanks to cascading replication, it could be very possible without this READ
> ONLY mode, just in recovery mode, isn't it?

Yeah, perhaps. I just wrote an email about that over on the demote
thread, so I won't repeat it here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [patch] demote

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao  wrote:
> ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
> What about the following procedure?
>
> 1. Demote the primary to a standby. Then this demoted standby is read-only.
> 2. The orignal standby automatically establishes the cascading replication
> connection with the demoted standby.
> 3. Wait for all the WAL records available in the demoted standby to be 
> streamed
> to the orignal standby.
> 4. Promote the original standby to new primary.
> 5. Change primary_conninfo in the demoted standby so that it establishes
> the replication connection with new primary.
>
> So it seems enough to implement "demote" feature for a clean switchover.

There's something to that idea. I think it somewhat depends on how
invasive the various operations are. For example, I'm not really sure
how feasible it is to demote without a full server restart that kicks
out all sessions. If that is required, it's a significant disadvantage
compared to ASRO. On the other hand, if a machine can be demoted just
by kicking out R/W sessions, as ASRO currently does, then maybe
there's not that much difference. Or maybe both designs are subject to
improvement and we can do something even less invasive...

One thing I think people are going to want to do is have the master go
read-only if it loses communication to the rest of the network, to
avoid or at least mitigate split-brain. However, such network
interruptions are often transient, so it might not be uncommon to
briefly go read-only due to a network blip, but then recover quickly
and return to a read-write state. It doesn't seem to matter much
whether that read-only state is a new kind of normal operation (like
what ASRO would do) or whether we've actually returned to a recovery
state (as demote would do) but the collateral effects of the state
change do matter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [patch] demote

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais
 wrote:
> At some expense, Admin can already set the system as readonly from the
> application point of view, using:
>
>   alter system set default_transaction_read_only TO on;
>   select pg_reload_conf();
>
> Current RW xact will finish, but no other will be allowed.

That doesn't block all WAL generation, though:

rhaas=# alter system set default_transaction_read_only TO on;
ALTER SYSTEM
rhaas=# select pg_reload_conf();
 pg_reload_conf

 t
(1 row)
rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
rhaas=#

There's a bunch of other things it also doesn't block, too. If you're
trying to switch to a new primary, you really want to stop WAL
generation completely on the old one. Otherwise, you can't guarantee
that the machine you're going to promote is completely caught up,
which means you might lose some changes, and you might have to
pg_rewind the old master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread Jehan-Guillaume de Rorthais

On Thu, 18 Jun 2020 10:52:49 -0400
Robert Haas  wrote:

[...]
> But what you want is that if it does happen to go down for some reason before
> all the WAL is streamed, you can bring it back up and finish streaming the
> WAL without generating any new WAL.

Thanks to cascading replication, it could be very possible without this READ
ONLY mode, just in recovery mode, isn't it?

Regards,

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 6:39 AM Simon Riggs  wrote:
> That doesn't appear to be very graceful. Perhaps objections could be assuaged 
> by having a smoother transition and perhaps not even a full barrier, 
> initially.

Yeah, it's not ideal, though still better than what we have now. What
do you mean by "a smoother transition and perhaps not even a full
barrier"? I think if you want to switch the primary to another machine
and make the old primary into a standby, you really need to arrest WAL
writes completely. It would be better to make existing write
transactions ERROR rather than FATAL, but there are some very
difficult cases there, so I would like to leave that as a possible
later improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 7:19 AM amul sul  wrote:
> Let me explain the case, if we do skip the end-of-recovery checkpoint while
> starting the system in read-only mode and then later changing the state to
> read-write and do a few write operations and online checkpoints, that will be
> fine? I am yet to explore those things.

I think we'd want the FIRST write operation to be the end-of-recovery
checkpoint, before the system is fully read-write. And then after that
completes you could do other things.

It would be good if we can get an opinion from Andres about this,
since I think he has thought about this stuff quite a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread Robert Haas

On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila  wrote:
> For buffer replacement, many-a-times we have to also perform
> XLogFlush, what do we do for that?  We can't proceed without doing
> that and erroring out from there means stopping read-only query from
> the user perspective.

I think we should stop WAL writes, then XLogFlush() once, then declare
the system R/O. After that there might be more XLogFlush() calls but
there won't be any new WAL, so they won't do anything.

> > But there's no reason for the checkpointer to do it: it shouldn't try
> > to checkpoint, and therefore it shouldn't write dirty pages either.
>
> What is the harm in doing the checkpoint before we put the system into
> READ ONLY state?  The advantage is that we can at least reduce the
> recovery time if we allow writing checkpoint record.

Well, as Andres says in
http://postgr.es/m/20200617180546.yucxtiupvxghx...@alap3.anarazel.de
it can take a really long time.

> > Interesting question. I was thinking that we should probably teach the
> > autovacuum launcher to stop launching workers while the system is in a
> > READ ONLY state, but what about existing workers? Anything that
> > generates invalidation messages, acquires an XID, or writes WAL has to
> > be blocked in a read-only state; but I'm not sure to what extent the
> > first two of those things would be a problem for vacuuming an unlogged
> > table. I think you couldn't truncate it, at least, because that
> > acquires an XID.
> >
>
> If the truncate operation errors out, then won't the system will again
> trigger a new autovacuum worker for the same relation as we update
> stats at the end?

Not if we do what I said in that paragraph. If we're not launching new
workers we can't again trigger a worker for the same relation.

> Also, in general for regular tables, if there is an
> error while it tries to WAL, it could again trigger the autovacuum
> worker for the same relation.  If this is true then unnecessarily it
> will generate a lot of dirty pages and don't think it will be good for
> the system to behave that way?

I don't see how this would happen. VACUUM can't really dirty pages
without writing WAL, can it? And, anyway, if there's an error, we're
not going to try again for the same relation unless we launch new
workers.

> > What I think should happen is that the end-of-recovery checkpoint
> > should be skipped, and then if the system is put back into read-write
> > mode later we should do it then.
>
> But then if we have to perform recovery again, it will start from the
> previous checkpoint.  I think we have to live with it.

Yeah. I don't think it's that bad. The case where you shut down the
system while it's read-only should be a somewhat unusual one. Normally
you would mark it read only and then promote a standby and shut the
old master down (or demote it). But what you want is that if it does
happen to go down for some reason before all the WAL is streamed, you
can bring it back up and finish streaming the WAL without generating
any new WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Mark btree_gist functions as PARALLEL SAFE

2020-06-18 Thread Winfield, Steven

Hi,

On the back of this thread[1] over at pgsql-general, I've attached a patch that 
marks the functions in btree_gist as PARALLEL SAFE.
This is primarily to allow parallel plans to be considered when btree_gist's 
<-> operator is used in any context; for example in an expression
that will be evaluated at execution time, or in a functional column in btree or 
gist indexes.

In the latter example, despite the functions already being marked IMMUTABLE, 
attempts to retrieve precomputed values from a functional index during an index 
scan or index-only scan still require the function to be marked PARALLEL SAFE 
to prevent dropping down to a serial plan.

It requires btree_gist's version to be bumped to 1.6.

In line with this commit[2], and for the same reasons, all functions defined by 
btree_gist are being marked as safe:

---

"... Note that some of the markings added by this commit don't have any
effect; for example, gseg_picksplit() isn't likely to be mentioned
explicitly in a query and therefore it's parallel-safety marking will
never be consulted.  But this commit just marks everything for
consistency: if it were somehow used in a query, that would be fine as
far as parallel query is concerned, since it does not consult any
backend-private state, attempt to write data, etc."

---

I haven't added any more tests, but neither did I find any added with the above 
commit.

"CREATE EXTENSION btree_gist" runs successfully, as does "make check-world".

This is the first patch I've submitted, so if I've omitted something then 
please let me know.
Thanks for your time,
Steven.


[1] 
https://www.postgresql.org/message-id/DB7PR09MB2537E18FF90C1C1BBF49D628FD830%40DB7PR09MB2537.eurprd09.prod.outlook.com
[2] 
https://github.com/postgres/postgres/commit/2910fc8239fa501b662c5459d7ba16a4bc35e7e8


(Apologies if my company's footer appears here)

** Cantab Capital Partners LLP is now named GAM Systematic LLP. Please note 
that our email addresses have changed from @cantabcapital.com to @gam.com.**

This email was sent by and on behalf of GAM Investments. GAM Investments is the 
corporate brand for GAM Holding AG and its direct and indirect subsidiaries. 
These companies may be referred to as ‘GAM’ or ‘GAM Investments’. In the United 
Kingdom, the business of GAM Investments is conducted by GAM (U.K.) Limited 
(No. 01664573) or one or more entities under the control of GAM (U.K.) Limited, 
including the following entities authorised and regulated by the Financial 
Conduct Authority: GAM International Management Limited (No. 01802911), GAM 
London Limited (No. 00874802), GAM Sterling Management Limited (No. 01750352), 
GAM Unit Trust Management Company Limited (No. 2873560) and GAM Systematic LLP 
(No. OC317557). GAM (U.K.) Limited and its regulated entities are registered in 
England and Wales. The registered office and principal place of business of GAM 
(U.K.) Limited and its regulated entities is at 8 Finsbury Circus, London, 
England, EC2M 7GB. The registered office of GAM Systematic LLP is at City 
House, Hills Road, Cambridge, CB2 1RE. This email, and any attachments, is 
confidential and may be privileged or otherwise protected from disclosure. It 
is intended solely for the stated addressee(s) and access to it by any other 
person is unauthorised. If you are not the intended recipient, you must not 
disclose, copy, circulate or in any other way use or rely on the information 
contained herein. If you have received this email in error, please inform us 
immediately and delete all copies of it. See - 
https://www.gam.com/en/legal/email-disclosures-eu/ for further information on 
confidentiality, the risks of non-secure electronic communication, and certain 
disclosures which we are required to make in accordance with applicable 
legislation and regulations. If you cannot access this link, please notify us 
by reply message and we will send the contents to you. GAM Investments will 
collect and use information about you in the course of your interactions with 
us. Full details about the data types we collect and what we use this for and 
your related rights is set out in our online privacy policy at 
https://www.gam.com/en/legal/privacy-policy. Please familiarise yourself with 
this policy and check it from time to time for updates as it supplements this 
notice.
diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile
index a85db35e55..e92d974a1a 100644
--- a/contrib/btree_gist/Makefile
+++ b/contrib/btree_gist/Makefile
@@ -31,7 +31,8 @@ OBJS =  \
 EXTENSION = btree_gist
 DATA = btree_gist--1.0--1.1.sql \
btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \
-   btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql
+   btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \
+   btree_gist--1.5--1.6.sql
 PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes"
 
 REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \
diff --git a/contrib/b

Re: Cleanup - Removal of unused function parameter from CopyReadBinaryAttribute

2020-06-18 Thread Bharath Rupireddy

On Thu, Jun 18, 2020 at 7:01 PM vignesh C  wrote:
>
> Hi,
>
> While checking copy from code I found that the function parameter
> column_no is not used in CopyReadBinaryAttribute. I felt this could be
> removed.
> Attached patch contains the changes for the same.
> Thoughts?
>

I don't see any problem in removing this extra parameter.

However another thought, can it be used to report a bit meaningful
error for field size < 0 check?

if (fld_size < 0)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
 errmsg("invalid field size for column %d", column_no)));

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-18 Thread Justin Pryzby

On Thu, Jun 18, 2020 at 03:37:21PM +1200, David Rowley wrote:
> Now that HashAgg can spill to disk, we see a few more details in
> EXPLAIN ANALYZE than we did previously, e.g. Peak Memory Usage, Disk
> Usage.  However, the new code neglected to make EXPLAIN ANALYZE show
> these new details for parallel workers.
> 
> Additionally, the new properties all were using
> ExplainPropertyInteger() which meant that we got output like:
> 
>Group Key: a
>Peak Memory Usage: 97 kB
>Disk Usage: 3928 kB
>HashAgg Batches: 798
> 
> Where all the properties were on a line by themselves.  This does not
> really follow what other nodes are doing, e.g sort:
> 
>->  Sort (actual time=47.500..59.344 rows=10 loops=1)
>  Sort Key: a
>  Sort Method: external merge  Disk: 2432kB
> 
> Where we stick all the properties on a single line.

Note that "incremental sort" is also new, and splits things up more than sort.

See in particular 6a918c3ac8a6b1d8b53cead6fcb7cbd84eee5750, which splits things
up even more.

->  Incremental Sort (actual rows=70 loops=1)
  Sort Key: t.a, t.b
  Presorted Key: t.a
- Full-sort Groups: 1 Sort Method: quicksort Memory: avg=NNkB peak=NNkB 
Presorted Groups: 5 Sort Methods: top-N heapsort, quicksort Memory: avg=NNkB 
peak=NNkB
+ Full-sort Groups: 1  Sort Method: quicksort  Average Memory: NNkB  
Peak Memory: NNkB
+ Pre-sorted Groups: 5  Sort Methods: top-N heapsort, quicksort  
Average Memory: NNkB  Peak Memory: NNkB

That's not really a "precedent" and I don't think that necessarily invalidates
your change.

-- 
Justin

Cleanup - Removal of unused function parameter from CopyReadBinaryAttribute

2020-06-18 Thread vignesh C

Hi,

While checking copy from code I found that the function parameter
column_no is not used in CopyReadBinaryAttribute. I felt this could be
removed.
Attached patch contains the changes for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From a26d6d45c6f7f8d9effb0965f8893725bda2a288 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Thu, 18 Jun 2020 18:45:52 +0530
Subject: [PATCH] Cleanup, removal of unused function parameter from
 CopyReadBinaryAttribute.

The function parameter column_no is not used in CopyReadBinaryAttribute, this
can be removed.
---
 src/backend/commands/copy.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6d53dc4..6b1fd6d 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -367,8 +367,7 @@ static bool CopyReadLine(CopyState cstate);
 static bool CopyReadLineText(CopyState cstate);
 static int	CopyReadAttributesText(CopyState cstate);
 static int	CopyReadAttributesCSV(CopyState cstate);
-static Datum CopyReadBinaryAttribute(CopyState cstate,
-	 int column_no, FmgrInfo *flinfo,
+static Datum CopyReadBinaryAttribute(CopyState cstate, FmgrInfo *flinfo,
 	 Oid typioparam, int32 typmod,
 	 bool *isnull);
 static void CopyAttributeOutText(CopyState cstate, char *string);
@@ -3776,7 +3775,6 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 	 errmsg("row field count is %d, expected %d",
 			(int) fld_count, attr_count)));
 
-		i = 0;
 		foreach(cur, cstate->attnumlist)
 		{
 			int			attnum = lfirst_int(cur);
@@ -3784,9 +3782,7 @@ NextCopyFrom(CopyState cstate, ExprContext *econtext,
 			Form_pg_attribute att = TupleDescAttr(tupDesc, m);
 
 			cstate->cur_attname = NameStr(att->attname);
-			i++;
 			values[m] = CopyReadBinaryAttribute(cstate,
-i,
 &in_functions[m],
 typioparams[m],
 att->atttypmod,
@@ -4714,8 +4710,7 @@ endfield:
  * Read a binary attribute
  */
 static Datum
-CopyReadBinaryAttribute(CopyState cstate,
-		int column_no, FmgrInfo *flinfo,
+CopyReadBinaryAttribute(CopyState cstate, FmgrInfo *flinfo,
 		Oid typioparam, int32 typmod,
 		bool *isnull)
 {
-- 
1.8.3.1

Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-18 Thread Bruce Momjian

On Thu, Jun 18, 2020 at 04:09:56PM +0530, Amit Kapila wrote:
> You are right and we are not going to claim that after this feature is
> committed.  This feature has independent use cases like it can allow
> parallel copy when foreign tables are involved once we have parallel
> copy and surely there will be more.  I think it is clear that we need
> atomic visibility (some way to ensure global consistency) to avoid the
> data inconsistency problems you and I are worried about and we can do
> that as a separate patch but at this stage, it would be good if we can
> have some high-level design of that as well so that if we need some
> adjustments in the design/implementation of this patch then we can do
> it now.  I think there is some discussion on the other threads (like
> [1]) about the kind of stuff we are worried about which I need to
> follow up on to study the impact.
> 
> Having said that, I don't think that is a reason to stop reviewing or
> working on this patch.

I think our first step is to allow sharding to work on read-only
databases, e.g. data warehousing.  Read/write will require global
snapshots.  It is true that 2PC is limited usefulness without global
snapshots, because, by definition, systems using 2PC are read-write
systems.   However, I can see cases where you are loading data into a
data warehouse but want 2PC so the systems remain consistent even if
there is a crash during loading.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Parallel copy

2020-06-18 Thread vignesh C

On Mon, Jun 15, 2020 at 4:39 PM Bharath Rupireddy
 wrote:
>
> The above tests were run with the configuration attached config.txt, which is 
> the same used for performance tests of csv/text files posted earlier in this 
> mail chain.
>
> Request the community to take this patch up for review along with the 
> parallel copy for csv/text file patches and provide feedback.
>

I had reviewed the patch, few comments:
+
+   /*
+* Parallel copy for binary formatted files
+*/
+   ParallelCopyDataBlock *curr_data_block;
+   ParallelCopyDataBlock *prev_data_block;
+   uint32 curr_data_offset;
+   uint32 curr_block_pos;
+   ParallelCopyTupleInfo  curr_tuple_start_info;
+   ParallelCopyTupleInfo  curr_tuple_end_info;
 } CopyStateData;

 The new members added should be present in ParallelCopyData

+   if (cstate->curr_tuple_start_info.block_id ==
cstate->curr_tuple_end_info.block_id)
+   {
+   elog(DEBUG1,"LEADER - tuple lies in a single data block");
+
+   line_size = cstate->curr_tuple_end_info.offset -
cstate->curr_tuple_start_info.offset + 1;
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].unprocessed_line_parts,
1);
+   }
+   else
+   {
+   uint32 following_block_id =
pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].following_block;
+
+   elog(DEBUG1,"LEADER - tuple is spread across data blocks");
+
+   line_size = DATA_BLOCK_SIZE -
cstate->curr_tuple_start_info.offset -
+
pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].skip_bytes;
+
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].unprocessed_line_parts,
1);
+
+   while (following_block_id !=
cstate->curr_tuple_end_info.block_id)
+   {
+   line_size = line_size + DATA_BLOCK_SIZE -
pcshared_info->data_blocks[following_block_id].skip_bytes;
+
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts,
1);
+
+   following_block_id =
pcshared_info->data_blocks[following_block_id].following_block;
+
+   if (following_block_id == -1)
+   break;
+   }
+
+   if (following_block_id != -1)
+
pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts,
1);
+
+   line_size = line_size + cstate->curr_tuple_end_info.offset + 1;
+   }

line_size can be set as and when we process the tuple from
CopyReadBinaryTupleLeader and this can be set at the end. That way the
above code can be removed.

+
+   /*
+* Parallel copy for binary formatted files
+*/
+   ParallelCopyDataBlock *curr_data_block;
+   ParallelCopyDataBlock *prev_data_block;
+   uint32 curr_data_offset;
+   uint32 curr_block_pos;
+   ParallelCopyTupleInfo  curr_tuple_start_info;
+   ParallelCopyTupleInfo  curr_tuple_end_info;
 } CopyStateData;

curr_block_pos variable is present in ParallelCopyShmInfo, we could
use it and remove from here.
curr_data_offset, similar variable raw_buf_index is present in
CopyStateData, we could use it and remove from here.

+ if (cstate->curr_data_offset + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+ {
+ ParallelCopyDataBlock *data_block = NULL;
+ uint8 movebytes = 0;
+
+ block_pos = WaitGetFreeCopyBlock(pcshared_info);
+
+ movebytes = DATA_BLOCK_SIZE - cstate->curr_data_offset;
+
+ cstate->curr_data_block->skip_bytes = movebytes;
+
+ data_block = &pcshared_info->data_blocks[block_pos];
+
+ if (movebytes > 0)
+ memmove(&data_block->data[0],
&cstate->curr_data_block->data[cstate->curr_data_offset],
+ movebytes);
+
+ elog(DEBUG1, "LEADER - field count is spread across data blocks -
moved %d bytes from current block %u to %u block",
+ movebytes, cstate->curr_block_pos, block_pos);
+
+ readbytes = CopyGetData(cstate, &data_block->data[movebytes], 1,
(DATA_BLOCK_SIZE - movebytes));
+
+ elog(DEBUG1, "LEADER - bytes read from file after field count is
moved to next data block %d", readbytes);
+
+ if (cstate->reached_eof)
+ ereport(ERROR,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("unexpected EOF in COPY data")));
+
+ cstate->curr_data_block = data_block;
+ cstate->curr_data_offset = 0;
+ cstate->curr_block_pos = block_pos;
+ }

This code is duplicate in CopyReadBinaryTupleLeader &
CopyReadBinaryAttributeLeader. We could make a function and re-use.

+/*
+ * CopyReadBinaryAttributeWorker - leader identifies boundaries/offsets
+ * for each attribute/column, it moves on to next data block if the
+ * attribute/column is spread across data blocks.
+ */
+static pg_attribute_always_inline Datum
+CopyReadBinaryAttributeWorker(CopyState cstate, int column_no,
+

Re: min_safe_lsn column in pg_replication_slots view

2020-06-18 Thread Amit Kapila

On Thu, Jun 18, 2020 at 11:52 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 17 Jun 2020 21:37:55 +0900, Fujii Masao  
> wrote in
> > On 2020/06/15 16:35, Kyotaro Horiguchi wrote:
> > Isn't it better to use 1 as the second argument of the above,
> > in order to address the issue that I reported upthread?
> > Otherwise, the WAL file name that pg_walfile_name(min_safe_lsn)
> > returns
> > would be confusing.
>
> Mmm. pg_walfile_name seems too specialize to
> pg_stop_backup(). (pg_walfile_name_offset() returns wrong result for
> segment boundaries.)  I'm not willing to do that only to follow such
> suspicious(?) specification, but surely it would practically be better
> doing that. Please find the attached first patch.
>

It is a little unclear to me how this or any proposed patch will solve
the original problem reported by Fujii-San?  Basically, the problem
arises because we don't have an interlock between when the checkpoint
removes the WAL segment and the view tries to acquire the same.  Am, I
missing something?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [patch] demote

2020-06-18 Thread Fujii Masao





On 2020/06/18 1:29, Robert Haas wrote:

On Wed, Jun 17, 2020 at 11:45 AM Jehan-Guillaume de Rorthais
 wrote:

As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar futur
objectives than mine, I decided to share the humble patch I am playing with to
step down an instance from primary to standby status.


Cool! This was vaguely on my hit list, but neither I nor any of my
colleagues had gotten the time and energy to have a go at it.


Main difference with Amul's patch is that all backends must be disconnected to
process with the demote. Either we wait for them to disconnect (smart) or we
kill them (fast). This makes life much easier from the code point of view, but
much more naive as well. Eg. calling "SELECT pg_demote('fast')" as an admin
would kill the session, with no options to wait for the action to finish, as we
do with pg_promote(). Keeping read only session around could probably be
achieved using global barrier as Amul did, but without all the complexity
related to WAL writes prohibition.

There's still some questions in the current patch. As I wrote, it's an humble
patch, a proof of concept, a bit naive.

Does it worth discussing it and improving it further or do I miss something
obvious in this design that leads to a dead end?


I haven't looked at your code, but I think we should view the two
efforts as complementing each other, not competing. With both patches
in play, a clean switchover would look like this:

- first use ALTER SYSTEM READ ONLY (or whatever we decide to call it)
to make the primary read only, killing off write transactions
- next use pg_ctl promote to promote the standby
- finally use pg_ctl demote (or whatever we decide to call it) to turn
the read-only primary into a standby of the new primary


ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.
2. The orignal standby automatically establishes the cascading replication
   connection with the demoted standby.
3. Wait for all the WAL records available in the demoted standby to be streamed
   to the orignal standby.
4. Promote the original standby to new primary.
5. Change primary_conninfo in the demoted standby so that it establishes
   the replication connection with new primary.

So it seems enough to implement "demote" feature for a clean switchover.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

[PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()

2020-06-18 Thread Vyacheslav Makarov


Hello, hackers.

I would like to propose a patch, which allows passing one extra 
parameter to pg_create_physical_replication_slot() — restart_lsn. It 
could be very helpful if we already have some backup with STOP_LSN from 
a couple of hours in the past and we want to quickly verify wether it is 
possible to create a replica from this backup or not.


If the WAL segment for the specified restart_lsn (STOP_LSN of the 
backup) exists, then the function will create a physical replication 
slot and will keep all the WAL segments required by the replica to catch 
up with the primary. Otherwise, it returns error, which means that the 
required WAL segments have been already utilised, so we do need to take 
a new backup. Without passing this newly added parameter 
pg_create_physical_replication_slot() works as before.


What do you think about this?

--
Vyacheslav Makarov

Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companydiff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ea4c85e3959..c1f971c3fe8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1199,7 +1199,7 @@ AS 'pg_logical_slot_peek_binary_changes';
 
 CREATE OR REPLACE FUNCTION pg_create_physical_replication_slot(
 IN slot_name name, IN immediately_reserve boolean DEFAULT false,
-IN temporary boolean DEFAULT false,
+IN temporary boolean DEFAULT false, IN restart_lsn pg_lsn DEFAULT '0/0'::pg_lsn,
 OUT slot_name name, OUT lsn pg_lsn)
 RETURNS RECORD
 LANGUAGE INTERNAL
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f2fd8f336ed..bbe85974db2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1064,6 +1064,58 @@ ReplicationSlotReserveWal(void)
 	}
 }
 
+/*
+ * Similar to ReplicationSlotReserveWal, but not for the current LSN, but for
+ * the LSN from the past. Creates a physical replication slot if WAL segment
+ * with specified restart_lsn exists.
+ */
+void
+ReplicationSlotReserveHistoryWal(XLogRecPtr restart_lsn)
+{
+	XLogSegNo		segno;
+	XLogRecPtr		restartRedoPtr;
+	TimeLineID		restartTli;
+	char			xlogfname[MAXFNAMELEN];
+	char			*filename;
+	struct stat		buf;
+
+	Assert(MyReplicationSlot != NULL);
+	Assert(MyReplicationSlot->data.restart_lsn == InvalidXLogRecPtr);
+
+	if (!RecoveryInProgress() && !SlotIsLogical(MyReplicationSlot))
+	{
+		XLByteToSeg(restart_lsn, segno, wal_segment_size);
+		GetOldestRestartPoint(&restartRedoPtr, &restartTli);
+		XLogFileName(xlogfname, restartTli, segno, wal_segment_size);
+		filename = psprintf("%s/pg_wal/%s", DataDir, xlogfname);
+		if (stat(filename, &buf) != 0)
+		{
+			pfree(filename);
+			ReplicationSlotDropAcquired();
+			elog(ERROR, "WAL segment %s with specified LSN %X/%X is absent",
+xlogfname, (uint32)(restart_lsn >> 32), (uint32)restart_lsn);
+		}
+		else
+		{
+			SpinLockAcquire(&MyReplicationSlot->mutex);
+			MyReplicationSlot->data.restart_lsn = restart_lsn;
+			SpinLockRelease(&MyReplicationSlot->mutex);
+		}
+
+		/* prevent WAL removal as fast as possible */
+		ReplicationSlotsComputeRequiredLSN();
+
+		if (stat(filename, &buf) != 0)
+		{
+			pfree(filename);
+			ReplicationSlotDropAcquired();
+			elog(ERROR, "WAL segment with specified LSN %X/%X is absent",
+(uint32)(restart_lsn >> 32), (uint32)restart_lsn);
+		}
+		pfree(filename);
+	}
+}
+
 /*
  * Flush all replication slots to disk.
  *
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 3f5944f2ad5..2cae02c06c6 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -44,7 +44,8 @@ check_permissions(void)
  */
 static void
 create_physical_replication_slot(char *name, bool immediately_reserve,
- bool temporary, XLogRecPtr restart_lsn)
+ bool temporary, XLogRecPtr restart_lsn,
+ bool historic_lsn)
 {
 	Assert(!MyReplicationSlot);
 
@@ -55,7 +56,12 @@ create_physical_replication_slot(char *name, bool immediately_reserve,
 	if (immediately_reserve)
 	{
 		/* Reserve WAL as the user asked for it */
-		if (XLogRecPtrIsInvalid(restart_lsn))
+		if (historic_lsn)
+		{
+			Assert(!XLogRecPtrIsInvalid(restart_lsn));
+			ReplicationSlotReserveHistoryWal(restart_lsn);
+		}
+		else if (XLogRecPtrIsInvalid(restart_lsn))
 			ReplicationSlotReserveWal();
 		else
 			MyReplicationSlot->data.restart_lsn = restart_lsn;
@@ -76,12 +82,16 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 	Name		name = PG_GETARG_NAME(0);
 	bool		immediately_reserve = PG_GETARG_BOOL(1);
 	bool		temporary = PG_GETARG_BOOL(2);
+	XLogRecPtr	restart_lsn = PG_GETARG_LSN(3);
 	Datum		values[2];
 	bool		nulls[2];
 	TupleDesc	tupdesc;
 	HeapTuple	tuple;
 	Datum		result;
 
+	if (restart_lsn != InvalidXLogRecPtr && !immediately_reserve)
+		elog(ERROR, "immediately_reserve should not be false when setting restart_lsn");
+
 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_C

Re: factorial of negative numbers

2020-06-18 Thread Juan José Santamaría Flecha

On Thu, Jun 18, 2020 at 1:57 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-06-18 09:43, Juan José Santamaría Flecha wrote:
> > The gamma function from math.h returns a NaN for negative integer
> > values, the postgres factorial function returns a numeric, which allows
> > NaN. Raising an out-of-range error seems only reasonable for an integer
> > output.
>
> But this is not the gamma function.  The gamma function is undefined at
> zero, but factorial(0) returns 1.  So this is similar but not the same.
>

factorial(n) = gamma(n + 1)


> Moreover, functions such as log() also error out on unsupportable input
> values, so it's consistent with the spec.
>

If factorial() ever gets extended to other input types it might get
inconsistent, should !(-1.0) also raise an error?

Logarithm is just different case:

https://en.wikipedia.org/wiki/Logarithm#/media/File:Log4.svg

Regards,

Juan José Santamaría Flecha

Re: factorial of negative numbers

2020-06-18 Thread Peter Eisentraut

On 2020-06-18 09:43, Juan José Santamaría Flecha wrote:

On Thu, Jun 18, 2020 at 9:13 AM Peter Eisentraut 
> wrote:

On 2020-06-16 14:17, Dean Rasheed wrote:
 > I think you're probably right though. Raising an out-of-range error
 > seems like the best option.

committed as proposed then

The gamma function from math.h returns a NaN for negative integer 
values, the postgres factorial function returns a numeric, which allows 
NaN. Raising an out-of-range error seems only reasonable for an integer 
output.

But this is not the gamma function.  The gamma function is undefined at 
zero, but factorial(0) returns 1.  So this is similar but not the same.

Moreover, functions such as log() also error out on unsupportable input 
values, so it's consistent with the spec.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: POC and rebased patch for CSN based snapshots

2020-06-18 Thread Fujii Masao





On 2020/06/15 16:48, Fujii Masao wrote:



On 2020/06/12 18:41, movead...@highgo.ca wrote:

Hello hackers,

Currently, I do some changes based on the last version:
1. Catch up to the current  commit (c2bd1fec32ab54).
2. Add regression and document.
3. Add support to switch from xid-base snapshot to csn-base snapshot,
and the same with standby side.


Probably it's not time to do the code review yet, but when I glanced the patch,
I came up with one question.

0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
(and inserts it into WAL buffers). Which means that new WAL record is generated
whenever CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
really necessary for CSN?

BTW, GenerateCSN() is called while holding ProcArrayLock. Also it inserts new
WAL record in WriteXidCsnXlogRec() while holding spinlock. Firstly this is not
acceptable because spinlocks are intended for *very* short-term locks.
Secondly, I don't think that WAL generation during ProcArrayLock is good
design because ProcArrayLock is likely to be bottleneck and its term should
be short for performance gain.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread amul sul

On Thu, Jun 18, 2020 at 3:25 PM Amit Kapila  wrote:
>
> On Wed, Jun 17, 2020 at 8:12 PM Robert Haas  wrote:
> >
> > On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila  wrote:
> > > Do we prohibit the checkpointer to write dirty pages and write a
> > > checkpoint record as well?  If so, will the checkpointer process
> > > writes the current dirty pages and writes a checkpoint record or we
> > > skip that as well?
> >
> > I think the definition of this feature should be that you can't write
> > WAL. So, it's OK to write dirty pages in general, for example to allow
> > for buffer replacement so we can continue to run read-only queries.
> >
>
> For buffer replacement, many-a-times we have to also perform
> XLogFlush, what do we do for that?  We can't proceed without doing
> that and erroring out from there means stopping read-only query from
> the user perspective.
>
Read-only does not restrict XLogFlush().

> > But there's no reason for the checkpointer to do it: it shouldn't try
> > to checkpoint, and therefore it shouldn't write dirty pages either.
> >
>
> What is the harm in doing the checkpoint before we put the system into
> READ ONLY state?  The advantage is that we can at least reduce the
> recovery time if we allow writing checkpoint record.
>
The checkpoint could take longer, intending to quickly switch to the read-only
state.

> >
> > > What if vacuum is on an unlogged relation?  Do we allow writes via
> > > vacuum to unlogged relation?
> >
> > Interesting question. I was thinking that we should probably teach the
> > autovacuum launcher to stop launching workers while the system is in a
> > READ ONLY state, but what about existing workers? Anything that
> > generates invalidation messages, acquires an XID, or writes WAL has to
> > be blocked in a read-only state; but I'm not sure to what extent the
> > first two of those things would be a problem for vacuuming an unlogged
> > table. I think you couldn't truncate it, at least, because that
> > acquires an XID.
> >
>
> If the truncate operation errors out, then won't the system will again
> trigger a new autovacuum worker for the same relation as we update
> stats at the end?  Also, in general for regular tables, if there is an
> error while it tries to WAL, it could again trigger the autovacuum
> worker for the same relation.  If this is true then unnecessarily it
> will generate a lot of dirty pages and don't think it will be good for
> the system to behave that way?
>
No new autovacuum worker will be forked in the read-only state and existing will
have an error if they try to write WAL after barrier absorption.

> > > > Another part of the patch that quite uneasy and need a discussion is 
> > > > that when the
> > > > shutdown in the read-only state we do skip shutdown checkpoint and at a 
> > > > restart, first
> > > > startup recovery will be performed and latter the read-only state will 
> > > > be restored to
> > > > prohibit further WAL write irrespective of recovery checkpoint succeed 
> > > > or not. The
> > > > concern is here if this startup recovery checkpoint wasn't ok, then it 
> > > > will never happen
> > > > even if it's later put back into read-write mode.
> > >
> > > I am not able to understand this problem.  What do you mean by
> > > "recovery checkpoint succeed or not", do you add a try..catch and skip
> > > any error while performing recovery checkpoint?
> >
> > What I think should happen is that the end-of-recovery checkpoint
> > should be skipped, and then if the system is put back into read-write
> > mode later we should do it then.
> >
>
> But then if we have to perform recovery again, it will start from the
> previous checkpoint.  I think we have to live with it.
>
Let me explain the case, if we do skip the end-of-recovery checkpoint while
starting the system in read-only mode and then later changing the state to
read-write and do a few write operations and online checkpoints, that will be
fine? I am yet to explore those things.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread Amit Kapila

On Wed, Jun 17, 2020 at 9:37 PM Robert Haas  wrote:
>
> On Wed, Jun 17, 2020 at 10:58 AM Tom Lane  wrote:
>
> > Lastly, the arguments in favor seem pretty bogus.  HA switchover normally
> > involves just killing the primary server, not expecting that you can
> > leisurely issue some commands to it first.
>
> Yeah, that's exactly the problem I want to fix. If you kill the master
> server, then you have interrupted service, even for read-only queries.
>

Yeah, but if there is a synchronuos_standby (standby that provide sync
replication), user can always route the connections to it
(automatically if there is some middleware which can detect and route
the connection to standby)

> That sucks. Also, even if you don't care about interrupting service on
> the master, it's actually sorta hard to guarantee a clean switchover.
>

Fair enough.  However, it is not described in the initial email
(unless I have missed it; there is a mention that this patch is one
part of that bigger feature but no further explanation of that bigger
feature) how this feature will allow a clean switchover.  I think
before we put the system into READ ONLY state, there could be some WAL
which we haven't sent to standby, what we do we do for that.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

1 2 >

1 - 100 of 112 matches

Mail list logo