date:20200512

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-05-12 Thread Dilip Kumar

On Tue, May 12, 2020 at 4:39 PM Amit Kapila  wrote:
>
> On Thu, May 7, 2020 at 6:17 PM Dilip Kumar  wrote:
> >
> > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar  wrote:
> >
> > I have fixed one more issue in 0010 patch.  The issue was that once
> > the transaction is serialized due to the incomplete toast after
> > streaming the serialized store was not cleaned up so it was streaming
> > the same tuple multiple times.
> >
>
> I have reviewed a few patches (003, 004, and 005) and below are my comments.

Thanks for the review, I am replying some of the comments where I have
confusion, others are fine.

>
> v20-0003-Extend-the-output-plugin-API-with-stream-methods
> 
> 1.
> +static void
> +pg_decode_stream_change(LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + Relation relation,
> + ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
> +
> +static void
> +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> +   int nrelations, Relation relations[],
> +   ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
>
> In the above and similar APIs, there are parameters like relation
> which are not used.  I think you should add some comments atop these
> APIs to explain why it is so? I guess it is because we want to keep
> them similar to non-stream version of APIs and we can't display
> relation or other information as the transaction is still in-progress.

I think because the interfaces are designed that way because other
decoding plugins might need it e.g. in pgoutput we need change and
relation but not here.  We have other similar examples also e.g.
pg_decode_message has the parameter txn but not used.  Do you think we
still need to add comments?

> 4.
> +static void
> +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_start";
> + /* state.report_location = apply_lsn; */
>
> Why can't we supply the report_location here?  I think here we need to
> report txn->first_lsn if this is the very first stream and
> txn->final_lsn if it is any consecutive one.

I am not sure about this,  Because for the very first stream we will
report the location of the first lsn of the stream and for the
consecutive stream we will report the last lsn in the stream.

>
> 11.
> - * HeapTupleSatisfiesHistoricMVCC.
> + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> + *
> + * We do build the hash table even if there are no CIDs. That's
> + * because when streaming in-progress transactions we may run into
> + * tuples with the CID before actually decoding them. Think e.g. about
> + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> + * yet when applying the INSERT. So we build a hash table so that
> + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> + *
> + * XXX We might limit this behavior to streaming mode, and just bail
> + * out when decoding transaction at commit time (at which point it's
> + * guaranteed to see all CIDs).
>   */
>  static void
>  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> *rb, ReorderBufferTXN *txn)
>   dlist_iter iter;
>   HASHCTL hash_ctl;
>
> - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> - return;
> -
>
> I don't understand this change.  Why would "INSERT followed by
> TRUNCATE" could lead to a tuple which can come for decode before its
> CID?

Actually, even if we haven't decoded the DDL operation but in the
actual system table the tuple might have been deleted from the next
operation.  e.g. while we are streaming the INSERT it is possible that
the truncate has already deleted that tuple and set the max for the
tuple.  So before streaming patch, we were only streaming the INSERT
only on commit so by that time we had got all the operation which has
done DDL and we would have already prepared tuple CID hash.

  The patch has made changes based on this assumption in
> HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> behavior could be dependent on whether we are streaming the changes
> for in-progress xact or at the commit of a transaction.  We might want
> to generate a test to once validate this behavior.

We have already added the test case

Re: Event trigger code comment duplication

2020-05-12 Thread David G. Johnston

On Tuesday, May 12, 2020, Michael Paquier  wrote:

> On Tue, May 12, 2020 at 06:48:51PM -0700, David G. Johnston wrote:
> > Whether its a style thing, or some requirement of the C-language, I found
> > it odd that the four nearly identical checks were left inline in the
> > functions instead of being pulled out into a function.  I've attached a
> > conceptual patch that does just this and more clearly presents on my
> > thoughts on the topic.  In particular I tried to cleanup the quadruple
> > negative sentence (and worse for the whole paragraph) as part of the
> > refactoring of the currentEventTriggerState comment.
>
> You may want to check that your code compiles first :)


I said It was a conceptual patch...the inability to write correct C code
doesn’t wholly impact opinions of general code form.


> +   /*
> +* See EventTriggerDDLCommandStart for a discussion about why event
> +* triggers are disabled in single user mode.
> +*/
> +   if (!IsUnderPostmaster)
> +   return false;
> And here I am pretty sure that you don't want to remove the
> explanation why event triggers are disabled in standalone mode.


The full comment should have remained in the common function...so it moved
but wasn’t added or removed so not visible...in hindsight diff mode may
have been a less than ideal choice here.  Or I may have fat-fingered the
copy-paste...


>
> Note the reason why we don't expect a state being set for
> ddl_command_start is present in EventTriggerBeginCompleteQuery():
> /*
>  * Currently, sql_drop, table_rewrite, ddl_command_end events are the
> only
>  * reason to have event trigger state at all; so if there are none,
> don't
>  * install one.
>  */


Thanks


>
> Even with all that, I am not sure that we need to complicate further
> what we have here.  An empty currentEventTriggerState gets checks in
> three places, and each one of them has a slight different of the
> reason why we cannot process further, so I would prefer applying my
> previous, simple patch if there are no objections to remove the
> duplication about event triggers with standalone mode, keeping the
> explanations local to each event trigger type, and call it a day.
>

I’ll defer at this point - though maybe keep/improve the fix for the
quadruple negative and related commentary.

David J.

Re: PG 13 release notes, first draft

2020-05-12 Thread Fabien COELHO




Hello Bruce,


 * e1ff780485


I was told in this email thread to not include that one.


Ok.


 * 34a0a81bfb


We already have:

Reformat tables containing function information for better
clarity (Tom Lane)

so it seems it is covered as part of this.


AFAICR this one is not by the same author, and although the point was 
about better clarity, it was not about formating but rather about 
restructuring text vs binary string function documentations. Then Tom 
reformatted the result.



 * e829337d42


Uh, this is a doc link formatting addition.  I think this falls into the
error message logic, where it is nice when people want it, but they
don't need to know about it ahead of time.


Hmmm. ISTM that this is not really about "error message logic", it is 
about navigating to libpq functions when one is reference in the 
description of another to check what it does, which I had to do a lot 
while developing some performance testing code for a project.



 * "Document color support (Peter Eisentraut)"
   THIS WAS NOT DOCUMENTED BEFORE?

Not as such, AFAICR it was vaguely hinted about in the documentation of
command that would use it, but not even all of them. Now there is a new
specific section.


Again, this is the first hash you gave.


Possibly, but as the "THIS WAS NOT DOCUMENTED BEFORE?" question seemed to 
still be in the release notes, I gathered that the information had not 
reached its destination, hence the possible repetition. But maybe the 
issue is that this answer is not satisfactory. Sorry for the 
inconvenience.


--
Fabien.

Re: Event trigger code comment duplication

2020-05-12 Thread Michael Paquier

On Tue, May 12, 2020 at 06:48:51PM -0700, David G. Johnston wrote:
> Whether its a style thing, or some requirement of the C-language, I found
> it odd that the four nearly identical checks were left inline in the
> functions instead of being pulled out into a function.  I've attached a
> conceptual patch that does just this and more clearly presents on my
> thoughts on the topic.  In particular I tried to cleanup the quadruple
> negative sentence (and worse for the whole paragraph) as part of the
> refactoring of the currentEventTriggerState comment.

You may want to check that your code compiles first :)

+bool
+EventTriggerValidContext(bool requireState)
+{
[...]
-   if (!IsUnderPostmaster)
-   return;
+   if (!EventTriggerValidContext(true))
+   return
EventTriggerValidContext() should be static, and the code as written
simply would not compile.

+   if (requireState) {
+   /*
+   * Only move forward if our state is set up.  This is required
+   * to handle concurrency - if we proceed, without state already set 
up,
+   * and allow EventTriggerCommonSetup to run it may find triggers that
+   * didn't exist when the command started.
+   */
+   if (!currentEventTriggerState)
+   return false;
+   }
Comment format and the if block don't have a format consistent with
the project.

+   /*
+* See EventTriggerDDLCommandStart for a discussion about why event
+* triggers are disabled in single user mode.
+*/
+   if (!IsUnderPostmaster)
+   return false;
And here I am pretty sure that you don't want to remove the
explanation why event triggers are disabled in standalone mode.

Note the reason why we don't expect a state being set for
ddl_command_start is present in EventTriggerBeginCompleteQuery():
/*
 * Currently, sql_drop, table_rewrite, ddl_command_end events are the only
 * reason to have event trigger state at all; so if there are none, don't
 * install one.
 */

Even with all that, I am not sure that we need to complicate further
what we have here.  An empty currentEventTriggerState gets checks in
three places, and each one of them has a slight different of the
reason why we cannot process further, so I would prefer applying my
previous, simple patch if there are no objections to remove the
duplication about event triggers with standalone mode, keeping the
explanations local to each event trigger type, and call it a day.
--
Michael


signature.asc
Description: PGP signature

Re: Problem with logical replication

2020-05-12 Thread Dilip Kumar

On Wed, May 13, 2020 at 6:15 AM Euler Taveira
 wrote:
>
> On Tue, 12 May 2020 at 06:36, Masahiko Sawada 
>  wrote:
>>
>> On Mon, 11 May 2020 at 16:28, Michael Paquier  wrote:
>> >
>> > On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:
>> > > I attached a patch with the described solution. I also included a test 
>> > > that
>> > > covers this scenario.
>> >
>> > -   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
>> > +   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));
>> >
>> > Not much a fan of adding a routine to relcache.c to do the work of two
>> > routines already present, so I think that we had better add an extra
>> > condition based on RelationGetPrimaryKeyIndex, and give up on
>> > GetRelationIdentityOrPK() in execReplication.c.
>>
> Although, I think this solution is fragile, I updated the patch accordingly.
> (When/If someone changed GetRelationIdentityOrPK() it will break this assert)
>
>>
>> In any case, it seems to me that the comment of
>> build_replindex_scan_key needs to be updated.
>>
>>  * This is not generic routine, it expects the idxrel to be replication
>>  * identity of a rel and meet all limitations associated with that.
>>
> It is implicit that a primary key can be a replica identity so I think this
> comment is fine.

I like your idea of modifying the assert instead of completely removing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Do I understand commit timestamps correctly?

2020-05-12 Thread Chapman Flack

On 03/23/18 11:40, Alvaro Herrera wrote:
> Chapman Flack wrote:
>> ? Given a base backup and a bunch of WAL from a cluster that had
>>   track_commit_timestamps turned off, is it possible (in principle?)
>>   to do a PITR with the switch turned on, and have the commit_ts
>>   cache get populated (at least from the transactions encountered
>>   in the WAL)? Could that be done by changing the setting in
>>   postgresql.conf for the recovery, or would it take something more
>>   invasive, like poking the value in pg_control? Or would that just
>>   make something fall over? Would it require dummying up some commit_ts
>>   files first?
> 
> I don't remember if this is explicitly supported, but yeah AFAIR it
> should work to just start the "promoted standby" (in your case just a
> recovered backup) on after setting the option in postgresql.conf.  This
> is because StartupCommitTs() activates the commit_ts module just before
> starting recovery.

Getting around to trying it out, simply changing the setting in
postgresql.conf before starting the server does not seem sufficient:
once it comes online, it has track_commit_timestamp on, but has not
populated the cache from transactions it applied during recovery.

On the other hand, changing the setting in postgresql.conf *and*
poking a 1 in the track_commit_timestamp byte in pg_control,
and fudging the CRC accordingly, *then* starting the server, does
seem to do just as I had hoped. Nothing seems to complain or fall over,
and the transactions recovered from WAL now have timestamps visible
with pg_xact_commit_timestamp().

Regards,
-Chap

Re: refactoring basebackup.c

2020-05-12 Thread Suraj Kharage

Hi,

Did some performance testing by varying TAR_SEND_SIZE with Robert's
refactor patch and without the patch to check the impact.

Below are the details:

*Backup type*: local backup using pg_basebackup
*Data size*: Around 200GB (200 tables - each table around 1.05 GB)
*different TAR_SEND_SIZE values*: 8kb, 32kb (default value), 128kB, 1MB (
1024kB)

*Server details:*
RAM: 500 GB CPU details: Architecture: x86_64 CPU op-mode(s): 32-bit,
64-bit Byte Order: Little Endian CPU(s): 128 Filesystem: ext4

8kb 32kb (default value) 128kB 1024kB
Without refactor patch real 10m22.718s
user 1m23.629s
sys 8m51.410s real 8m36.245s
user 1m8.471s
sys 7m21.520s real 6m54.299s
user 0m55.690s
sys 5m46.502s real 18m3.511s
user 1m38.197s
sys 9m36.517s
With refactor patch (Robert's patch) real 10m11.350s
user 1m25.038s
sys 8m39.226s real 8m56.226s
user 1m9.774s
sys 7m41.032s real 7m26.678s
user 0m54.833s
sys 6m20.057s real 18m17.230s
user 1m42.749s
sys 9m53.704s

The above numbers are taken from the minimum of two runs of each scenario.

I can see, when we have TAR_SEND_SIZE as 32kb or 128kb, it is giving us a
good performance whereas, for 1Mb it is taking 2.5x more time.

Please let me know your thoughts/suggestions on the same.

-- 
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.

Re: BUG #16419: wrong parsing BC year in to_date() function

2020-05-12 Thread Laurenz Albe

On Tue, 2020-05-12 at 18:09 -0700, David G. Johnston wrote:
> Redirecting to -hackers for visibility.  I feel there needs to be something 
> done here, even if just documentation (a bullet in the usage notes section - 
> and a code comment update for the macro)
> pointing this out and not changing any behavior.
> 
> David J.
> 
> On Wed, May 6, 2020 at 8:12 PM David G. Johnston  
> wrote:
> > ‪On Wed, May 6, 2020 at 6:31 PM ‫دار الآثار للنشر والتوزيع-صنعاء Dar 
> > Alathar-Yemen‬‎  wrote:‬
> > > Any one suppose that these functions return the same:
> > > make_date(-1,1,1)
> > > to_date('-1-01-01','-mm-dd')
> > > 
> > > But make_date will give 0001-01-01 BC
> > > 
> > > And to_date will give 0002-01-01 BC
> > > 
> > > 
> > > 
> > > 
> > 
> > Interesting...and a fair point.
> > 
> > What seems to be happening here is that to_date is trying to be helpful by 
> > doing:
> > 
> > select to_date('',''); // 0001-01-01 BC
> > 
> > It does this seemingly by subtracting one from the year, making it 
> > positive, then (I infer) appending "BC" to the result.  Thus for the year 
> > "-1" it yields "0002-01-01 BC"
> > 
> > make_date just chooses to reject the year 0 and treat the negative as an 
> > alternative to specifying BC
> > 
> > There seems to be zero tests for to_date involving negative years, and the 
> > documentation doesn't talk of them.
> > 
> > I'll let the -hackers speak up as to how they want to go about handling 
> > to_date (research how it behaves in the other database it tries to emulate 
> > and either document or possibly change the
> > behavior in v14) but do suggest that a simple explicit description of how 
> > to_date works in the presence of negative years be back-patched.  A bullet 
> > in the usage notes section probably suffices:
> > 
> > "If a  format string captures a negative year, or , it will treat 
> > it as a BC year after decreasing the value by one.  So  maps to 1 BC 
> > and -1 maps to 2 BC and so on."
> > 
> > So, no, make_date and to_date do not agree on this point; and they do not 
> > have to.  There is no way to specify "BC" in make_date function so using 
> > negative there makes sense.  You can specify BC
> > in the input string for to_date and indeed that is the only supported 
> > (documented) way to do so.
> > 
> > 
> 
>  
> [and the next email]
>  
> > Specifically:
> > 
> > https://github.com/postgres/postgres/blob/fb544735f11480a697fcab791c058adc166be1fa/src/backend/utils/adt/formatting.c#L236
> > 
> > /*
> >  * There is no 0 AD.  Years go from 1 BC to 1 AD, so we make it
> >  * positive and map year == -1 to year zero, and shift all negative
> >  * years up one.  For interval years, we just return the year.
> >  */
> > #define ADJUST_YEAR(year, is_interval) ((is_interval) ? (year) : ((year) <= 
> > 0 ? -((year) - 1) : (year)))
> > 
> > The code comment took me a bit to process - seems like the following would 
> > be better (if its right - I don't know why interval is a pure no-op while 
> > non-interval normalizes to a positive integer).
> > 
> > Years go from 1 BC to 1 AD, so we adjust the year zero, and all negative 
> > years, by shifting them away one year,  We then return the positive value 
> > of the result because the caller tracks the BC/AD
> > aspect of the year separately and only deals with positive year values 
> > coming out of this macro.  Intervals denote the distance away from 0 a year 
> > is so we can simply take the supplied value and
> > return it.  Interval processing code expects a negative result for 
> > intervals going into BC.
> > 
> > David J.

Since "to_date" is an Oracle compatibility function, here is what Oracle 18.4 
has to say to that:

SQL> SELECT to_date('', '') FROM dual;
SELECT to_date('', '') FROM dual
   *
ERROR at line 1:
ORA-01841: (full) year must be between -4713 and +, and not be 0


SQL> SELECT to_date('-0001', '') FROM dual;
SELECT to_date('-0001', '') FROM dual
   *
ERROR at line 1:
ORA-01841: (full) year must be between -4713 and +, and not be 0


SQL> SELECT to_date('-0001', 'S') FROM dual;

TO_DATE('-0001','S
--
0001-05-01 00:00:00 BC

Yours,
Laurenz Albe

Re: making update/delete of inheritance trees scale better

2020-05-12 Thread Amit Langote

On Tue, May 12, 2020 at 9:54 PM Ashutosh Bapat
 wrote:
> On Mon, May 11, 2020 at 8:11 PM Amit Langote  wrote:
> > > Per row overhead would be incurred for every row whereas the plan time
> > > overhead is one-time or in case of a prepared statement almost free.
> > > So we need to compare it esp. when there are 2000 partitions and all
> > > of them are being updated.
> >
> > I assume that such UPDATEs would be uncommon.
>
> Yes, 2000 partitions being updated would be rare. But many rows from
> the same partition being updated may not be that common. We have to
> know how much is that per row overhead and updating how many rows it
> takes to beat the planning time overhead. If the number of rows is
> very large, we are good.

Maybe I am misunderstanding you, but the more the rows to update, the
more overhead we will be paying with the new approach.

> > > Can we plan the scan query to add a sort node to order the rows by 
> > > tableoid?
> >
> > Hmm, I am afraid that some piece of partitioning code that assumes a
> > certain order of result relations, and that order is not based on
> > sorting tableoids.
>
> I am suggesting that we override that order (if any) in
> create_modifytable_path() or create_modifytable_plan() by explicitly
> ordering the incoming paths on tableoid. May be using MergeAppend.

So, we will need to do 2 things:

1. Implicitly apply an ORDER BY tableoid clause
2. Add result relation RTIs to ModifyTable.resultRelations in the
order of their RTE's relid.

Maybe we can do that as a separate patch.  Also, I am not sure if it
will get in the way of someone wanting to have ORDER BY LIMIT for
updates.

> > > * Tuple re-routing during UPDATE. For now it's disabled so your design
> > > should work. But we shouldn't design this feature in such a way that
> > > it comes in the way to enable tuple re-routing in future :).
> >
> > Sorry, what is tuple re-routing and why does this new approach get in its 
> > way?
>
> An UPDATE causing a tuple to move to a different partition. It would
> get in its way since the tuple will be located based on tableoid,
> which will be the oid of the old partition. But I think this approach
> has higher chance of being able to solve that problem eventually
> rather than the current approach.

Again, I don't think I understand.   We do currently (as of v11)
re-route tuples when UPDATE causes them to move to a different
partition, which, gladly, continues to work with my patch.

So how it works is like this: for a given "new" tuple, ExecUpdate()
checks if the tuple would violate the partition constraint of the
result relation that was passed along with the tuple.  If it does, the
new tuple will be moved, by calling ExecDelete() to delete it from the
current relation, followed by ExecInsert() to find the new home for
the tuple.  The only thing that changes with the new approach is how
ExecModifyTable() chooses a result relation to pass to ExecUpdate()
for a given "new" tuple it has fetched from the plan, which is quite
independent from the tuple re-routing mechanism proper.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: refactoring basebackup.c

2020-05-12 Thread Dilip Kumar

On Wed, May 13, 2020 at 1:56 AM Robert Haas  wrote:
>
> On Tue, May 12, 2020 at 4:32 AM Dilip Kumar  wrote:
> > Some of the bbsink_libpq_* functions are directly calling pq_* e.g.
> > bbsink_libpq_begin_backup whereas others are calling SendCopy*
> > functions and therein those are calling pq_* functions.  I think
> > bbsink_libpq_* function can directly call pq_* functions instead of
> > adding one more level of the function call.
>
> I think all the helper functions have more than one caller, though.
> That's why I created them - to avoid duplicating code.

You are right, somehow I missed that part.  Sorry for the noise.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: new heapcheck contrib module

2020-05-12 Thread Peter Geoghegan

On Tue, May 12, 2020 at 7:07 PM Mark Dilger
 wrote:
> Thank you yet again for reviewing.  I really appreciate the feedback!

Happy to help. It's important work.

> Ok, I take your point that the code cannot soldier on after the first error 
> is returned.  I'll change that for v6 of the patch, moving on to the next 
> relation after hitting the first corruption in any particular index.  Do you 
> mind that I refactored the code to return the error rather than ereporting?

try/catch seems like the way to do it. Not all amcheck errors come
from amcheck -- some are things that the backend code does, that are
known to appear in amcheck from time to time. I'm thinking in
particular of the
table_index_build_scan()/heapam_index_build_range_scan() errors, as
well as the errors from _bt_checkpage().

> Yes, I agree that reindexing is the most sensible remedy.  I certainly have 
> no plans to implement some pg_fsck_index type tool.  Even for tables, I'm not 
> interested in creating such a tool. I just want a good tool for finding out 
> what the nature of the corruption is, as that might make it easier to debug 
> what went wrong.  It's not just for debugging production systems, but also 
> for chasing down problems in half-baked code prior to release.

All good goals.

>  * check_tuphdr_xids

> The point is that when checking the table for corruption I avoid calling 
> anything that might assert (or segfault, or whatever).

I don't think that you can expect to avoid assertion failures in
general. I'll stick with your example. You're calling
TransactionIdDidCommit() from check_tuphdr_xids(), which will
interrogate the commit log and pg_subtrans. It's just not under your
control. I'm sure that you could get an assertion failure somewhere in
there, and even if you couldn't that could change at any time.

You've quasi-duplicated some sensitive code to do that much, which
seems excessive. But it's also not enough.

> I'm starting to infer from your comments that you see the landmine detection 
> vehicle as also driving at high speed, detecting landmines on occasion by 
> seeing them first, but frequently by failing to see them and just blowing up.

That's not it. I would certainly prefer if the landmine detector
didn't blow up. Not having that happen is certainly a goal I share --
that's why PageGetItemIdCareful() exists. But not at any cost,
especially not when "blow up" means an assertion failure that users
won't actually see in production. Avoiding assertion failures like the
one you showed is likely to have a high cost (removing defensive
asserts in low level access method code) for a low benefit. Any
attempt to avoid having the checker itself blow up rather than throw
an error message needs to be assessed pragmatically, on a case-by-case
basis.

> One of the delays in submitting the most recent version of the patch is that 
> I was having trouble creating a reliable, portable btree corrupting 
> regression test.

To be clear, I think that corrupting data is very helpful with ad-hoc
testing during development.

> I did however address (some?) issues that you and others mentioned about the 
> table corrupting regression test.  Perhaps there are remaining issues that 
> will show up on machines with different endianness than I have thus far 
> tested, but I don't see that they will be insurmountable.  Are you 
> fundamentally opposed to that test framework?

I haven't thought about it enough just yet, but I am certainly suspicious of it.

-- 
Peter Geoghegan

Re: PG 13 release notes, first draft

2020-05-12 Thread Kyotaro Horiguchi

At Tue, 12 May 2020 16:38:09 -0400, Bruce Momjian  wrote in 
> On Tue, May 12, 2020 at 01:09:08PM +0900, Kyotaro Horiguchi wrote:
> > > > commit c6b9204
> > > > Author: Noah Misch 
> > > > AuthorDate: Sat Apr 4 12:25:34 2020 -0700
> > > > Commit: Noah Misch 
> > > > CommitDate: Sat Apr 4 12:25:34 2020 -0700
> > > > 
> > > > Skip WAL for new relfilenodes, under wal_level=minimal.
> > > > 
> > > > Until now, only selected bulk operations (e.g. COPY) did this.  If a
> > > > given relfilenode received both a WAL-skipping COPY and a WAL-logged
> > > > operation (e.g. INSERT), recovery could lose tuples from the COPY.  
> > > > See
> > > > src/backend/access/transam/README section "Skipping WAL for New
> > > > RelFileNode" for the new coding rules.  Maintainers of table access
> > > > methods should examine that section.
> > > 
> > > OK, so how do we want to document this?  Do I mention in the text below
> > > the WAL skipping item that this fixes a bug where a mix of simultaneous
> > > COPY and INSERT into a table could lose rows during crash recovery, or
> > > create a new item?
> > 
> > FWIW, as dicussed upthread, I suppose that the API change is not going
> > to be in relnotes.
> > 
> > something like this?
> > 
> > - Fix bug of WAL-skipping optimiazation 
> > 
> > Previously a trasaction doing both of COPY and a WAL-logged operations
> > like INSERT while wal_level=minimal can lead to loss of COPY'ed rows
> > through crash recovery.  Also this fix extends the WAL-skipping
> > optimiazation to all kinds of bulk insert operations.
> 
> Uh, that kind of mixes the bug fix and the feature in a way that it is
> hard to understand.  How about this?
> 
>   Allow skipping of WAL for new tables and indexes if wal_level is
>   'minimal' (Kyotaro Horiguchi)
> 
>   Relations larger than wal_skip_threshold will have their files
>   fsync'ed rather than writing their WAL records.  Previously this
>   was done only for COPY operations, but the implementation had a
>   bug that could cause data loss during crash recovery.

I see it. It is giving weight on improvement. Looks good the overall
structure of the description above.  However, wal-skipping is always
done regardless of table size. wal_skip_threshold is an optimization
to choose which to use fsync or FPI records (that is, not WAL records
in the common sense) at commit for speed.

So how about the following?

All kinds of bulk-insertion are not WAL-logged then fsync'ed at
commit.  Using FPI WAL records instead of fsync for relations smaller
than wal_skip_threshold. Previously this was done only for COPY
operations and always using fsync, but the implementation had a bug
that could cause data loss during crash recovery.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

Isaac Morland  writes:
> ...  I'm wondering because it seems
> like it might be helpful to have a system view which gives all the wait
> event types, names, and descriptions. Maybe even add a column for which
> extension (or core) it came from. The documentation could then just explain
> the general situation and point people at the system view to see exactly
> which wait types exist in their system.

There's certainly an argument for doing things like that, but I think it'd
be a net negative in terms of quality and consistency of documentation.
We'd basically be deciding those are non-goals.

Of course, ripping out table 27.4 altogether would be a simple solution
to the formatting problem I started with ;-).  But it doesn't really
seem like the answer we want.

> Of course if the names get passed in ad hoc then such a view could only
> show the types that happen to have been created up to the moment it is
> queried, which would defeat the purpose.

Yes, exactly.

I don't actually understand why the LWLock tranche mechanism is designed
the way it is.  It seems to be intended to support different backends
having different sets of LWLocks, but I fail to see why that's a good idea,
or even useful at all.  In any case, dynamically-created LWLocks are
clearly out of scope for the documentation.  The problem that I'm trying
to deal with right now is that even LWLocks that are hard-wired into the
backend code are difficult to enumerate.  That wasn't a problem before
we decided we needed to expose them all to user view; but now it is.

regards, tom lane

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

Thomas Munro  writes:
> On Wed, May 13, 2020 at 3:16 AM Tom Lane  wrote:
>> Hash/Batch/Allocating
>> Hash/Batch/Electing
>> Hash/Batch/Loading
>> Hash/GrowBatches/Allocating

> Perhaps we should also drop the 'ing' from the verbs, to be more like
> ...Read etc.

Yeah, that aspect was bothering me too.  Comparing these to other
wait event names, you could make a case for either "Allocate" or
"Allocation"; but there are no other names with -ing.

regards, tom lane

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Thomas Munro

On Wed, May 13, 2020 at 3:16 AM Tom Lane  wrote:
> And then somebody else, unwilling to use either of those styles,
> thought it would be cute to do
>
> Hash/Batch/Allocating
> Hash/Batch/Electing
> Hash/Batch/Loading
> Hash/GrowBatches/Allocating

Perhaps we should also drop the 'ing' from the verbs, to be more like
...Read etc.

Re: Inefficiency in SLRU stats collection

2020-05-12 Thread Kyotaro Horiguchi

At Tue, 12 May 2020 15:50:35 -0400, Tom Lane  wrote in 
> I happened to come across this code added by 28cac71bd:
> 
> static PgStat_MsgSLRU *
> slru_entry(SlruCtl ctl)
> {
> intidx = pgstat_slru_index(ctl->shared->lwlock_tranche_name);
> 
> Assert((idx >= 0) && (idx < SLRU_NUM_ELEMENTS));
> 
> return &SLRUStats[idx];
> }
> 
> which is invoked by all the pgstat_count_slru_XXX routines.
> This seems mightily inefficient --- the count functions are
> just there to increment integer counters, but first they
> have to do up to half a dozen strcmp's to figure out which
> counter to increment.
> 
> We could improve this by adding another integer field to
> SlruSharedData (which is already big enough that no one
> would notice) and recording the result of pgstat_slru_index()
> there as soon as the lwlock_tranche_name is set.  (In fact,
> it looks like we could stop saving the tranche name as such
> altogether, thus buying back way more shmem than the integer
> field would occupy.)

I noticed that while trying to move that stuff into shmem-stats patch.

I think we can get rid of SlruCtl->shared->lwlock_tranche_name since
the only user is the slru_entry() and no external modules don't look
into that depth and there's a substitute way to know the name for
them.

> This does require the assumption that all backends agree
> on the SLRU stats index for a particular SLRU cache.  But
> AFAICS we're locked into that already, since the backends
> use those indexes to tell the stats collector which cache
> they're sending stats for.
> 
> Thoughts?

AFAICS it is right and the change suggested looks reasonable to me.
One arguable point might be whether it is right that SlruData holds
pgstats internal index from the standpoint of modularity.  (It is one
of the reasons I didn't propose a patch for that..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: tablespace_map code cleanup

2020-05-12 Thread Amit Kapila

On Wed, May 13, 2020 at 1:54 AM Robert Haas  wrote:
>
> On Tue, May 12, 2020 at 2:23 AM Amit Kapila  wrote:
> > While looking at this, I noticed that caller (perform_base_backup) of
> > do_pg_start_backup, sets the backup phase as
> > PROGRESS_BASEBACKUP_PHASE_WAIT_CHECKPOINT whereas, in
> > do_pg_start_backup, we do collect the information about all
> > tablespaces after the checkpoint.  I am not sure if it is long enough
> > that we consider having a separate phase for it.   Without your patch,
> > it was covered under PROGRESS_BASEBACKUP_PHASE_ESTIMATE_BACKUP_SIZE
> > phase which doesn't appear to be a bad idea.
>
> Maybe I'm confused here, but I think the size estimation still *is*
> covered under PROGRESS_BASEBACKUP_PHASE_ESTIMATE_BACKUP_SIZE. It's
> just that now that happens a bit later.
>

There is no problem with this part.

> I'm assuming that listing the
> tablespaces is pretty cheap, but sizing them is expensive, as you'd
> have to iterate over all the files and stat() each one.
>

I was trying to say that tablespace listing will happen under
PROGRESS_BASEBACKUP_PHASE_WAIT_CHECKPOINT phase which could be a
problem if it is a costly operation but as you said it is pretty cheap
so I think we don't need to bother about that.

Apart from the above point which I think we don't need to bother, both
your patches look good to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: new heapcheck contrib module

2020-05-12 Thread Mark Dilger

> On May 12, 2020, at 5:34 PM, Peter Geoghegan  wrote:
> 
> On Mon, May 11, 2020 at 10:21 AM Mark Dilger
>  wrote:
>> 2) amcheck's btree checking functions have been refactored to be able to 
>> operate in two modes; the original mode in which all errors are reported via 
>> ereport, and a new mode for returning errors as rows from a set returning 
>> function.

Thank you yet again for reviewing.  I really appreciate the feedback!

> Somebody suggested that I make amcheck work in this way during its
> initial development. I rejected that idea at the time, though. It
> seems hard to make it work because the B-Tree index scan is a logical
> order index scan. It's quite possible that a corrupt index will have
> circular sibling links, and things like that. Making everything an
> error removes that concern. There are clearly some failures that we
> could just soldier on from, but the distinction gets rather blurred.

Ok, I take your point that the code cannot soldier on after the first error is 
returned.  I'll change that for v6 of the patch, moving on to the next relation 
after hitting the first corruption in any particular index.  Do you mind that I 
refactored the code to return the error rather than ereporting?  If it offends 
your sensibilities, I could rip that back out, at the expense of having to use 
try/catch logic in some other places.  I prefer to avoid the try/catch stuff, 
but I'm not going to put up a huge fuss.

> I understand why you want to do it this way. It makes sense that the
> heap stuff would report all inconsistencies together, at the end. I
> don't think that that's really workable (or even desirable) in the
> case of B-Tree indexes, though. When an index is corrupt, the solution
> is always to do root cause analysis, to make sure that the issue does
> not recur, and then to REINDEX. There isn't really a question about
> doing data recovery of the index structure.

Yes, I agree that reindexing is the most sensible remedy.  I certainly have no 
plans to implement some pg_fsck_index type tool.  Even for tables, I'm not 
interested in creating such a tool. I just want a good tool for finding out 
what the nature of the corruption is, as that might make it easier to debug 
what went wrong.  It's not just for debugging production systems, but also for 
chasing down problems in half-baked code prior to release.

> Would it be possible to log the first B-Tree inconsistency, and then
> move on to the next high-level phase of verification? You don't have
> to throw an error, but it seems like a good idea for amcheck to still
> give up on further verification of the index.

Ok, good, it sounds like we're converging on the same idea.  I'm happy to do so.

> The assertion failure that you reported happens because of a generic
> assertion made from _bt_compare(). It doesn't have anything to do with
> amcheck (you'll see the same thing from regular index scans), really.

Oh, I know that already.  I could see that easily enough in the backtrace.  But 
if you look at the way I implemented verify_heapam, you might notice this:

/*
 * check_tuphdr_xids
 *
 *  Determine whether tuples are visible for verification.  Similar to
 *  HeapTupleSatisfiesVacuum, but with critical differences.
 *
 *  1) Does not touch hint bits.  It seems imprudent to write hint bits
 * to a table during a corruption check.
 *  2) Only makes a boolean determination of whether verification should
 * see the tuple, rather than doing extra work for vacuum-related
 * categorization.
 *
 *  The caller should already have checked that xmin and xmax are not out of
 *  bounds for the relation.
 */

The point is that when checking the table for corruption I avoid calling 
anything that might assert (or segfault, or whatever).  I was talking about 
refactoring the btree checking code to be similarly careful.

> I think that removing that assertion would be the opposite of
> hardening. Even if you removed it, the backend will still crash once
> you come up with a slightly more evil index tuple. Maybe *that* could
> be mostly avoided with widespread hardening; we could in principle
> perform cross-checks of varlena headers against the tuple or page
> layout at any point reachable from _bt_compare(). That seems like
> something that would have unacceptable overhead, because the cost
> would be imposed on everything. And even then you've only ameliorated
> the problem.

I think we may have different mental models of how this all works in practice.  
I am (or was) envisioning that the backend, during regular table and index 
scans, cannot afford to check for corruption at all steps along the way, and 
therefore does not, but that a corruption checking tool has a fundamentally 
different purpose, and can and should choose to operate in a way that won't 
blow up when checking a corrupt relation.  It's the difference between a car 
designed to drive down the highway at high speed vs. a military vehicle 
designed to drive over a

Re: Event trigger code comment duplication

2020-05-12 Thread David G. Johnston

On Mon, May 11, 2020 at 11:30 PM Michael Paquier 
wrote:

> The second point about the check with (!currentEventTriggerState) in
> EventTriggerTableRewrite() and EventTriggerDDLCommandEnd() shows that
> both comments share the same first sentence, but there is enough
> different context to just keep them as separate IMO.
>

Went back and looked this over - the comment differences in the check for
currentEventTriggerState boil down to:

the word "required" vs "important" - either works for both.

the fact that the DDLCommandEnd function probably wouldn't crash absent the
check - which while I do not know whether DDLTriggerRewrite would crash for
certain (hence the "required") as a practical matter it is required (and
besides if keeping note of which of these would crash or not is deemed
important that can be commented upon specifically in each - both
DDLCommandStart (which lacks the check altogether...) and SQLDrop both
choose not to elaborate on that point at all.

Whether its a style thing, or some requirement of the C-language, I found
it odd that the four nearly identical checks were left inline in the
functions instead of being pulled out into a function.  I've attached a
conceptual patch that does just this and more clearly presents on my
thoughts on the topic.  In particular I tried to cleanup the quadruple
negative sentence (and worse for the whole paragraph) as part of the
refactoring of the currentEventTriggerState comment.

David J.
diff --git a/src/backend/commands/event_trigger.c 
b/src/backend/commands/event_trigger.c
index 91800d1fac..5524535e53 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -662,6 +662,31 @@ EventTriggerCommonSetup(Node *parsetree,
return runlist;
 }
 
+bool
+EventTriggerValidContext(bool requireState)
+{
+   /*
+* See EventTriggerDDLCommandStart for a discussion about why event
+* triggers are disabled in single user mode.
+*/
+   if (!IsUnderPostmaster)
+   return false;
+
+   if (requireState) {
+   /*
+   * Only move forward if our state is set up.  This is required
+   * to handle concurrency - if we proceed, without state already 
set up,
+   * and allow EventTriggerCommonSetup to run it may find triggers 
that
+   * didn't exist when the command started.
+   */
+   if (!currentEventTriggerState)
+   return false;
+   }
+
+   return true;
+}
+
+
 /*
  * Fire ddl_command_start triggers.
  */
@@ -671,23 +696,8 @@ EventTriggerDDLCommandStart(Node *parsetree)
List   *runlist;
EventTriggerData trigdata;
 
-   /*
-* Event Triggers are completely disabled in standalone mode.  There are
-* (at least) two reasons for this:
-*
-* 1. A sufficiently broken event trigger might not only render the
-* database unusable, but prevent disabling itself to fix the situation.
-* In this scenario, restarting in standalone mode provides an escape
-* hatch.
-*
-* 2. BuildEventTriggerCache relies on systable_beginscan_ordered, and
-* therefore will malfunction if pg_event_trigger's indexes are damaged.
-* To allow recovery from a damaged index, we need some operating mode
-* wherein event triggers are disabled.  (Or we could implement
-* heapscan-and-sort logic for that case, but having disaster recovery
-* scenarios depend on code that's otherwise untested isn't appetizing.)
-*/
-   if (!IsUnderPostmaster)
+   /* We pass false for requireState since ... */
+   if (!EventTriggerValidContext(false))
return;
 
runlist = EventTriggerCommonSetup(parsetree,
@@ -719,24 +729,8 @@ EventTriggerDDLCommandEnd(Node *parsetree)
List   *runlist;
EventTriggerData trigdata;
 
-   /*
-* See EventTriggerDDLCommandStart for a discussion about why event
-* triggers are disabled in single user mode.
-*/
-   if (!IsUnderPostmaster)
-   return;
-
-   /*
-* Also do nothing if our state isn't set up, which it won't be if there
-* weren't any relevant event triggers at the start of the current DDL
-* command.  This test might therefore seem optional, but it's important
-* because EventTriggerCommonSetup might find triggers that didn't exist
-* at the time the command started.  Although this function itself
-* wouldn't crash, the event trigger functions would presumably call
-* pg_event_trigger_ddl_commands which would fail.  Better to do nothing
-* until the next command.
-*/
-   if (!currentEventTriggerState)
+   /* Seems not helpful to mention that this might to crash even if there 
is no state... */
+   if (!EventTriggerValidContext(true))
return;
 
runl

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Isaac Morland

On Tue, 12 May 2020 at 18:11, Tom Lane  wrote:

> Alvaro Herrera  writes:
> > (Didn't we have a patch to generate the table programmatically?)
>
> Having now looked around a bit at where the names come from, I think
> such a patch would be impossible as things stand, which is a pity
> because as-things-stand is a totally unmaintainable situation.
> Anybody at all can call LWLockRegisterTranche and thereby create a new
> name that ought to be listed in the SGML table.  Apparently, somebody
> grepped for such calls and put all the ones that existed at the time
> into the table; unsurprisingly, the results are already out of date.
> Several of the hard-wired calls in RegisterLWLockTranches() are not
> reflected in the SGML table AFAICS.
>

I expect there is a reason why this hasn’t been suggested, but just in case
it is at all helpful:

When do these names get created? That is, do all the wait types get created
and registered on startup, or is it more like whenever something needs to
do something the name gets passed in ad hoc? I'm wondering because it seems
like it might be helpful to have a system view which gives all the wait
event types, names, and descriptions. Maybe even add a column for which
extension (or core) it came from. The documentation could then just explain
the general situation and point people at the system view to see exactly
which wait types exist in their system. This would require every instance
where a type is registered to pass an additional parameter — the
description, as currently seen in the table in the documentation.

Of course if the names get passed in ad hoc then such a view could only
show the types that happen to have been created up to the moment it is
queried, which would defeat the purpose. And I can think of a few potential
reasons why this might not work at all, starting with the need to re-write
every example of registering a new type to provide the documentation string
for the view.

Inspiration due to pg_setting, pg_config, pg_available_extensions and
pg_get_keywords ().

Re: BUG #16419: wrong parsing BC year in to_date() function

2020-05-12 Thread David G. Johnston

Redirecting to -hackers for visibility.  I feel there needs to be something
done here, even if just documentation (a bullet in the usage notes section
- and a code comment update for the macro) pointing this out and not
changing any behavior.

David J.

On Wed, May 6, 2020 at 8:12 PM David G. Johnston 
wrote:

> ‪On Wed, May 6, 2020 at 6:31 PM ‫دار الآثار للنشر والتوزيع-صنعاء Dar
> Alathar-Yemen‬‎  wrote:‬
>
>> Any one suppose that these functions return the same:
>> make_date(-1,1,1)
>> to_date('-1-01-01','-mm-dd')
>>
>> But make_date will give 0001-01-01 BC
>>
>> And to_date will give 0002-01-01 BC
>>
>>
>>
> Interesting...and a fair point.
>
> What seems to be happening here is that to_date is trying to be helpful by
> doing:
>
> select to_date('',''); // 0001-01-01 BC
>
> It does this seemingly by subtracting one from the year, making it
> positive, then (I infer) appending "BC" to the result.  Thus for the year
> "-1" it yields "0002-01-01 BC"
>
> make_date just chooses to reject the year 0 and treat the negative as an
> alternative to specifying BC
>
> There seems to be zero tests for to_date involving negative years, and the
> documentation doesn't talk of them.
>
> I'll let the -hackers speak up as to how they want to go about handling
> to_date (research how it behaves in the other database it tries to emulate
> and either document or possibly change the behavior in v14) but do suggest
> that a simple explicit description of how to_date works in the presence of
> negative years be back-patched.  A bullet in the usage notes section
> probably suffices:
>
> "If a  format string captures a negative year, or , it will treat
> it as a BC year after decreasing the value by one.  So  maps to 1 BC
> and -1 maps to 2 BC and so on."
>
> So, no, make_date and to_date do not agree on this point; and they do not
> have to.  There is no way to specify "BC" in make_date function so using
> negative there makes sense.  You can specify BC in the input string for
> to_date and indeed that is the only supported (documented) way to do so.
>
>
[and the next email]


> Specifically:
>
>
> https://github.com/postgres/postgres/blob/fb544735f11480a697fcab791c058adc166be1fa/src/backend/utils/adt/formatting.c#L236
>
> /*
>  * There is no 0 AD.  Years go from 1 BC to 1 AD, so we make it
>  * positive and map year == -1 to year zero, and shift all negative
>  * years up one.  For interval years, we just return the year.
>  */
> #define ADJUST_YEAR(year, is_interval) ((is_interval) ? (year) : ((year)
> <= 0 ? -((year) - 1) : (year)))
>
> The code comment took me a bit to process - seems like the following would
> be better (if its right - I don't know why interval is a pure no-op while
> non-interval normalizes to a positive integer).
>
> Years go from 1 BC to 1 AD, so we adjust the year zero, and all negative
> years, by shifting them away one year,  We then return the positive value
> of the result because the caller tracks the BC/AD aspect of the year
> separately and only deals with positive year values coming out of this
> macro.  Intervals denote the distance away from 0 a year is so we can
> simply take the supplied value and return it.  Interval processing code
> expects a negative result for intervals going into BC.
>
> David J.
>
>

Re: Problem with logical replication

2020-05-12 Thread Euler Taveira

On Tue, 12 May 2020 at 06:36, Masahiko Sawada <
masahiko.saw...@2ndquadrant.com> wrote:

> On Mon, 11 May 2020 at 16:28, Michael Paquier  wrote:
> >
> > On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:
> > > I attached a patch with the described solution. I also included a test
> that
> > > covers this scenario.
> >
> > -   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
> > +   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));
> >
> > Not much a fan of adding a routine to relcache.c to do the work of two
> > routines already present, so I think that we had better add an extra
> > condition based on RelationGetPrimaryKeyIndex, and give up on
> > GetRelationIdentityOrPK() in execReplication.c.
>
> Although, I think this solution is fragile, I updated the patch
accordingly.
(When/If someone changed GetRelationIdentityOrPK() it will break this
assert)


> In any case, it seems to me that the comment of
> build_replindex_scan_key needs to be updated.
>
>  * This is not generic routine, it expects the idxrel to be replication
>  * identity of a rel and meet all limitations associated with that.
>
> It is implicit that a primary key can be a replica identity so I think this
comment is fine.


-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From a858f9253b4bb4ceb8a8c5cd93335499e3ad4673 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Tue, 12 May 2020 19:40:48 -0300
Subject: [PATCH] Fix assert failure with REPLICA IDENTITY FULL in the
 subscriber

This assertion failure can be reproduced using a replicated table in the
subscriber with REPLICA IDENTITY FULL. Since FindReplTupleInLocalRel
uses GetRelationIdentityOrPK to obtain an index and a few calls later
this index is used in build_replindex_scan_key, this function cannot
assert using a function other than GetRelationIdentityOrPK. Although,
GetRelationIdentityOrPK is a generic function that uses other 2
functions; we can the same logic from GetRelationIdentityOrPK into
Assert (thankfully the logic is simple).
---
 src/backend/executor/execReplication.c |  3 ++-
 src/test/subscription/t/001_rep_changes.pl | 12 +++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 1418746eb8..af8f0afb6f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -57,7 +57,8 @@ build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
 	int2vector *indkey = &idxrel->rd_index->indkey;
 	bool		hasnulls = false;
 
-	Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
+	Assert((RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel)) ||
+			(RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel)));
 
 	indclassDatum = SysCacheGetAttr(INDEXRELID, idxrel->rd_indextuple,
 	Anum_pg_index_indclass, &isnull);
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 6da7f71ca3..7d9ea1f002 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -34,6 +34,8 @@ $node_publisher->safe_psql('postgres',
 $node_publisher->safe_psql('postgres',
 	"CREATE TABLE tab_include (a int, b text, CONSTRAINT covering PRIMARY KEY(a) INCLUDE(b))"
 );
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full_pk (a int primary key, b text)");
 # Let this table with REPLICA IDENTITY NOTHING, allowing only INSERT changes.
 $node_publisher->safe_psql('postgres', "CREATE TABLE tab_nothing (a int)");
 $node_publisher->safe_psql('postgres',
@@ -46,6 +48,9 @@ $node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int)");
 $node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
 $node_subscriber->safe_psql('postgres',
 	"CREATE TABLE tab_rep (a int primary key)");
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full_pk (a int primary key, b text)");
+$node_subscriber->safe_psql('postgres', "ALTER TABLE tab_full_pk REPLICA IDENTITY FULL");
 $node_subscriber->safe_psql('postgres', "CREATE TABLE tab_nothing (a int)");
 
 # different column count and order than on publisher
@@ -64,7 +69,7 @@ $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub_ins_only WITH (publish = insert)");
 $node_publisher->safe_psql('postgres',
-	"ALTER PUBLICATION tap_pub ADD TABLE tab_rep, tab_full, tab_full2, tab_mixed, tab_include, tab_nothing"
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_rep, tab_full, tab_full2, tab_mixed, tab_include, tab_nothing, tab_full_pk"
 );
 $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
@@ -102,6 +107,9 @@ $node_publisher->safe_psql('postgres', "UPDATE tab_rep SET a = -a");
 $node_publisher->safe_psql('postgres',
 	"

Re: new heapcheck contrib module

2020-05-12 Thread Peter Geoghegan

On Mon, May 11, 2020 at 10:21 AM Mark Dilger
 wrote:
> 2) amcheck's btree checking functions have been refactored to be able to 
> operate in two modes; the original mode in which all errors are reported via 
> ereport, and a new mode for returning errors as rows from a set returning 
> function.

Somebody suggested that I make amcheck work in this way during its
initial development. I rejected that idea at the time, though. It
seems hard to make it work because the B-Tree index scan is a logical
order index scan. It's quite possible that a corrupt index will have
circular sibling links, and things like that. Making everything an
error removes that concern. There are clearly some failures that we
could just soldier on from, but the distinction gets rather blurred.

I understand why you want to do it this way. It makes sense that the
heap stuff would report all inconsistencies together, at the end. I
don't think that that's really workable (or even desirable) in the
case of B-Tree indexes, though. When an index is corrupt, the solution
is always to do root cause analysis, to make sure that the issue does
not recur, and then to REINDEX. There isn't really a question about
doing data recovery of the index structure.

Would it be possible to log the first B-Tree inconsistency, and then
move on to the next high-level phase of verification? You don't have
to throw an error, but it seems like a good idea for amcheck to still
give up on further verification of the index.

The assertion failure that you reported happens because of a generic
assertion made from _bt_compare(). It doesn't have anything to do with
amcheck (you'll see the same thing from regular index scans), really.
I think that removing that assertion would be the opposite of
hardening. Even if you removed it, the backend will still crash once
you come up with a slightly more evil index tuple. Maybe *that* could
be mostly avoided with widespread hardening; we could in principle
perform cross-checks of varlena headers against the tuple or page
layout at any point reachable from _bt_compare(). That seems like
something that would have unacceptable overhead, because the cost
would be imposed on everything. And even then you've only ameliorated
the problem.

Code like amcheck's PageGetItemIdCareful() goes further than the
equivalent backend macro (PageGetItemId()) to avoid assertion failures
and crashes with corrupt data. I doubt that it is practical to take it
much further than that, though. It's subject to diminishing returns.
In general, _bt_compare() calls user-defined code that is usually
written in C. This C code could in principle feel entitled to do any
number of scary things when you corrupt the input data. The amcheck
module's dependency on user-defined operator code is totally
unavoidable -- it is the single source of truth for the nbtree checks.

It boils down to this: I think that regression tests that run on the
buildfarm and actually corrupt data are not practical, at least in the
case of the index checks -- though probably in all cases. Look at the
pageinspect "btree.out" test output file -- it's very limited, because
we have to work around a bunch of implementation details. It's no
accident that the bt_page_items() test shows a palindrome value in the
data column (the value is "01 00 00 00 00 00 00 01"). That's an
endianness workaround.

-- 
Peter Geoghegan

Re: pg13: xlogreader API adjust

2020-05-12 Thread Alvaro Herrera

On 2020-May-12, Kyotaro Horiguchi wrote:

> I'm not sure the reason for wal_segment_open and WalSndSegmentOpen
> being modified different way about error handling of BasicOpenFile, I
> prefer the WalSndSegmentOpen way.  However, that difference doesn't
> harm anything so I'm fine with the current patch.

Yeah, I couldn't decide which style I liked the most.  I used the one
you suggested.

> + fake_xlogreader.seg = *sendSeg;
> + fake_xlogreader.segcxt = *sendCxt;
> 
> fake_xlogreader.seg is a different instance from *sendSeg.  WALRead
> modifies fake_xlogreader.seg but does not modify *sendSeg. Thus the
> change doesn't persist.  On the other hand WalSndSegmentOpen reads
> *sendSeg, which is not under control of WALRead.
> 
> Maybe we had better to make fake_xlogreader be a global variable of
> walsender.c that covers the current sendSeg and sendCxt.

I tried that.  I was about to leave it at just modifying physical
walsender (simple enough, and it passed tests), but I noticed that
WalSndErrorCleanup() would be a problem because we don't know if it's
physical or logical walsender.  So in the end I added a global
'xlogreader' pointer in walsender.c -- logical walsender sets it to the
true xlogreader it has inside the logical decoding context, and physical
walsender sets it to its fake xlogreader.  That seems to work nicely.
sendSeg/sendCxt are gone entirely.  Logical walsender was doing
WALOpenSegmentInit() uselessly during InitWalSender(), since it was
using the separate sendSeg/sendCxt structs instead of the ones in its
xlogreader.  (Some mysteries become clearer!)  

It's slightly disquieting that the segment_close call in
WalSndErrorCleanup is not covered, but in any case this should work well
AFAICS.  I think this is simpler to understand than formerly.

Now the only silliness remaining is the fact that different users of the
xlogreader interface are doing different things about the TLI.
Hopefully we can unify everything to something sensible one day .. but
that's not going to happen in pg13.

I'll get this pushed tomorrow, unless there are further objections.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 63160d69f734bdb2ed45aa839ec2903f12194d50 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 12 May 2020 19:39:57 -0400
Subject: [PATCH] Adjust walsender usage of xlogreader, simplify APIs

Per comments from Kyotaro Horiguchi
---
 src/backend/access/transam/xlogreader.c | 35 +
 src/backend/access/transam/xlogutils.c  | 16 ++--
 src/backend/replication/walsender.c | 98 -
 src/bin/pg_waldump/pg_waldump.c | 16 ++--
 src/include/access/xlogreader.h | 20 ++---
 src/include/access/xlogutils.h  |  3 +-
 6 files changed, 83 insertions(+), 105 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7cee8b92c9..aae3fee24c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1044,14 +1044,12 @@ err:
 
 /*
  * Helper function to ease writing of XLogRoutine->page_read callbacks.
- * If this function is used, caller must supply an open_segment callback in
+ * If this function is used, caller must supply a segment_open callback in
  * 'state', as that is used here.
  *
  * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
  * fetched from timeline 'tli'.
  *
- * 'seg/segcxt' identify the last segment used.
- *
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
@@ -1061,7 +1059,6 @@ err:
 bool
 WALRead(XLogReaderState *state,
 		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
-		WALOpenSegment *seg, WALSegmentContext *segcxt,
 		WALReadError *errinfo)
 {
 	char	   *p;
@@ -1078,34 +1075,36 @@ WALRead(XLogReaderState *state,
 		int			segbytes;
 		int			readbytes;
 
-		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
+		startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
 		/*
 		 * If the data we want is not in a segment we have open, close what we
 		 * have (if anything) and open the next one, using the caller's
 		 * provided openSegment callback.
 		 */
-		if (seg->ws_file < 0 ||
-			!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
-			tli != seg->ws_tli)
+		if (state->seg.ws_file < 0 ||
+			!XLByteInSeg(recptr, state->seg.ws_segno, state->segcxt.ws_segsize) ||
+			tli != state->seg.ws_tli)
 		{
 			XLogSegNo	nextSegNo;
 
-			if (seg->ws_file >= 0)
+			if (state->seg.ws_file >= 0)
 state->routine.segment_close(state);
 
-			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
-			seg->ws_file = state->routine.segment_open(state, nextSegNo,
-	   segcxt, &tli);
+			XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
+			state->routine.segment_open(state, nextSegNo, &tli);
+
+			/* This shouldn't ha

Re: making update/delete of inheritance trees scale better

2020-05-12 Thread David Rowley

On Wed, 13 May 2020 at 00:54, Ashutosh Bapat
 wrote:
>
> On Mon, May 11, 2020 at 8:11 PM Amit Langote  wrote:
> > > Per row overhead would be incurred for every row whereas the plan time
> > > overhead is one-time or in case of a prepared statement almost free.
> > > So we need to compare it esp. when there are 2000 partitions and all
> > > of them are being updated.
> >
> > I assume that such UPDATEs would be uncommon.
>
> Yes, 2000 partitions being updated would be rare. But many rows from
> the same partition being updated may not be that common. We have to
> know how much is that per row overhead and updating how many rows it
> takes to beat the planning time overhead. If the number of rows is
> very large, we are good.

Rows from a non-parallel Append should arrive in order. If you were
worried about the performance of finding the correct ResultRelInfo for
the tuple that we just got, then we could just cache the tableOid and
ResultRelInfo for the last row, and if that tableoid matches on this
row, just use the same ResultRelInfo as last time.   That'll save
doing the hash table lookup in all cases, apart from when the Append
changes to the next child subplan.  Not sure exactly how that'll fit
in with the foreign table discussion that's going on here though.
Another option would be to not use tableoid and instead inject an INT4
Const (0 to nsubplans) into each subplan's targetlist that serves as
the index into an array of ResultRelInfos.

As for which ResultRelInfos to initialize, couldn't we just have the
planner generate an OidList of all the ones that we could need.
Basically, all the non-pruned partitions. Perhaps we could even be
pretty lazy about building those ResultRelInfos during execution too.
We'd need to grab the locks first, but, without staring at the code, I
doubt there's a reason we'd need to build them all upfront.  That
would help in cases where pruning didn't prune much, but due to
something else in the WHERE clause, the results only come from some
small subset of partitions.

David

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

For the specific category of the heavyweight lock types, I'm
now thinking that we can't change the event names very much, because
those are also exposed in pg_locks' locktype column.  We can be
darn certain, for example, that changing the spelling of "relation"
in that column would break a lot of user queries.  Conceivably we
could decouple the wait event names from the locktype column, but
on the whole that doesn't seem like a great plan.

However, having said that, I remain on the warpath about "speculative
token".  That's an utterly horrid choice for both locktype and wait
event.  I also notice, with no amusement, that "speculative token"
is not documented in the pg_locks documentation.  So I think we should
change it ... but to what, exactly?  Looking at the other existing names:

const char *const LockTagTypeNames[] = {
"relation",
"extend",
"page",
"tuple",
"transactionid",
"virtualxid",
"speculative token",
"object",
"userlock",
"advisory"
};

I'm inclined to propose "spectoken".  I'd be okay with "spec_token" as
well, but there are not underscores in the longer-established names.

(Needless to say, this array is going to gain a comment noting that
there are two places to document any changes.  Also, if we split up
the wait_event table as discussed earlier, it might make sense for
pg_locks' documentation to cross-reference the sub-table for heavyweight
lock events, since that has some explanation of what the codes mean.)

regards, tom lane

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

Alvaro Herrera  writes:
> (Didn't we have a patch to generate the table programmatically?)

Having now looked around a bit at where the names come from, I think
such a patch would be impossible as things stand, which is a pity
because as-things-stand is a totally unmaintainable situation.
Anybody at all can call LWLockRegisterTranche and thereby create a new
name that ought to be listed in the SGML table.  Apparently, somebody
grepped for such calls and put all the ones that existed at the time
into the table; unsurprisingly, the results are already out of date.
Several of the hard-wired calls in RegisterLWLockTranches() are not
reflected in the SGML table AFAICS.

Or, if you don't want to call LWLockRegisterTranche, you can instead
call RequestNamedLWLockTranche.  Whee.  At least there are none of
those in the core code.  However, we do have both autoprewarm and
pg_stat_statements calling these things from contrib.

That raises a policy question, or really two of them: should contrib
code be held to the standards we're going to set forth for tranche
names chosen by core code?  And should contrib-created tranche names
be listed in chapter 27's table?  (If not, should the individual
contrib modules document their additions?)

We could make things a little better perhaps if we got rid of all the
cowboy calls to LWLockRegisterTranche and had RegisterLWLockTranches
make all of them using a single table of names, as it already does
with MainLWLockNames[] but not the other ones.  Then it'd be possible
to have documentation beside that table warning people to add entries
to the SGML docs; and even for the people who can't be bothered to
read comments, at least they'd be adding names to a list of names that
would give them some precedent and context for how to choose a new name.
I think 50% of the problem right now is that if you just write a
random new call to LWLockRegisterTranche in a random new place, you
have no context about what the tranche name should look like.

Even with all the names declared in some reasonably centralized
place(s), we'd be a long way from making the SGML tables programmatically,
because we'd not have text descriptions for the wait events.  I can
imagine extending the source-code conventions a bit to include those
strings there, but I'm doubtful that it's worth the trouble.

regards, tom lane

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Simon Riggs

On Tue, 12 May 2020 at 21:00, Tom Lane  wrote:

> Simon Riggs  writes:
> > On Tue, 12 May 2020 at 19:11, Tom Lane  wrote:
> >> Anyway, I was just throwing this idea out to see if there would be
> >> howls of "you can't rename anything" anguish.  Since there haven't
> >> been so far, I'll spend a bit more time and try to create a concrete
> >> list of possible changes.
>
> > If we add in extensions and lwlocks, will they show up as well?
>
> Yeah, I was just looking into that.  Part of the reason for the
> inconsistency is that we've exposed names that are passed to,
> eg, SimpleLruInit that previously were strictly internal debugging
> identifiers, so that approximately zero thought was put into them.
>
> We're going to have to document SimpleLruInit and similar functions
> along the lines of "The name you give here will be user-visible as
> a wait event.  Choose it with an eye to consistency with existing
> wait event names, and add it to the user-facing documentation."
> But that requirement isn't something I just invented, it was
> effectively created by whoever implemented things this way.
>
> Said user-facing documentation largely fails to explain that the
> set of wait events can be enlarged by extensions; that needs to
> be fixed, too.
>
> There isn't a lot we can do to force extensions to pick consistent
> names, but on the other hand we won't be documenting such names
> anyway, so for my immediate purposes it doesn't matter ;-)
>

 I think we need to plan the namespace with extensions in mind.

There are now dozens; some of them even help you view wait events...

We don't want the equivalent of the Dewey decimal system: 300 categories of
Exaggeration and one small corner for Science.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

Mission Critical Databases

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

Robert Haas  writes:
> On Tue, May 12, 2020 at 4:00 PM Tom Lane  wrote:
>> Said user-facing documentation largely fails to explain that the
>> set of wait events can be enlarged by extensions; that needs to
>> be fixed, too.

> Is that true? How can they do that? I thought they were stuck with
> PG_WAIT_EXTENSION.

Extensions can definitely add new LWLock tranches, and thereby
enlarge the set of names in that category.  I haven't figured out
whether there are other avenues.

regards, tom lane

Re: PG 13 release notes, first draft

2020-05-12 Thread Bruce Momjian

On Tue, May 12, 2020 at 01:47:38PM -0400, Alvaro Herrera wrote:
> On 2020-May-07, Bruce Momjian wrote:
> 
> > OK, I have moved her name to be first.  FYI, this commit was backpatched
> > back through PG 11, though the commit message doesn't mention that.
> 
> At some point I became an avid user of our src/tools/git_changelog, and
> then it stopped making sense for me to highlight in the commit message
> the fact that the commit is back-patched, since it's so obvious there.
> Maybe that's wrong and I should get back in the habit of mentioning it.

Uh, not sure.  I don't need it since, as you said,
src/tools/git_changelog covers it, but someone got confused since they
looked at just the commit message without looking at
src/tools/git_changelog.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: PG 13 release notes, first draft

2020-05-12 Thread Bruce Momjian

On Tue, May 12, 2020 at 09:43:10AM +0200, Fabien COELHO wrote:
> 
> Hello Bruce,
> 
> > OK, section and item added, patch attached,
> 
> Thanks!
> 
> Some items that might be considered for the added documentation section:
> 
>  * e1ff780485

I was told in this email thread to not include that one.

>  * 34a0a81bfb

We already have:

Reformat tables containing function information for better
clarity (Tom Lane)

so it seems it is covered as part of this.

>  * e829337d42

Uh, this is a doc link formatting addition.  I think this falls into the
error message logic, where it is nice when people want it, but they
don't need to know about it ahead of time.

>  * "Document color support (Peter Eisentraut)"
>THIS WAS NOT DOCUMENTED BEFORE?
> 
> Not as such, AFAICR it was vaguely hinted about in the documentation of
> command that would use it, but not even all of them. Now there is a new
> specific section.

Again, this is the first hash you gave.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)

2020-05-12 Thread Tom Lane

Alvaro Herrera  writes:
> Fixed one straggler in contrib, and while testing it I realized why
> ccache doesn't pay attention to the changes I was doing in the file:
> ccache compares the *preprocessed* version of the file and only if that
> differs from the version that was cached last, ccache sends the new one
> to the compiler; and of course these comments are not present in the
> preprocessed version, so changing only the comment accomplishes nothing.
> You have to touch one byte outside of any comments.

Ugh.  So the only way ccache could avoid this is to drop the
preprocessed-file comparison check if -Wimplicit-fallthrough is on.
Doesn't really sound like something we'd want to ask them to do.

> I bet this is going to bite someone ... maybe we'd be better off going
> all the way to -Wimplicit-fallthrough=5 and use the
> __attribute__((fallthrough)) stuff instead.

I'm not really in favor of the __attribute__ solution --- seems too
gcc-specific.  FALLTHROUGH-type comments are understood by other
sorts of tools besides gcc.

In practice, it doesn't seem like this'll be a huge problem once
we're past the initial fixup stage.  We can revisit it later if
that prediction proves wrong, of course.

regards, tom lane

Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)

2020-05-12 Thread Tom Lane

Alvaro Herrera  writes:
> I ended up using level 4 and dialling back to 3 for zic.c only
> (different make trickery though).

Meh ... if we're going to use level 4, I'm inclined to just change zic.c
to match.  It's not like we're using the upstream code verbatim anyway.
We could easily add s/fallthrough/FALLTHROUGH/ to the conversion recipe.

> I get no warnings with this (gcc 8), but ccache seems to save warnings
> in one run so that they can be thrown in a later one.  I'm not sure what
> to make of that, but ccache -d proved that beyond reasonable doubt and
> ccache -clear got rid of the lot.

Sounds like a ccache bug: maybe it's not accounting for different
fallthrough warning levels.  ccache knows a *ton* about gcc options,
so I'd not be surprised if it's doing something special with this one.

regards, tom lane

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

Robert Haas  writes:
> On Tue, May 12, 2020 at 11:16 AM Tom Lane  wrote:
>> I've been trying to reformat table 27.4 (wait events) to fit
>> into PDF output, which has caused me to study its contents
>> more than I ever had before.

> That reminds me that it might be easier to maintain that table if we
> broke it up into one table per major category - that is, one table for
> lwlocks, one table for IPC, one table for IO, etc. - instead of a
> single table with a row-span number that is large and frequently
> updated incorrectly.

Yeah, see my last attempt at

https://www.postgresql.org/message-id/26961.1589260206%40sss.pgh.pa.us

I'm probably going to go with that, but as given that patch conflicts
with my other pending patch to change the catalog description tables,
so I want to push that other one first and then clean up the wait-
event one.  In the meantime, I'm going to look at these naming issues,
which will also be changing that patch ...

regards, tom lane

Re: PG 13 release notes, first draft

2020-05-12 Thread Bruce Momjian

On Tue, May 12, 2020 at 01:09:08PM +0900, Kyotaro Horiguchi wrote:
> > > commit c6b9204
> > > Author: Noah Misch 
> > > AuthorDate: Sat Apr 4 12:25:34 2020 -0700
> > > Commit: Noah Misch 
> > > CommitDate: Sat Apr 4 12:25:34 2020 -0700
> > > 
> > > Skip WAL for new relfilenodes, under wal_level=minimal.
> > > 
> > > Until now, only selected bulk operations (e.g. COPY) did this.  If a
> > > given relfilenode received both a WAL-skipping COPY and a WAL-logged
> > > operation (e.g. INSERT), recovery could lose tuples from the COPY.  
> > > See
> > > src/backend/access/transam/README section "Skipping WAL for New
> > > RelFileNode" for the new coding rules.  Maintainers of table access
> > > methods should examine that section.
> > 
> > OK, so how do we want to document this?  Do I mention in the text below
> > the WAL skipping item that this fixes a bug where a mix of simultaneous
> > COPY and INSERT into a table could lose rows during crash recovery, or
> > create a new item?
> 
> FWIW, as dicussed upthread, I suppose that the API change is not going
> to be in relnotes.
> 
> something like this?
> 
> - Fix bug of WAL-skipping optimiazation 
> 
> Previously a trasaction doing both of COPY and a WAL-logged operations
> like INSERT while wal_level=minimal can lead to loss of COPY'ed rows
> through crash recovery.  Also this fix extends the WAL-skipping
> optimiazation to all kinds of bulk insert operations.

Uh, that kind of mixes the bug fix and the feature in a way that it is
hard to understand.  How about this?

Allow skipping of WAL for new tables and indexes if wal_level is
'minimal' (Kyotaro Horiguchi)

Relations larger than wal_skip_threshold will have their files
fsync'ed rather than writing their WAL records.  Previously this
was done only for COPY operations, but the implementation had a
bug that could cause data loss during crash recovery.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Peter Geoghegan

On Tue, May 12, 2020 at 8:16 AM Tom Lane  wrote:
> I'm not sure what our stance is on version-to-version consistency
> of these names, but I'd like to think that we are not stuck for
> all time with the results of these random coin tosses.

These names are fundamentally implementation details, and
implementation details are subject to change without too much warning.
I think it's okay to change the names for consistency along the lines
you propose. ISTM that it's worth going to a little bit of effort to
preserve any existing names. But not too much.

-- 
Peter Geoghegan

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Alvaro Herrera

On 2020-May-12, Robert Haas wrote:

> That reminds me that it might be easier to maintain that table if we
> broke it up into one table per major category - that is, one table for
> lwlocks, one table for IPC, one table for IO, etc. - instead of a
> single table with a row-span number that is large and frequently
> updated incorrectly.

(Didn't we have a patch to generate the table programmatically?)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: refactoring basebackup.c

2020-05-12 Thread Robert Haas

On Tue, May 12, 2020 at 4:32 AM Dilip Kumar  wrote:
> Some of the bbsink_libpq_* functions are directly calling pq_* e.g.
> bbsink_libpq_begin_backup whereas others are calling SendCopy*
> functions and therein those are calling pq_* functions.  I think
> bbsink_libpq_* function can directly call pq_* functions instead of
> adding one more level of the function call.

I think all the helper functions have more than one caller, though.
That's why I created them - to avoid duplicating code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)

2020-05-12 Thread Alvaro Herrera

On 2020-May-12, Alvaro Herrera wrote:

> I get no warnings with this (gcc 8), but ccache seems to save warnings
> in one run so that they can be thrown in a later one.  I'm not sure what
> to make of that, but ccache -d proved that beyond reasonable doubt and
> ccache -clear got rid of the lot.

Fixed one straggler in contrib, and while testing it I realized why
ccache doesn't pay attention to the changes I was doing in the file:
ccache compares the *preprocessed* version of the file and only if that
differs from the version that was cached last, ccache sends the new one
to the compiler; and of course these comments are not present in the
preprocessed version, so changing only the comment accomplishes nothing.
You have to touch one byte outside of any comments.

I bet this is going to bite someone ... maybe we'd be better off going
all the way to -Wimplicit-fallthrough=5 and use the
__attribute__((fallthrough)) stuff instead.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: tablespace_map code cleanup

2020-05-12 Thread Robert Haas

On Tue, May 12, 2020 at 2:23 AM Amit Kapila  wrote:
> While looking at this, I noticed that caller (perform_base_backup) of
> do_pg_start_backup, sets the backup phase as
> PROGRESS_BASEBACKUP_PHASE_WAIT_CHECKPOINT whereas, in
> do_pg_start_backup, we do collect the information about all
> tablespaces after the checkpoint.  I am not sure if it is long enough
> that we consider having a separate phase for it.   Without your patch,
> it was covered under PROGRESS_BASEBACKUP_PHASE_ESTIMATE_BACKUP_SIZE
> phase which doesn't appear to be a bad idea.

Maybe I'm confused here, but I think the size estimation still *is*
covered under PROGRESS_BASEBACKUP_PHASE_ESTIMATE_BACKUP_SIZE. It's
just that now that happens a bit later. I'm assuming that listing the
tablespaces is pretty cheap, but sizing them is expensive, as you'd
have to iterate over all the files and stat() each one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PG 13 release notes, first draft

2020-05-12 Thread Bruce Momjian

On Mon, May 11, 2020 at 11:38:33PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Mon, May 11, 2020 at 11:08:35PM -0400, Chapman Flack wrote:
> >> 'specify' ?
> 
> > I like that word if Tom prefers it.
> 
> 'specify' works for me.

Sure, done.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: No core file generated after PostgresNode->start

2020-05-12 Thread Robert Haas

On Mon, May 11, 2020 at 10:48 PM Tom Lane  wrote:
> I have a standing note to check the permissions on /cores after any macOS
> upgrade, because every so often Apple decides that that directory ought to
> be read-only.

Thanks, that was my problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)

2020-05-12 Thread Alvaro Herrera

On 2020-May-11, Julien Rouhaud wrote:

> On Mon, May 11, 2020 at 4:40 PM Tom Lane  wrote:
> >
> > Julien Rouhaud  writes:
> > > On Mon, May 11, 2020 at 3:41 PM Tom Lane  wrote:
> > >> Why?  It uses "fallthrough" which is a legal spelling per level 4.
> >
> > > GCC documentation mentions [ \t]*FALLTHR(OUGH|U)[ \t]* for level 4
> > > (out of the view other alternatives), which AFAICT is case sensitive
> > > (level 3 has fall(s | |-)?thr(ough|u)[ \t.!]*(-[^\n\r]*)?).
> >
> > Oh, I'd missed that that was case sensitive.  Ugh --- that seems
> > unreasonable.  Maybe we'd better settle for level 3 after all;
> > I don't think there's much room to doubt the intentions of a
> > comment spelled that way.
> 
> Agreed.

Pushed, thanks.

I ended up using level 4 and dialling back to 3 for zic.c only
(different make trickery though).  I also settled on FALLTHROUGH rather
than FALLTHRU because the latter seems ugly as a spelling to me.  I'm
not a fan of the uppercase, but the alternative would be to add a - or
@s.

I get no warnings with this (gcc 8), but ccache seems to save warnings
in one run so that they can be thrown in a later one.  I'm not sure what
to make of that, but ccache -d proved that beyond reasonable doubt and
ccache -clear got rid of the lot.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Robert Haas

On Tue, May 12, 2020 at 11:16 AM Tom Lane  wrote:
> I've been trying to reformat table 27.4 (wait events) to fit
> into PDF output, which has caused me to study its contents
> more than I ever had before.

That reminds me that it might be easier to maintain that table if we
broke it up into one table per major category - that is, one table for
lwlocks, one table for IPC, one table for IO, etc. - instead of a
single table with a row-span number that is large and frequently
updated incorrectly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Robert Haas

On Tue, May 12, 2020 at 4:00 PM Tom Lane  wrote:
> Said user-facing documentation largely fails to explain that the
> set of wait events can be enlarged by extensions; that needs to
> be fixed, too.

Is that true? How can they do that? I thought they were stuck with
PG_WAIT_EXTENSION.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Robert Haas

> There are a lot of other things that seem inconsistent, but I'm not sure
> how much patience people would have for judgment-call renamings.  An
> example is that "ProcSignalBarrier" is under IO, but why?  Shouldn't it
> be reclassified as IPC?

Hmm, that seems like a goof.

> Other than that, *almost* all the IO events
> are named SomethingRead, SomethingWrite, or SomethingSync, which
> makes sense to me ... should we insist they all follow that pattern?

Maybe, but sometimes module X does more than one kind of
read/write/sync, and I'm not necessarily keen on merging things
together. The whole point of this is to be able to tell where you're
stuck in the code, and the more you merge related things together, the
less you can actually tell that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

Simon Riggs  writes:
> On Tue, 12 May 2020 at 19:11, Tom Lane  wrote:
>> Anyway, I was just throwing this idea out to see if there would be
>> howls of "you can't rename anything" anguish.  Since there haven't
>> been so far, I'll spend a bit more time and try to create a concrete
>> list of possible changes.

> If we add in extensions and lwlocks, will they show up as well?

Yeah, I was just looking into that.  Part of the reason for the
inconsistency is that we've exposed names that are passed to,
eg, SimpleLruInit that previously were strictly internal debugging
identifiers, so that approximately zero thought was put into them.

We're going to have to document SimpleLruInit and similar functions
along the lines of "The name you give here will be user-visible as
a wait event.  Choose it with an eye to consistency with existing
wait event names, and add it to the user-facing documentation."
But that requirement isn't something I just invented, it was
effectively created by whoever implemented things this way.

Said user-facing documentation largely fails to explain that the
set of wait events can be enlarged by extensions; that needs to
be fixed, too.

There isn't a lot we can do to force extensions to pick consistent
names, but on the other hand we won't be documenting such names
anyway, so for my immediate purposes it doesn't matter ;-)

regards, tom lane

Inefficiency in SLRU stats collection

2020-05-12 Thread Tom Lane

I happened to come across this code added by 28cac71bd:

static PgStat_MsgSLRU *
slru_entry(SlruCtl ctl)
{
intidx = pgstat_slru_index(ctl->shared->lwlock_tranche_name);

Assert((idx >= 0) && (idx < SLRU_NUM_ELEMENTS));

return &SLRUStats[idx];
}

which is invoked by all the pgstat_count_slru_XXX routines.
This seems mightily inefficient --- the count functions are
just there to increment integer counters, but first they
have to do up to half a dozen strcmp's to figure out which
counter to increment.

We could improve this by adding another integer field to
SlruSharedData (which is already big enough that no one
would notice) and recording the result of pgstat_slru_index()
there as soon as the lwlock_tranche_name is set.  (In fact,
it looks like we could stop saving the tranche name as such
altogether, thus buying back way more shmem than the integer
field would occupy.)

This does require the assumption that all backends agree
on the SLRU stats index for a particular SLRU cache.  But
AFAICS we're locked into that already, since the backends
use those indexes to tell the stats collector which cache
they're sending stats for.

Thoughts?

regards, tom lane

Re: COPY, lock release and MVCC

2020-05-12 Thread Laurenz Albe

On Tue, 2020-05-12 at 11:50 -0400, Tom Lane wrote:
> Laurenz Albe  writes:
> > On Mon, 2020-05-11 at 15:43 -0400, Robert Haas wrote:
> > > On Fri, May 8, 2020 at 4:58 AM Laurenz Albe  
> > > wrote:
> > > > I happened to notice that COPY TO releases the ACCESS SHARE lock
> > > > on the table right when the command ends rather than holding it
> > > > until the end of the transaction:
> > > That seems inconsistent with what an INSERT statement would do, and thus 
> > > bad.
> > Well, should we fix the code or the documentation?
> 
> I'd agree with fixing the code.  Early lock release is something we do on
> system catalog accesses, and while it hasn't bitten us yet, I've been
> kind of expecting that someday it will.  We should not do it on SQL-driven
> accesses to user tables.
> 
> Having said that, I'd vote for just changing it in HEAD, not
> back-patching.  It's not clear that there are consequences bad enough
> to merit a back-patched behavior change.

Agreed.

Here is a patch.

Yours,
Laurenz Albe
From e75be9e8a649548918e675abc0b28a190d41ab8d Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Tue, 12 May 2020 21:47:40 +0200
Subject: [PATCH] Make COPY TO keep locks until transaction end

COPY TO released the ACCESS SHARE lock immediately when it
was done rather than holding on to it until the end of the
transaction.

This violates the two-phase locking protocol and breaks MCVV:
For example, a REPEATABLE READ transaction could see an
empty table if it repeats a COPY statement and somebody
truncated the table in the meantime.

Before 4dded12faad the lock was also released after
COPY FROM, but the commit failed to notice the irregularity
in COPY TO.

This is old behavior, but the bug doesn't seem important enough
to backpatch.

Author: Laurenz Albe
Reviewed-by: Robert Haas, Tom Lane
Discussion: https://postgr.es/m/7bcfc39d4176faf85ab317d0c26786953646a411.ca...@cybertec.at
---
 src/backend/commands/copy.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ac07f75bc3..0cd8820696 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1081,13 +1081,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 		EndCopyTo(cstate);
 	}
 
-	/*
-	 * Close the relation. If reading, we can release the AccessShareLock we
-	 * got; if writing, we should hold the lock until end of transaction to
-	 * ensure that updates will be committed before lock is released.
-	 */
-	if (rel != NULL)
-		table_close(rel, (is_from ? NoLock : AccessShareLock));
+	table_close(rel, NoLock);
 }
 
 /*
-- 
2.21.3

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Simon Riggs

On Tue, 12 May 2020 at 19:11, Tom Lane  wrote:


> Anyway, I was just throwing this idea out to see if there would be
> howls of "you can't rename anything" anguish.  Since there haven't
> been so far, I'll spend a bit more time and try to create a concrete
> list of possible changes.
>

If we add in extensions and lwlocks, will they show up as well?

-- 
Simon Riggshttp://www.2ndQuadrant.com/

Mission Critical Databases

Re: effective_io_concurrency's steampunk spindle maths

2020-05-12 Thread Peter Geoghegan

On Sun, Mar 15, 2020 at 9:27 PM Thomas Munro  wrote:
> On Tue, Mar 10, 2020 at 12:20 PM Thomas Munro  wrote:
> > Here's a patch set to remove the spindle stuff, add a maintenance
> > variant, and use the maintenance one in
> > heap_compute_xid_horizon_for_tuples().
>
> Pushed.

Shouldn't you close out the "Should we rename
effective_io_concurrency?" Postgres 13 open item now?

-- 
Peter Geoghegan

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-05-12 Thread Peter Geoghegan

On Tue, May 12, 2020 at 11:18 AM Tomas Vondra
 wrote:
> I've pushed both patches, fixing typos and explain format.

Thanks, Tomas.

-- 
Peter Geoghegan

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-05-12 Thread Tomas Vondra


On Sun, May 10, 2020 at 02:25:23PM +0200, Tomas Vondra wrote:

On Sat, May 09, 2020 at 06:48:02PM -0700, Peter Geoghegan wrote:

On Sat, May 9, 2020 at 3:19 PM Tomas Vondra
 wrote:

I'm generally OK with most of this - I'd probably keep the single-line
format, but I don't feel very strongly about that and if others think
using two lines is better ...

Barring objections I'll get this polished and pushed soon-ish (say,
early next week).


I see something about starting a new thread on the Open Items page.
Please CC me on this.

Speaking in my capacity as an RMT member: Glad to see that you plan to
close this item out next week. (I had planned on giving you a nudge
about this, but it looks like I don't really have to now.)



Not sure about about the new thread - the discussion continues on the
main incremental sort thread, I don't see any proposal to start a new
thread there. IMO it'd be pointless at this point.



I've pushed both patches, fixing typos and explain format.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

"Andrey M. Borodin"  writes:
> 3. I think names observed in wait_event and wait_event_type should not 
> duplicate information. i.e. "XidGenLock" is already "LWLock".

Yeah, I'd been wondering about that too: we could strip the "Lock" suffix
from all the names in the LWLock category, and make pg_stat_activity
output a bit narrower.

There are a lot of other things that seem inconsistent, but I'm not sure
how much patience people would have for judgment-call renamings.  An
example is that "ProcSignalBarrier" is under IO, but why?  Shouldn't it
be reclassified as IPC?  Other than that, *almost* all the IO events
are named SomethingRead, SomethingWrite, or SomethingSync, which
makes sense to me ... should we insist they all follow that pattern?

Anyway, I was just throwing this idea out to see if there would be
howls of "you can't rename anything" anguish.  Since there haven't
been so far, I'll spend a bit more time and try to create a concrete
list of possible changes.

regards, tom lane

Re: pgsql: Show opclass and opfamily related information in psql

2020-05-12 Thread Alvaro Herrera



I would appreciate opinions from the patch authors on this ordering
change (rationale in previous email).  I forgot to CC Sergei and Nikita.

> diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
> index 8dca6d8bb4..9bd0bf8356 100644
> --- a/src/bin/psql/describe.c
> +++ b/src/bin/psql/describe.c
> @@ -6288,7 +6288,11 @@ listOpFamilyOperators(const char 
> *access_method_pattern,
>   processSQLNamePattern(pset.db, &buf, family_pattern, 
> have_where, false,
> "nsf.nspname", 
> "of.opfname", NULL, NULL);
>  
> - appendPQExpBufferStr(&buf, "ORDER BY 1, 2, o.amopstrategy, 3;");
> + appendPQExpBufferStr(&buf, "ORDER BY 1, 2,\n"
> +  "  o.amoplefttype = 
> o.amoprighttype DESC,\n"
> +  "  
> pg_catalog.format_type(o.amoplefttype, NULL),\n"
> +  "  
> pg_catalog.format_type(o.amoprighttype, NULL),\n"
> +  "  o.amopstrategy;");
>  
>   res = PSQLexec(buf.data);
>   termPQExpBuffer(&buf);


-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Andrey M. Borodin




> 12 мая 2020 г., в 20:16, Tom Lane  написал(а):
> 
> Thoughts?
> 

I've been coping with cognitive load of these names recently. 2 cents of my 
impressions:
1. Names are somewhat recognisable and seem to have some meaning. But there is 
not so much information about them in the Internet. But I did not try to Google 
them all, just a small subset.
2. Anyway, names should be grepable and googlable, i.e. unique amid identifiers.
3. I think names observed in wait_event and wait_event_type should not 
duplicate information. i.e. "XidGenLock" is already "LWLock".
4. It's hard to tell the difference between "buffer_content", "buffer_io", 
"buffer_mapping", "BufferPin", "BufFileRead", "BufFileWrite" and some others. 
"CLogControlLock" vs "clog"? I'm not sure good DBA can tell the difference 
without looking up into the code.
I hope some thoughts will be useful.

Best regards, Andrey Borodin.

Re: PG 13 release notes, first draft

2020-05-12 Thread Alvaro Herrera

On 2020-May-07, Bruce Momjian wrote:

> OK, I have moved her name to be first.  FYI, this commit was backpatched
> back through PG 11, though the commit message doesn't mention that.

At some point I became an avid user of our src/tools/git_changelog, and
then it stopped making sense for me to highlight in the commit message
the fact that the commit is back-patched, since it's so obvious there.
Maybe that's wrong and I should get back in the habit of mentioning it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] hs_standby_disallowed test fix

2020-05-12 Thread Tom Lane

Fujii Masao  writes:
> I just wonder why standbycheck regression test doesn't run by default
> in buildfarm. Which caused us not to notice this issue long time. Maybe
> because it's difficult to set up hot-standby environment in the
> regression test? If so, we might need to merge standbycheck test into
> TAP tests for recovery.

It seems likely to me that the standbycheck stuff has been completely
obsoleted by the TAP-based recovery tests.  We should get rid of it,
after adding any missing coverage to the TAP tests.

regards, tom lane

Re: gcov coverage data not full with immediate stop

2020-05-12 Thread Tom Lane

Peter Geoghegan  writes:
> On Mon, May 11, 2020 at 1:04 PM Tom Lane  wrote:
>> Yeah.  Traditionally we've waited till the start of the next commitfest
>> (which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide
>> differently).  But it seems like things are slow enough that perhaps
>> we could branch earlier, like June 1, and give the committers a chance
>> to deal with some of their own stuff before starting the CF.

> The RMT discussed this question informally yesterday. The consensus is
> that we should wait and see what the early feedback from Beta 1 is
> before making a final decision. An earlier June 1 branch date is an
> idea that certainly has some merit, but we'd like to put off making a
> final decision on that for at least another week, and possibly as long
> as two weeks.

> Can that easily be accommodated?

There's no real lead time needed AFAICS: when we are ready to branch,
we can just do it.  So sure, let's wait till the end of May to decide.
If things look bad then, we could reconsider again mid-June.

regards, tom lane

Re: gcov coverage data not full with immediate stop

2020-05-12 Thread Peter Geoghegan

On Mon, May 11, 2020 at 1:04 PM Tom Lane  wrote:
> Yeah.  Traditionally we've waited till the start of the next commitfest
> (which I'm assuming is July 1, for lack of an Ottawa dev meeting to decide
> differently).  But it seems like things are slow enough that perhaps
> we could branch earlier, like June 1, and give the committers a chance
> to deal with some of their own stuff before starting the CF.

The RMT discussed this question informally yesterday. The consensus is
that we should wait and see what the early feedback from Beta 1 is
before making a final decision. An earlier June 1 branch date is an
idea that certainly has some merit, but we'd like to put off making a
final decision on that for at least another week, and possibly as long
as two weeks.

Can that easily be accommodated?

-- 
Peter Geoghegan

Re: COPY, lock release and MVCC

2020-05-12 Thread Tom Lane

Laurenz Albe  writes:
> On Mon, 2020-05-11 at 15:43 -0400, Robert Haas wrote:
>> On Fri, May 8, 2020 at 4:58 AM Laurenz Albe  wrote:
>>> I happened to notice that COPY TO releases the ACCESS SHARE lock
>>> on the table right when the command ends rather than holding it
>>> until the end of the transaction:

>> That seems inconsistent with what an INSERT statement would do, and thus bad.

> Well, should we fix the code or the documentation?

I'd agree with fixing the code.  Early lock release is something we do on
system catalog accesses, and while it hasn't bitten us yet, I've been
kind of expecting that someday it will.  We should not do it on SQL-driven
accesses to user tables.

Having said that, I'd vote for just changing it in HEAD, not
back-patching.  It's not clear that there are consequences bad enough
to merit a back-patched behavior change.

regards, tom lane

Re: COPY, lock release and MVCC

2020-05-12 Thread Laurenz Albe

On Mon, 2020-05-11 at 15:43 -0400, Robert Haas wrote:
> On Fri, May 8, 2020 at 4:58 AM Laurenz Albe  wrote:
> > I happened to notice that COPY TO releases the ACCESS SHARE lock
> > on the table right when the command ends rather than holding it
> > until the end of the transaction:
> 
> That seems inconsistent with what an INSERT statement would do, and thus bad.

Well, should we fix the code or the documentation?

Yours,
Laurenz Albe

Re: Our naming of wait events is a disaster.

2020-05-12 Thread Jonah H. Harris

On Tue, May 12, 2020 at 11:16 AM Tom Lane  wrote:

> My inclination is to propose that we settle on the first style
> shown above, which is the majority case now, and rename the
> other events to fit that.  As long as we're breaking compatibility
> anyway, I'd also like to shorten one or two of the very longest
> names, because they're just giving me fits in fixing the PDF
> rendering.  (They would make a mess of the display of
> pg_stat_activity, too, anytime they come up in the field.)
>
> Thoughts?
>

+1

-- 
Jonah H. Harris

Our naming of wait events is a disaster.

2020-05-12 Thread Tom Lane

I've been trying to reformat table 27.4 (wait events) to fit
into PDF output, which has caused me to study its contents
more than I ever had before.  The lack of consistency, or
even any weak attempt at consistency, is not just glaring;
it's embarrassing.

We have a lot of wait event names like these:

XidGenLock
ProcArrayLock
SInvalReadLock
SInvalWriteLock
WALBufMappingLock
WALWriteLock

which are more or less fine; maybe one could wish for having
just one way of capitalizing acronyms not two, but I'll let
that pass.  But could we be satisfied with handling all multi
word names in that style?  Nope:

commit_timestamp
multixact_offset
multixact_member
wal_insert

(and in case you are wondering, yes, "WAL" is also spelled "Wal"
in yet other places.)

And then somebody else, unwilling to use either of those styles,
thought it would be cute to do

Hash/Batch/Allocating
Hash/Batch/Electing
Hash/Batch/Loading
Hash/GrowBatches/Allocating

and all alone in the remotest stretch of left field, we've got

speculative token

(yes, with a space in it).

Also, while the average length of these names exceeds 16 characters,
with such gems as SerializablePredicateLockListLock, think not that
prolixity is the uniform rule:

oldserxid
proc
tbm

Is it unreasonable of me to think that there should be *some*
amount of editorial control over these user-visible names?
At the rock bottom minimum, shouldn't we insist that they all
be legal identifiers?

I'm not sure what our stance is on version-to-version consistency
of these names, but I'd like to think that we are not stuck for
all time with the results of these random coin tosses.

My inclination is to propose that we settle on the first style
shown above, which is the majority case now, and rename the
other events to fit that.  As long as we're breaking compatibility
anyway, I'd also like to shorten one or two of the very longest
names, because they're just giving me fits in fixing the PDF
rendering.  (They would make a mess of the display of
pg_stat_activity, too, anytime they come up in the field.)

Thoughts?

regards, tom lane

Re: making update/delete of inheritance trees scale better

2020-05-12 Thread Tom Lane

Amit Langote  writes:
> On Tue, May 12, 2020 at 5:25 AM Robert Haas  wrote:
>> Ah, that makes sense. If we can invent dummy columns on the parent
>> rel, then most of what I was worrying about no longer seems very
>> worrying.

> IIUC, the idea is to have "dummy" columns in the top parent's
> reltarget for every junk TLE added to the top-level targetlist by
> child tables' FDWs that the top parent itself can't emit. But we allow
> these FDW junk TLEs to contain any arbitrary expression, not just
> plain Vars [1], so what node type are these dummy parent columns?

We'd have to group the children into groups that share the same
row-identity column type.  This is why I noted way-back-when that
it'd be a good idea to discourage FDWs from being too wild about
what they use for row identity.

(Also, just to be totally clear: I am *not* envisioning this as a
mechanism for FDWs to inject whatever computations they darn please
into query trees.  It's for the row identity needed by UPDATE/DELETE,
and nothing else.  That being the case, it's hard to understand why
the bottom-level Vars wouldn't be just plain Vars --- maybe "system
column" Vars or something like that, but still just Vars, not
expressions.)

regards, tom lane

Re: 2020-05-14 Press Release Draft

2020-05-12 Thread Jonathan S. Katz

Hi,

On 5/10/20 10:08 PM, Jonathan S. Katz wrote:
> Hi,
> 
> Attached is a draft of the press release for the 2020-05-14 cumulative
> update. Please let me know your feedback by 2020-05-13 :)

Thank you for the feedback. As per usual, I applied some combination of
{all, some, none}.

Please see v2.

Thanks again for the review!

Jonathan
2020-05-14 Cumulative Update Release

The PostgreSQL Global Development Group has released an update to all supported
versions of our database system, including 12.3, 11.8, 10.13, 9.6.18, and
9.5.22.  This release fixes more than 75 bugs reported over the last three
months.

Users should plan to update at their earliest convenience.

Bug Fixes and Improvements
--

This update also fixes over 75 bugs that were reported in the last several
months. Some of these issues affect only version 12, but may also affect all
supported versions.

Some of these fixes include:

* Several fixes for GENERATED columns, including an issue where it was possible
to crash or corrupt data in a table when the output of the generated column was
the exact copy of a physical column on the table, e.g. if the expression called
a function which could return its own input.
* Several fixes for `ALTER TABLE`, including ensuring the `SET STORAGE`
directive is propagated to a table's indexes.
* Fix a potential race condition when using `DROP OWNED BY` while another
session is deleting the same objects.
* Allow for a partition to be detached when it has inherited ROW triggers.
* Several fixes for `REINDEX CONCURRENTLY`, particularly with issues when a
`REINDEX CONCURRENTLY` operation fails.
* Fix crash when COLLATE is applied to an uncollatable type in a partition bound
expression.
* Fix performance regression in floating point overflow/underflow detection.
* Several fixes for full text search, particularly with phrase searching.
* Fix query-lifespan memory leak for a set-returning function used in a query's
FROM clause.
* Several reporting fixes for the output of `VACUUM VERBOSE`.
* Allow input of type circle to accept the format `(x,y),r`, which is specified
in the documentation.
* Allow for the `get_bit()` and `set_bit()` functions to not fail on `bytea`
strings longer than 256MB.
* Avoid premature recycling of WAL segments during crash recovery, which could
lead to WAL segments being recycled before being archived.
* Avoid attempting to fetch nonexistent WAL files from archive storage during
recovery by skipping irrelevant timelines.
* Several fixes for logical replication and replication slots.
* Fix several race conditions in synchronous standby management, including one
that occurred when changing the `synchronous_standby_names` setting.
* Several fixes for GSSAPI support, include a fix for a memory leak that
occurred when using GSSAPI encryption.
* Ensure that members of the `pg_read_all_stats` role can read all statistics
views.
* Fix performance regression in information_schema.triggers view.
* Fix memory leak in libpq when using `sslmode=verify-full`.
* Fix crash in `psql` when attempting to re-establish a failed connection.
* Allow tab-completion of the filename argument to `\gx` command in `psql`.
* Add `pg_dump` support for `ALTER ... DEPENDS ON EXTENSION`.
* Several other fixes for `pg_dump`, which include dumping comments on RLS
policies and postponing restore of event triggers until the end.
* Ensure `pg_basebackup` generates valid tar files.
* `pg_checksums` skips tablespace subdirectories that belong to a different
PostgreSQL major version
* Several Windows compatibility fixes

This update also contains tzdata release 2020a for DST law changes in Morocco
and the Canadian Yukon, plus historical corrections for Shanghai. The
America/Godthab zone has been renamed to America/Nuuk to reflect current
English usage; however, the old name remains available as a compatibility link.
This also updates initdb's list of known Windows time zone names to include
recent additions.

For the full list of changes available, please review the
[release notes](https://www.postgresql.org/docs/current/release.html).

Updating

All PostgreSQL update releases are cumulative. As with other minor releases,
users are not required to dump and reload their database or use `pg_upgrade` in
order to apply this update release; you may simply shutdown PostgreSQL and
update its binaries.

Users who have skipped one or more update releases may need to run additional,
post-update steps; please see the release notes for earlier versions for
details.

For more details, please see the
[release notes](https://www.postgresql.org/docs/current/release.html).

Links
-
* [Download](https://www.postgresql.org/download/)
* [Release Notes](https://www.postgresql.org/docs/current/release.html)
* [Security Page](https://www.postgresql.org/support/security/)
* [Versioning Policy](https://www.postgresql.org/support/versioning/)
* [Follow @postgresql on Twitter](https://

Re: making update/delete of inheritance trees scale better

2020-05-12 Thread Amit Langote

On Tue, May 12, 2020 at 5:25 AM Robert Haas  wrote:
> On Mon, May 11, 2020 at 4:22 PM Tom Lane  wrote:
> > Robert Haas  writes:
> > > If the parent is RTI 1, and the children are RTIs 2..6, what
> > > varno/varattno will we use in RTI 1's tlist to represent a column that
> > > exists in both RTI 2 and RTI 3 but not in RTI 1, 4, 5, or 6?
> >
> > Fair question.  We don't have any problem representing the column
> > as it exists in any one of those children, but we lack a notation
> > for the "union" or whatever you want to call it, except in the case
> > where the parent relation has a corresponding column.  Still, this
> > doesn't seem that hard to fix.  My inclination would be to invent
> > dummy parent-rel columns (possibly with negative attnums? not sure if
> > that'd be easier or harder than adding them in the positive direction)
> > to represent such "union" columns.
>
> Ah, that makes sense. If we can invent dummy columns on the parent
> rel, then most of what I was worrying about no longer seems very
> worrying.

IIUC, the idea is to have "dummy" columns in the top parent's
reltarget for every junk TLE added to the top-level targetlist by
child tables' FDWs that the top parent itself can't emit. But we allow
these FDW junk TLEs to contain any arbitrary expression, not just
plain Vars [1], so what node type are these dummy parent columns?  I
can see from add_vars_to_targetlist() that we allow only Vars and
PlaceHolderVars to be present in a relation's reltarget->exprs, but
neither of those seem suitable for the task.

Once we get something in the parent's reltarget->exprs representing
these child expressions, from there they go back into child
reltargets, so it would appear that our appendrel transformation code
must somehow be taught to deal with these dummy columns.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] 
https://www.postgresql.org/docs/current/fdw-callbacks.html#FDW-CALLBACKS-UPDATE

"...If the extra expressions are more complex than simple Vars, they
must be run through eval_const_expressions before adding them to the
targetlist."

Re: making update/delete of inheritance trees scale better

2020-05-12 Thread Ashutosh Bapat

On Mon, May 11, 2020 at 8:11 PM Amit Langote  wrote:
> > Per row overhead would be incurred for every row whereas the plan time
> > overhead is one-time or in case of a prepared statement almost free.
> > So we need to compare it esp. when there are 2000 partitions and all
> > of them are being updated.
>
> I assume that such UPDATEs would be uncommon.

Yes, 2000 partitions being updated would be rare. But many rows from
the same partition being updated may not be that common. We have to
know how much is that per row overhead and updating how many rows it
takes to beat the planning time overhead. If the number of rows is
very large, we are good.

>
> > But generally I agree that this would be a
> > better approach. It might help using PWJ when the result relation
> > joins with other partitioned table.
>
> It does, because the plan below ModifyTable is same as if the query
> were SELECT instead of UPDATE; with my PoC:
>
> explain (costs off) update foo set a = foo2.a + 1 from foo foo2 where
> foo.a = foo2.a;
> QUERY PLAN
> --
>  Update on foo
>Update on foo_1
>Update on foo_2
>->  Append
>  ->  Merge Join
>Merge Cond: (foo_1.a = foo2_1.a)
>->  Sort
>  Sort Key: foo_1.a
>  ->  Seq Scan on foo_1
>->  Sort
>  Sort Key: foo2_1.a
>  ->  Seq Scan on foo_1 foo2_1
>  ->  Merge Join
>Merge Cond: (foo_2.a = foo2_2.a)
>->  Sort
>  Sort Key: foo_2.a
>  ->  Seq Scan on foo_2
>->  Sort
>  Sort Key: foo2_2.a
>  ->  Seq Scan on foo_2 foo2_2
> (20 rows)

Wonderful. That looks good.


> > Can we plan the scan query to add a sort node to order the rows by tableoid?
>
> Hmm, I am afraid that some piece of partitioning code that assumes a
> certain order of result relations, and that order is not based on
> sorting tableoids.

I am suggesting that we override that order (if any) in
create_modifytable_path() or create_modifytable_plan() by explicitly
ordering the incoming paths on tableoid. May be using MergeAppend.


>
> > * Tuple re-routing during UPDATE. For now it's disabled so your design
> > should work. But we shouldn't design this feature in such a way that
> > it comes in the way to enable tuple re-routing in future :).
>
> Sorry, what is tuple re-routing and why does this new approach get in its way?

An UPDATE causing a tuple to move to a different partition. It would
get in its way since the tuple will be located based on tableoid,
which will be the oid of the old partition. But I think this approach
has higher chance of being able to solve that problem eventually
rather than the current approach.
-- 
Best Wishes,
Ashutosh Bapat

repeat() function, CHECK_FOR_INTERRUPTS(), and unlikely()

2020-05-12 Thread Joe Conway

I was doing some memory testing under fractional CPU allocations and it became
painfully obvious that the repeat() function needs CHECK_FOR_INTERRUPTS().

I exchanged a few emails offlist with Tom about it, and (at the risk of putting
words in his mouth) he agreed and felt it was a candidate for backpatching.

Very small patch attached. Quick and dirty performance test:

explain analyze SELECT repeat('A', 3);
explain analyze SELECT repeat('A', 3);
explain analyze SELECT repeat('A', 3);

With an -O2 optimized build:

Without CHECK_FOR_INTERRUPTS

 Planning Time: 1077.238 ms
 Execution Time: 0.016 ms

 Planning Time: 1080.381 ms
 Execution Time: 0.013 ms

 Planning Time: 1072.049 ms
 Execution Time: 0.013 ms

With CHECK_FOR_INTERRUPTS

 Planning Time: 1078.703 ms
 Execution Time: 0.013 ms

 Planning Time: 1077.495 ms
 Execution Time: 0.013 ms

 Planning Time: 1076.793 ms
 Execution Time: 0.013 ms


While discussing the above, Tom also wondered whether we should add unlikely()
to the CHECK_FOR_INTERRUPTS() macro.

Small patch for that also attached. I was not sure about the WIN32 stanza on
that (to do it or not; if so, what about the UNBLOCKED_SIGNAL_QUEUE() test).

I tested as above with unlikely() and did not see any discernible difference,
but the added check might improve other code paths.

Comments or objections?

Thanks,

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
diff --git a/src/backend/utils/adt/oracle_compat.c b/src/backend/utils/adt/oracle_compat.c
index 0fdfee5..ecd8268 100644
*** a/src/backend/utils/adt/oracle_compat.c
--- b/src/backend/utils/adt/oracle_compat.c
***
*** 19,25 
  #include "utils/builtins.h"
  #include "utils/formatting.h"
  #include "mb/pg_wchar.h"
! 
  
  static text *dotrim(const char *string, int stringlen,
  	const char *set, int setlen,
--- 19,25 
  #include "utils/builtins.h"
  #include "utils/formatting.h"
  #include "mb/pg_wchar.h"
! #include "miscadmin.h"
  
  static text *dotrim(const char *string, int stringlen,
  	const char *set, int setlen,
*** repeat(PG_FUNCTION_ARGS)
*** 1062,1067 
--- 1062,1068 
  	{
  		memcpy(cp, sp, slen);
  		cp += slen;
+ 		CHECK_FOR_INTERRUPTS();
  	}
  
  	PG_RETURN_TEXT_P(result);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 61a24c2..0467a8e 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*** extern void ProcessInterrupts(void);
*** 98,104 
  
  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (InterruptPending) \
  		ProcessInterrupts(); \
  } while(0)
  #else			/* WIN32 */
--- 98,104 
  
  #define CHECK_FOR_INTERRUPTS() \
  do { \
! 	if (unlikely(InterruptPending)) \
  		ProcessInterrupts(); \
  } while(0)
  #else			/* WIN32 */
*** do { \
*** 107,113 
  do { \
  	if (UNBLOCKED_SIGNAL_QUEUE()) \
  		pgwin32_dispatch_queued_signals(); \
! 	if (InterruptPending) \
  		ProcessInterrupts(); \
  } while(0)
  #endif			/* WIN32 */
--- 107,113 
  do { \
  	if (UNBLOCKED_SIGNAL_QUEUE()) \
  		pgwin32_dispatch_queued_signals(); \
! 	if (unlikely(InterruptPending)) \
  		ProcessInterrupts(); \
  } while(0)
  #endif			/* WIN32 */


signature.asc
Description: OpenPGP digital signature

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-05-12 Thread Amit Kapila

On Thu, May 7, 2020 at 6:17 PM Dilip Kumar  wrote:
>
> On Tue, May 5, 2020 at 7:13 PM Dilip Kumar  wrote:
>
> I have fixed one more issue in 0010 patch.  The issue was that once
> the transaction is serialized due to the incomplete toast after
> streaming the serialized store was not cleaned up so it was streaming
> the same tuple multiple times.
>

I have reviewed a few patches (003, 004, and 005) and below are my comments.

v20-0003-Extend-the-output-plugin-API-with-stream-methods

1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   int nrelations, Relation relations[],
+   ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

In the above and similar APIs, there are parameters like relation
which are not used.  I think you should add some comments atop these
APIs to explain why it is so? I guess it is because we want to keep
them similar to non-stream version of APIs and we can't display
relation or other information as the transaction is still in-progress.

2.
+   
+Similar to spill-to-disk behavior, streaming is triggered when the total
+amount of changes decoded from the WAL (for all in-progress transactions)
+exceeds limit defined by
logical_decoding_work_mem setting.
+At that point the largest toplevel transaction (measured by
amount of memory
+currently used for decoded changes) is selected and streamed.
+   

I think we need to explain here the cases/exception where we need to
spill even when stream is enabled and check if this is per latest
implementation, otherwise, update it.

3.
+ * To support streaming, we require change/commit/abort callbacks. The
+ * message callback is optional, similarly to regular output plugins.

/similarly/similar

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here?  I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

5.
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_stop";
+ /* state.report_location = apply_lsn; */

Can't we report txn->final_lsn here?

6. I think it will be good if we can provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. I think we
can also explain there why the user is not expected to see the actual
data in the stream.


v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt

7.
+ /*
+ * We don't expect direct calls to table_tuple_get_latest_tid with valid
+ * CheckXidAlive  for catalog or regular tables.

There is an extra space between 'CheckXidAlive' and 'for'.  I can see
similar problems in other places as well where this comment is used,
fix those as well.

8.
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction. To ensure that we
+ * check whether the CheckXidAlive is aborted after fetching the tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */

In this comment, there is an inconsistency in the space used after
completing the sentence. In the part "transaction. To", single space
is used whereas at other places two spaces are used after a full stop.

v20-0005-Implement-streaming

Re: Global snapshots

2020-05-12 Thread Andrey Lepikhov


Rebased onto current master (fb544735f1).

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
>From 29183c42a8ae31b830ab5af0dfcfdaadd6229700 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Tue, 12 May 2020 08:29:54 +0500
Subject: [PATCH 1/3] GlobalCSNLog-SLRU-v3

---
 src/backend/access/transam/Makefile |   1 +
 src/backend/access/transam/global_csn_log.c | 439 
 src/backend/access/transam/twophase.c   |   1 +
 src/backend/access/transam/varsup.c |   2 +
 src/backend/access/transam/xlog.c   |  12 +
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/ipc/procarray.c |   3 +
 src/backend/storage/lmgr/lwlocknames.txt|   1 +
 src/backend/tcop/postgres.c |   1 +
 src/backend/utils/misc/guc.c|   9 +
 src/backend/utils/probes.d  |   2 +
 src/bin/initdb/initdb.c |   3 +-
 src/include/access/global_csn_log.h |  30 ++
 src/include/storage/lwlock.h|   1 +
 src/include/utils/snapshot.h|   3 +
 15 files changed, 510 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/global_csn_log.c
 create mode 100644 src/include/access/global_csn_log.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..60ff8b141e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	global_csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/global_csn_log.c b/src/backend/access/transam/global_csn_log.c
new file mode 100644
index 00..6f7fded350
--- /dev/null
+++ b/src/backend/access/transam/global_csn_log.c
@@ -0,0 +1,439 @@
+/*-
+ *
+ * global_csn_log.c
+ *		Track global commit sequence numbers of finished transactions
+ *
+ * Implementation of cross-node transaction isolation relies on commit sequence
+ * number (CSN) based visibility rules.  This module provides SLRU to store
+ * CSN for each transaction.  This mapping need to be kept only for xid's
+ * greater then oldestXid, but that can require arbitrary large amounts of
+ * memory in case of long-lived transactions.  Because of same lifetime and
+ * persistancy requirements this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/global_csn_log.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "access/global_csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool track_global_snapshots;
+
+/*
+ * Defines for GlobalCSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0x,
+ * GlobalCSNLog page numbering also wraps around at
+ * 0x/GLOBAL_CSN_LOG_XACTS_PER_PAGE, and GlobalCSNLog segment numbering at
+ * 0x/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateGlobalCSNLog (see GlobalCSNLogPagePrecedes).
+ */
+
+/* We store the commit GlobalCSN for each xid */
+#define GCSNLOG_XACTS_PER_PAGE (BLCKSZ / sizeof(GlobalCSN))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData GlobalCSNLogCtlData;
+#define GlobalCsnlogCtl (&GlobalCSNLogCtlData)
+
+static int	ZeroGlobalCSNLogPage(int pageno);
+static bool GlobalCSNLogPagePrecedes(int page1, int page2);
+static void GlobalCSNLogSetPageStatus(TransactionId xid, int nsubxids,
+	  TransactionId *subxids,
+	  GlobalCSN csn, int pageno);
+static void GlobalCSNLogSetCSNInSlot(TransactionId xid, GlobalCSN csn,
+	  int slotno);
+
+/*
+ * GlobalCSNLogSetCSN
+ *
+ * Record GlobalCSN of transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transactionid for a top level commit or abort. It can also be a
+ * subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * csn is the commit sequence number of the transaction. It should b

Re: No core file generated after PostgresNode->start

2020-05-12 Thread Andy Fan

On Tue, May 12, 2020 at 3:36 AM Robert Haas  wrote:

> On Sun, May 10, 2020 at 11:21 PM Andy Fan 
> wrote:
> > Looks this doesn't mean a crash.   If the test case(subscription/t/
> 013_partition.pl)
> > failed,  test framework kill some process, which leads the above
> message.  So you can
> > ignore this issue now.  Thanks
>
> I think there might be a real issue here someplace, though, because I
> couldn't get a core dump last week when I did have a crash happening
> locally.

I forget to say the failure happens on my modified version, I guess this is
what
happened in my case (subscription/t/013_partition.pl ).

1.  It need to read data from slave, however it get ERROR,  elog(ERROR, ..)
rather crash.
2.  The test framework knows the case failed, so it kill the primary in
some way.
3.  The primary raises the error below.

2020-05-11 09:37:40.778 CST [69541] sub_viaroot WARNING:  terminating
connection because of crash of another server process

2020-05-11 09:37:40.778 CST [69541] sub_viaroot DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit,
because another server process exited abnormally and possibly corrupted
shared memory.

Finally I get the root cause  by looking into the error log in slave.
After I fix
my bug, the issue gone.

Best Regards
Andy Fan

Re: Problem with logical replication

2020-05-12 Thread Masahiko Sawada

On Mon, 11 May 2020 at 16:28, Michael Paquier  wrote:
>
> On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:
> > I attached a patch with the described solution. I also included a test that
> > covers this scenario.
>
> -   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
> +   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));
>
> Not much a fan of adding a routine to relcache.c to do the work of two
> routines already present, so I think that we had better add an extra
> condition based on RelationGetPrimaryKeyIndex, and give up on
> GetRelationIdentityOrPK() in execReplication.c.

+1

In any case, it seems to me that the comment of
build_replindex_scan_key needs to be updated.

 * This is not generic routine, it expects the idxrel to be replication
 * identity of a rel and meet all limitations associated with that.

For example, we can update the above:

 * This is not generic routine, it expects the idxrel to be replication
 * identity or primary key of a rel and meet all limitations associated
* with that.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: refactoring basebackup.c

2020-05-12 Thread Dilip Kumar

On Sat, May 9, 2020 at 2:23 AM Robert Haas  wrote:
>
> Hi,
>
> I'd like to propose a fairly major refactoring of the server's
> basebackup.c. The current code isn't horrific or anything, but the
> base backup mechanism has grown quite a few features over the years
> and all of the code knows about all of the features. This is going to
> make it progressively more difficult to add additional features, and I
> have a few in mind that I'd like to add, as discussed below and also
> on several other recent threads.[1][2] The attached patch set shows
> what I have in mind. It needs more work, but I believe that there's
> enough here for someone to review the overall direction, and even some
> of the specifics, and hopefully give me some useful feedback.
>
> This patch set is built around the idea of creating two new
> abstractions, a base backup sink -- or bbsink -- and a base backup
> archiver -- or bbarchiver.  Each of these works like a foreign data
> wrapper or custom scan or TupleTableSlot. That is, there's a table of
> function pointers that act like method callbacks. Every implementation
> can allocate a struct of sufficient size for its own bookkeeping data,
> and the first member of the struct is always the same, and basically
> holds the data that all implementations must store, including a
> pointer to the table of function pointers. If we were using C++,
> bbarchiver and bbsink would be abstract base classes.
>
> They represent closely-related concepts, so much so that I initially
> thought we could get by with just one new abstraction layer. I found
> on experimentation that this did not work well, so I split it up into
> two and that worked a lot better. The distinction is this: a bbsink is
> something to which you can send a bunch of archives -- currently, each
> would be a tarfile -- and also a backup manifest. A bbarchiver is
> something to which you send every file in the data directory
> individually, or at least the ones that are getting backed up, plus
> any that are being injected into the backup (e.g. the backup_label).
> Commonly, a bbsink will do something with the data and then forward it
> to a subsequent bbsink, or a bbarchiver will do something with the
> data and then forward it to a subsequent bbarchiver or bbsink. For
> example, there's a bbarchiver_tar object which, like any bbarchiver,
> sees all the files and their contents as input. The output is a
> tarfile, which gets send to a bbsink. As things stand in the patch set
> now, the tar archives are ultimately sent to the "libpq" bbsink, which
> sends them to the client.
>
> In the future, we could have other bbarchivers. For example, we could
> add "pax", "zip", or "cpio" bbarchiver which produces archives of that
> format, and any given backup could choose which one to use. Or, we
> could have a bbarchiver that runs each individual file through a
> compression algorithm and then forwards the resulting data to a
> subsequent bbarchiver. That would make it easy to produce a tarfile of
> individually compressed files, which is one possible way of creating a
> seekable achive.[3] Likewise, we could have other bbsinks. For
> example, we could have a "localdisk" bbsink that cause the server to
> write the backup somewhere in the local filesystem instead of
> streaming it out over libpq. Or, we could have an "s3" bbsink that
> writes the archives to S3. We could also have bbsinks that compresses
> the input archives using some compressor (e.g. lz4, zstd, bzip2, ...)
> and forward the resulting compressed archives to the next bbsink in
> the chain. I'm not trying to pass judgement on whether any of these
> particular things are things we want to do, nor am I saying that this
> patch set solves all the problems with doing them. However, I believe
> it will make such things a whole lot easier to implement, because all
> of the knowledge about whatever new functionality is being added is
> centralized in one place, rather than being spread across the entirety
> of basebackup.c. As an example of this, look at how 0010 changes
> basebackup.c and basebackup_tar.c: afterwards, basebackup.c no longer
> knows anything that is tar-specific, whereas right now it knows about
> tar-specific things in many places.
>
> Here's an overview of this patch set:
>
> 0001-0003 are cleanup patches that I have posted for review on
> separate threads.[4][5] They are included here to make it easy to
> apply this whole series if someone wishes to do so.
>
> 0004 is a minor refactoring that reduces by 1 the number of functions
> in basebackup.c that know about the specifics of tarfiles. It is just
> a preparatory patch and probably not very interesting.
>
> 0005 invents the bbsink abstraction.
>
> 0006 creates basebackup_libpq.c and moves all code that knows about
> the details of sending archives via libpq there. The functionality is
> exposed for use by basebackup.c as a new type of bbsink, bbsink_libpq.
>
> 0007 creates basebackup_throttle.c and moves all cod

Re: PG 13 release notes, first draft

2020-05-12 Thread Fabien COELHO




Hello Bruce,


OK, section and item added, patch attached,


Thanks!

Some items that might be considered for the added documentation section:

 * e1ff780485
 * 34a0a81bfb
 * e829337d42

 * "Document color support (Peter Eisentraut)"
   THIS WAS NOT DOCUMENTED BEFORE?

Not as such, AFAICR it was vaguely hinted about in the documentation of 
command that would use it, but not even all of them. Now there is a new 
specific section.


--
Fabien!

75 matches

Mail list logo