Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-28 Thread Amit Kapila
On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra wrote:
>
> On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:
> >On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra wrote:
>
> I do think having a separate GUC is a must, irrespective of what other
> GUC (if any) is used as a default. You're right the maintenance_work_mem
> value might be too high (e.g. in cases with many subscriptions), but the
> same issue applies to work_mem - there's no guarantee work_mem is lower
> than maintenance_work_mem, and in analytics databases it may be set very
> high. So work_mem does not really solve the issue.
>
> IMHO we can't really do without a new GUC. It's not difficult to create
> examples that would benefit from small/large memory limit, depending on
> the number of subscriptions etc.
>
> I do however agree the GUC does not have to be tied to any existing one,
> it was just an attempt to use a more sensible default value. I do think
> m_w_m would be fine, but I can live with using an explicit value.
>
> So that's what I did in the attached patch - I've renamed the GUC to
> logical_decoding_work_mem, detached it from m_w_m and set the default to
> 64MB (i.e. the same default as m_w_m).

Fair enough, let's not argue more on this unless someone else wants to
share his opinion.

> It should also fix all the issues
> from the recent reviews (at least I believe so).
>

Have you given any thought to creating a test case for this patch?  I
think you also said that you would test some worst-case scenarios and
report the numbers so that we are convinced that the current eviction
algorithm is good.

> I've realized that one of the subsequent patches allows overriding the
> limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
> I think it'd be good to move this bit forward, but I think it can be
> done in a separate patch.
>

Yeah, it is better to deal with it separately, as I am also not entirely
convinced at this stage about this parameter.  I have mentioned the
same in the previous email as well.
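
Just for context, I'd imagine the per-subscription override you mention
would end up looking roughly like the following; the option name here is
purely hypothetical, nothing is settled yet:

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub
    WITH (work_mem = '32MB');  -- hypothetical per-subscription memory limit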

While glancing through the changes, I noticed a small thing:
+#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem

I guess this needs to be updated.
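
For illustration, with the GUC detached from maintenance_work_mem, setting
it would presumably just look like this (a sketch based on the patch as
described above, not something I've re-tested):

ALTER SYSTEM SET logical_decoding_work_mem = '64MB';  -- the patch's default
SELECT pg_reload_conf();
SHOW logical_decoding_work_mem;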

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-28 Thread Dilip Kumar
On Thu, Sep 26, 2019 at 11:38 PM Tomas Vondra wrote:
>
> No, that's a good question, and I'm not sure what the answer is at the
> moment. My understanding was that the infrastructure in the 2PC patch is
> enough even for subtransactions, but I might be wrong. I need to think
> about that for a while.
>
IIUC, for 2PC it's enough to check whether the main transaction is
aborted or not, but for an in-progress transaction it's possible that
the current subtransaction has made catalog changes and gets aborted
while we are decoding.  So we need to extend the infrastructure so that
we can check the status of the transaction whose change we are
decoding.  Also, I think we need to handle ERRCODE_TRANSACTION_ROLLBACK
and ignore it.

I have attached a small patch to handle this which can be applied on
top of your patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


handle_concurrent_abort_for_in_progress_transaction.patch
Description: Binary data


Re: [DOC] Document concurrent index builds waiting on each other

2019-09-28 Thread James Coleman
On Sat, Sep 28, 2019 at 9:56 PM Bruce Momjian  wrote:
>
> On Sat, Sep 28, 2019 at 09:54:48PM -0400, James Coleman wrote:
> > On Sat, Sep 28, 2019 at 9:22 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2019-Sep-28, Bruce Momjian wrote:
> > >
> > > > The CREATE INDEX docs already say:
> > > >
> > > > In a concurrent index build, the index is actually entered into
> > > > the system catalogs in one transaction, then two table scans occur in
> > > > two more transactions.  Before each table scan, the index build must
> > > > wait for existing transactions that have modified the table to terminate.
> > > > After the second scan, the index build must wait for any transactions
> > > > --> that have a snapshot (see ) predating the second
> > > > --> scan to terminate.  Then finally the index can be marked ready for use,
> > > >
> > > > So, having multiple concurrent index scans is just a special case of
> > > > having to "wait for any transactions that have a snapshot", no?  I am
> > > > not sure adding a doc mention of other index builds really is helpful.

While that may be technically true, as a co-worker of mine likes to
point out, being "technically correct" is the worst kind of correct.

Here's what I mean:

First, I believe the docs should aim to be as useful as possible even
to those with a more entry-level understanding of PostgreSQL. The fact
that the paragraph you cite links to the entire chapter on concurrency
control in Postgres demonstrates that there's some not-so-immediate
stuff here to consider. For one: is it obvious to all users that the
transaction held by CIC - or, for that matter, any transaction - has
an open snapshot?

Second, this is a difference from a regular CREATE INDEX, and we
already call out other differences between CREATE INDEX CONCURRENTLY
and regular CREATE INDEX as caveats, as I point out below re: Alvaro's
comment.

Third, related to the above point, many DDL commands only block DDL
against the table being operated on. The fact that CIC here is
different is, in my opinion, a fairly surprising break from that
pattern, and as such likely to catch users off guard. I can attest
that this surprised at least one entire database team a while back :)
including many people who've been operating Postgres at a large scale
for a long time.

I believe caveats like this are worth calling out rather than
expecting users to have to understand the implementation details and
work out the implications on their own.
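
To make the caveat concrete, here's a rough sketch of the scenario (table
and column names are made up, and I haven't re-run this while writing):

-- session 1: a long-running concurrent build on a large table
CREATE INDEX CONCURRENTLY big_idx ON big_table (some_col);

-- session 2, started a moment later, on a completely unrelated table
CREATE INDEX CONCURRENTLY small_idx ON small_table (other_col);
-- even though small_table is tiny, this build can't finish until the
-- snapshot held by session 1's build goes away, so both return together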

> > > I always thought that create index concurrently was prevented from
> > > running concurrently in a table by the ShareUpdateExclusive lock that's
> > > held during the operation.
> >
> > You mean multiple CICs on a single table at the same time? Yes, that
> > (unfortunately) isn't possible, but I'm concerned in the patch with
> > the fact that CIC on table X blocks CIC on table Y.
>
> I think any open transaction will block CIC, which is my point.

I read Alvaro as referring to the fact that the docs already call out
the following:

> Regular index builds permit other regular index builds on the same table to 
> occur simultaneously, but only one concurrent index build can occur on a 
> table at a time.

James




Re: [DOC] Document concurrent index builds waiting on each other

2019-09-28 Thread Bruce Momjian
On Sat, Sep 28, 2019 at 09:54:48PM -0400, James Coleman wrote:
> On Sat, Sep 28, 2019 at 9:22 PM Alvaro Herrera  
> wrote:
> >
> > On 2019-Sep-28, Bruce Momjian wrote:
> >
> > > The CREATE INDEX docs already say:
> > >
> > > In a concurrent index build, the index is actually entered into
> > > the system catalogs in one transaction, then two table scans occur in
> > > two more transactions.  Before each table scan, the index build must
> > > wait for existing transactions that have modified the table to terminate.
> > > After the second scan, the index build must wait for any transactions
> > > --> that have a snapshot (see ) predating the second
> > > --> scan to terminate.  Then finally the index can be marked ready for use,
> > >
> > > So, having multiple concurrent index scans is just a special case of
> > > having to "wait for any transactions that have a snapshot", no?  I am
> > > not sure adding a doc mention of other index builds really is helpful.
> >
> > I always thought that create index concurrently was prevented from
> > running concurrently in a table by the ShareUpdateExclusive lock that's
> > held during the operation.
> 
> You mean multiple CICs on a single table at the same time? Yes, that
> (unfortunately) isn't possible, but I'm concerned in the patch with
> the fact that CIC on table X blocks CIC on table Y.

I think any open transaction will block CIC, which is my point.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [DOC] Document concurrent index builds waiting on each other

2019-09-28 Thread James Coleman
On Sat, Sep 28, 2019 at 9:22 PM Alvaro Herrera  wrote:
>
> On 2019-Sep-28, Bruce Momjian wrote:
>
> > The CREATE INDEX docs already say:
> >
> > In a concurrent index build, the index is actually entered into
> > the system catalogs in one transaction, then two table scans occur in
> > two more transactions.  Before each table scan, the index build must
> > wait for existing transactions that have modified the table to terminate.
> > After the second scan, the index build must wait for any transactions
> > --> that have a snapshot (see ) predating the second
> > --> scan to terminate.  Then finally the index can be marked ready for use,
> >
> > So, having multiple concurrent index scans is just a special case of
> > having to "wait for any transactions that have a snapshot", no?  I am
> > not sure adding a doc mention of other index builds really is helpful.
>
> I always thought that create index concurrently was prevented from
> running concurrently in a table by the ShareUpdateExclusive lock that's
> held during the operation.

You mean multiple CICs on a single table at the same time? Yes, that
(unfortunately) isn't possible, but I'm concerned in the patch with
the fact that CIC on table X blocks CIC on table Y.

James




Re: PostgreSQL12 and older versions of OpenSSL

2019-09-28 Thread Michael Paquier
On Sat, Sep 28, 2019 at 10:52:18PM +0200, Peter Eisentraut wrote:
> committed with that

Thanks, LGTM.
--
Michael


signature.asc
Description: PGP signature


Re: Wrong results using initcap() with non normalized string

2019-09-28 Thread Alvaro Herrera
On 2019-Sep-22, Juan José Santamaría Flecha wrote:

> The attached patch addresses the comment about assuming UTF8.

The UTF8 bits look reasonable to me.  I guess the other part of that
question is whether we support any other multibyte encoding that
supports combining characters.  Maybe for cases other than UTF8 we can
test for 0-width chars (using pg_encoding_dsplen() perhaps?) and drive
the upper/lower decision off that?  (For the UTF8 case, I don't know if
Juanjo's proposal is better than pg_encoding_dsplen.  Both seem to boil
down to a bsearch, though unicode_norm.c's table seems much larger than
wchar.c's).
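
For anyone who hasn't followed the original report, the class of input at
issue is a base character followed by a combining mark (sketch only, in a
UTF8 database; I haven't re-checked the exact output here):

-- 'nai' + COMBINING DIAERESIS (U+0308) + 've', i.e. a non-normalized "naïve"
SELECT initcap(E'nai\u0308ve example');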

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [DOC] Document concurrent index builds waiting on each other

2019-09-28 Thread Alvaro Herrera
On 2019-Sep-28, Bruce Momjian wrote:

> The CREATE INDEX docs already say:
> 
> In a concurrent index build, the index is actually entered into
> the system catalogs in one transaction, then two table scans occur in
> two more transactions.  Before each table scan, the index build must
> wait for existing transactions that have modified the table to terminate.
> After the second scan, the index build must wait for any transactions
> --> that have a snapshot (see ) predating the second
> --> scan to terminate.  Then finally the index can be marked ready for use,
> 
> So, having multiple concurrent index scans is just a special case of
> having to "wait for any transactions that have a snapshot", no?  I am
> not sure adding a doc mention of other index builds really is helpful.

I always thought that create index concurrently was prevented from
running concurrently in a table by the ShareUpdateExclusive lock that's
held during the operation.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Consider low startup cost in add_partial_path

2019-09-28 Thread James Coleman
On Saturday, September 28, 2019, Tomas Vondra wrote:

> On Sat, Sep 28, 2019 at 12:16:05AM -0400, Robert Haas wrote:
>
>> On Fri, Sep 27, 2019 at 2:24 PM James Coleman  wrote:
>>
>>> Over in the incremental sort patch discussion we found [1] a case
>>> where a higher cost plan ends up being chosen because a low startup
>>> cost partial path is ignored in favor of a lower total cost partial
>>> path and a limit is applied on top of that, which would normally favor
>>> the lower startup cost plan.
>>>
>>> 45be99f8cd5d606086e0a458c9c72910ba8a613d originally added
>>> `add_partial_path` with the comment:
>>>
>>> > Neither do we need to consider startup costs:
>>> > parallelism is only used for plans that will be run to completion.
>>> > Therefore, this routine is much simpler than add_path: it needs to
>>> > consider only pathkeys and total cost.
>>>
>>> I'm not entirely sure if that is still true or not--I can't easily
>>> come up with a scenario in which it's not, but I also can't come up
>>> with an inherent reason why such a scenario cannot exist.
>>>
>>
>> I think I just didn't think carefully about the Limit case.
>>
>>
> Thanks! In that case I suggest we treat it as a separate patch/fix,
> independent of the incremental sort patch. I don't want to bury it in
> that patch series, it's already pretty large.
>

Now the trick is to figure out a way to demonstrate it in a test :)

Basically we need:
Path A: Can short circuit with LIMIT but has high total cost
Path B: Can’t short circuit with LIMIT but has lower total cost

(Both must be parallel aware of course.)

Maybe ordering in B can be a sort node and A can be an index scan (perhaps
with very high random page cost?) and force choosing a parallel plan?

I’m trying to describe this to jog my thoughts (not in front of my laptop
right now so can’t try it out).

Any other ideas?
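
One possible sketch (names and cost settings are made up, and I haven't
verified the resulting plans):

SET max_parallel_workers_per_gather = 4;
SET min_parallel_table_scan_size = 0;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET random_page_cost = 1000;  -- make the ordered index scan's total cost high

-- hope: path A = parallel index scan (low startup, high total cost),
--       path B = parallel seq scan + sort (lower total cost, no short circuit)
EXPLAIN SELECT * FROM big_table ORDER BY indexed_col LIMIT 10;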

James


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-28 Thread Tomas Vondra

On Fri, Sep 27, 2019 at 01:50:30PM -0400, James Coleman wrote:

On Mon, Sep 9, 2019 at 5:55 PM Tomas Vondra wrote:

The "patched" column means all developer GUCs disabled, so it's expected
to produce the same plan as master (and it is). And then there's one
column for each developer GUC. If the column is just TRUE it means the
GUC does not affect any of the synthetic queries. There are 4 of them:

- devel_add_paths_to_grouping_rel_parallel
- devel_create_partial_grouping_paths
- devel_gather_grouping_paths
- devel_standard_join_search

The places controlled by those GUCs are either useless, or the query
affected by them is not included in the list of queries.


I'd previously found (in my reverse engineering efforts) the query:

select *
from tenk1 t1
join tenk1 t2 on t1.hundred = t2.hundred
join tenk1 t3 on t1.hundred = t3.hundred
order by t1.hundred, t1.twenty
limit 50;

can change plans to use incremental sort when
generate_useful_gather_paths() is added to standard_join_search().
Specifically, we get a merge join between t1 and t3 as the top level
(besides limit) node where the driving side of the join is a gather
merge with incremental sort. This does rely on these GUCs being set in the
test harness:

set local max_parallel_workers_per_gather=4;
set local min_parallel_table_scan_size=0;
set local parallel_tuple_cost=0;
set local parallel_setup_cost=0;

So I think we can reduce the number of unused GUCs to 3.



OK. I'll try extending the set of synthetic queries in [1] to also do
something like this and generate similar plans.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-28 Thread Tomas Vondra

On Fri, Sep 27, 2019 at 08:31:30PM -0400, James Coleman wrote:

On Fri, Sep 13, 2019 at 10:51 AM Tomas Vondra wrote:


On Thu, Sep 12, 2019 at 12:49:29PM -0300, Alvaro Herrera wrote:
>On 2019-Jul-30, Tomas Vondra wrote:
>> I've decided to do a couple of experiments, trying to make up my mind about
>> which modified places matter to different queries. But instead of trying
>> to reverse engineer the queries, I've taken a different approach - I've
>> compiled a list of queries that I think are sensible and relevant, and
>> then planned them with incremental sort enabled in different places.
>
>[...]
>
>> The list of queries (synthetic, but hopefully sufficiently realistic)
>> and a couple of scripts to collect the plans is in this repository:
>>
>>  https://github.com/tvondra/incremental-sort-tests-2
>>
>> There's also a spreadsheet with a summary of results, with a visual
>> representation of which GUCs affect which queries.
>
>OK, so we have that now.  I suppose this spreadsheet now tells us which
>places are useful and which aren't, at least for the queries that you've
>tested.  Does that mean that we want to get the patch to consider adding
>paths only in the places that your spreadsheet says are useful?  I'm not
>sure what the next steps are for this patch.
>

Yes. I think the spreadsheet can help us answer two things:

1) places actually affecting the plan (all but three do)

2) redundant places (there are some cases where two GUCs produce the
same plan in the end)


To expand on this further, (1) should probably help us to be able to
write test cases.

Additionally, one big thing we still need that's somewhat external to
the patch is a good way to benchmark: a set of queries that we believe
are representative enough to be good benchmarks.

I'd really appreciate some input from you all on that particular
question; I feel like it's in some sense the biggest barrier to
getting the patch merged, but also the part where long experience in
the community/exposure to other use cases will probably be quite
valuable.



Hmmm. I don't think anyone will hand us a set of representative queries,
so I think we have two options:

1) Generate synthetic queries covering a wide range of cases (both when
incremental sort is expected to help and not). I think the script I've
used to determine which places do matter can be the starting point.

2) Look at some established benchmarks and see if some of the queries
could benefit from the incremental sort (possibly with some changes to
indexes in the usual schema). I plan to look at TPC-H / TPC-DS, but I
wonder if some OLTP benchmarks would be relevant too.


Of course, this does assume the query set makes sense and is somewhat
realistic, but I've tried to construct queries where that is true. We
may extend it over time, of course.
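
As one concrete example of the sort of synthetic query I have in mind
(just a sketch, assuming a table with an index on the leading column):

CREATE TABLE foo (a int, b int);
CREATE INDEX ON foo (a);
INSERT INTO foo SELECT i % 100, i FROM generate_series(1, 100000) i;
ANALYZE foo;
-- the index provides ordering on (a); an incremental sort only needs to
-- sort within each group of a to satisfy ORDER BY a, b
EXPLAIN SELECT * FROM foo ORDER BY a, b LIMIT 10;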

I think we've agreed to add incremental sort paths to different places
in separate patches, to make review easier. So this may be a useful way
to decide which places to address first. I'd probably do it in this order:

- create_ordered_paths
- create_ordered_paths (parallel part)
- add_paths_to_grouping_rel
- ... not sure ...

but that's just a proposal. It'd give us most of the benefits, I think,
and we could also focus on the rest of the patch.


Certainly the first two seem like the most obviously necessary base
cases. I think supporting group bys also seems like a pretty standard
case, so at first glance I'd say this seems like a reasonable course
to me.



OK.


I'm going to start breaking up the patches in this thread into a
series in support of that. Since I've started a new thread with the
add_partial_path change, I'll include that patch here as part of this
series also. Do you think it's worth moving the tuplesort changes into
a standalone patch in the series also?



Probably. I'd do that at least for the review.


Attached is a rebased v31 now broken into the following:

- 001-consider-startup-cost-in-add-partial-path_v1.patch: From the
other thread (Tomas's patch unmodified)
- 002-incremental-sort_v31.patch: Updated base incremental sort patch

Besides rebasing, I've changed the enable_incrementalsort GUC to
prevent generating paths entirely rather than being cost-based, since
incremental sort is never absolutely necessary in the way regular sort
is.
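
In other words, with v31 the following now suppresses incremental sort
paths entirely rather than just penalizing them in costing:

SET enable_incrementalsort = off;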



OK, makes sense.


I'm hoping to add 003 soon with the initial parallel parts, but I'm
about out of time right now and wanted to get something out, so
sending this without that.

Side question: for the patch tester do I have to attach each part of
the series each time even if nothing's changed in several of them? And
does the vN number at the end need to stay the same for all of them?
My attachments to this email don't follow that... Also, since this
email changes the patch naming, do I need to do anything to clear out
the old ones? (I suppose if not, then that would imply an answer to the
first question also.)



Please always send the whole patch series. Firstly, that's the only way
how t

Re: Consider low startup cost in add_partial_path

2019-09-28 Thread Tomas Vondra

On Sat, Sep 28, 2019 at 12:16:05AM -0400, Robert Haas wrote:

On Fri, Sep 27, 2019 at 2:24 PM James Coleman  wrote:

Over in the incremental sort patch discussion we found [1] a case
where a higher cost plan ends up being chosen because a low startup
cost partial path is ignored in favor of a lower total cost partial
path and a limit is applied on top of that, which would normally favor
the lower startup cost plan.

45be99f8cd5d606086e0a458c9c72910ba8a613d originally added
`add_partial_path` with the comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily
come up with a scenario in which it's not, but I also can't come up
with an inherent reason why such a scenario cannot exist.


I think I just didn't think carefully about the Limit case.



Thanks! In that case I suggest we treat it as a separate patch/fix,
independent of the incremental sort patch. I don't want to bury it in
that patch series, it's already pretty large.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

2019-09-28 Thread Tomas Vondra

On Sun, Sep 29, 2019 at 12:12:49AM +0200, Tomas Vondra wrote:

On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote:

On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:

It's worth mentioning that those benchmarks (I'm assuming we're talking
about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong, I'll stick to my plan to get this committed soon.


That was my conclusion, as well.



I was about to commit the patch, but during the final review I've
noticed two places that I think are bugs:

1) aset.c (AllocSetDelete)
--

#ifdef CLOBBER_FREED_MEMORY
  wipe_mem(block, block->freeptr - ((char *) block));
#endif

  if (block != set->keeper)
  {
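  /* note: block->endptr below is read after wipe_mem() above has
     already clobbered the block header (when CLOBBER_FREED_MEMORY) */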
  context->mem_allocated -= block->endptr - ((char *) block);
  free(block);
  }

2) generation.c (GenerationReset)
-

#ifdef CLOBBER_FREED_MEMORY
  wipe_mem(block, block->blksize);
#endif
  context->mem_allocated -= block->blksize;


Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because Generation context is only really used in logical
decoding (and we don't delete the context I think). Not sure about (1)
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.

I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.



Oh, and one more thing - this probably needs to add at least some basic
explanation of the accounting to src/backend/utils/mmgr/README.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

2019-09-28 Thread Tomas Vondra

On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote:

On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:

It's worth mentioning that those benchmarks (I'm assuming we're talking
about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong, I'll stick to my plan to get this committed soon.


That was my conclusion, as well.



I was about to commit the patch, but during the final review I've
noticed two places that I think are bugs:

1) aset.c (AllocSetDelete)
--

#ifdef CLOBBER_FREED_MEMORY
   wipe_mem(block, block->freeptr - ((char *) block));
#endif

   if (block != set->keeper)
   {
   context->mem_allocated -= block->endptr - ((char *) block);
   free(block);
   }

2) generation.c (GenerationReset)
-

#ifdef CLOBBER_FREED_MEMORY
   wipe_mem(block, block->blksize);
#endif
   context->mem_allocated -= block->blksize;


Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because Generation context is only really used in logical
decoding (and we don't delete the context I think). Not sure about (1)
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.

I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.

Aside from that, I've repeated the REINDEX benchmarks done by Robert in
[1] with different scales on two different machines, and I've measured
no difference. Both machines are x86_64; I don't have access to any
Power machine at the moment, unfortunately.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Standby accepts recovery_target_timeline setting?

2019-09-28 Thread Peter Eisentraut
On 2019-09-28 19:45, Tom Lane wrote:
> Maybe I'm misunderstanding, but I think that rather than adding error
> checks that were not there before, the right path to fixing this is
> to cause these settings to be ignored if we're doing crash recovery.

That makes sense to me.  Something like this (untested)?

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 0daab3ff4b..25cae57131 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5618,6 +5618,13 @@ recoveryStopsBefore(XLogReaderState *record)
TimestampTz recordXtime = 0;
TransactionId recordXid;

+   /*
+* Ignore recovery target settings when not in archive recovery (meaning
+* we are in crash recovery).
+*/
+   if (!InArchiveRecovery)
+   return false;
+
/* Check if we should stop as soon as reaching consistency */
if (recoveryTarget == RECOVERY_TARGET_IMMEDIATE && reachedConsistency)
{
@@ -5759,6 +5766,13 @@ recoveryStopsAfter(XLogReaderState *record)
uint8   rmid;
TimestampTz recordXtime;

+   /*
+* Ignore recovery target settings when not in archive recovery (meaning
+* we are in crash recovery).
+*/
+   if (!InArchiveRecovery)
+   return false;
+
info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
rmid = XLogRecGetRmid(record);



-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PostgreSQL12 and older versions of OpenSSL

2019-09-28 Thread Peter Eisentraut
On 2019-09-27 16:20, Michael Paquier wrote:
> On Fri, Sep 27, 2019 at 03:50:57PM +0200, Peter Eisentraut wrote:
>> On 2019-09-27 03:51, Michael Paquier wrote:
>>> Your patch does not issue a ereport(LOG/FATAL) in the event of a
>>> failure with SSL_CTX_set_max_proto_version(), which is something done
>>> when ssl_protocol_version_to_openssl()'s result is -1.  Wouldn't it be
>>> better to report that properly to the user?
>>
>> Our SSL_CTX_set_max_proto_version() is a reimplementation of a function
>> that exists in newer versions of OpenSSL, so it has a specific error
>> behavior.  Our implementation should probably not diverge from it too much.
> 
> I agree with this point.  Now my argument is about logging LOG or
> FATAL within be_tls_init() after the two OpenSSL functions (or our
> wrappers) SSL_CTX_set_min/max_proto_version are called.

committed with that

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Usage of the system truststore for SSL certificate validation

2019-09-28 Thread Bruce Momjian
On Thu, Sep 19, 2019 at 12:26:27PM -0400, Isaac Morland wrote:
> If we're going to open this up, can we add an option to say "this key is
> allowed to log in to this account", SSH style?
> 
> I like the idea of using keys rather than .pgpass, but I like the ~/.ssh/
> authorized_keys model and don't like the "set up an entire certificate
> infrastructure" approach.

This is actually a good question --- why does ssh do it that way while
Postgres does it another way, more like a web server/client.  Maybe it
is because ssh allows the user to create one key pair and use it for
several independent servers, while Postgres assumes the client will
connect only to related servers controlled by the same CA.  With the
Postgres approach, you can change the client certificate with no changes
on the server, while with the ssh model, changing the client key
requires sending the new public key to the ssh server to be added to
~/.ssh/authorized_keys.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-28 Thread Tomas Vondra

On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:

On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra wrote:


On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
>On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut wrote:
>>
>> On 1/3/18 14:53, Tomas Vondra wrote:
>> >> I don't see the need to tie this setting to maintenance_work_mem.
>> >> maintenance_work_mem is often set to very large values, which could
>> >> then have undesirable side effects on this use.
>> >
>> > Well, we need to pick some default value, and we can either use a fixed
>> > value (not sure what would be a good default) or tie it to an existing
>> > GUC. We only really have work_mem and maintenance_work_mem, and the
>> > walsender process will never use more than one such buffer. Which seems
>> > to be closer to maintenance_work_mem.
>> >
>> > Pretty much any default value can have undesirable side effects.
>>
>> Let's just make it an independent setting unless we know any better.  We
>> don't have a lot of settings that depend on other settings, and the ones
>> we do have a very specific relationship.
>>
>> >> Moreover, the name logical_work_mem makes it sound like it's a logical
>> >> version of work_mem.  Maybe we could think of another name.
>> >
>> > I won't object to a better name, of course. Any proposals?
>>
>> logical_decoding_[work_]mem?
>>
>
>Having a separate variable for this can give more flexibility, but
>OTOH it will add one more knob which user might not have a good idea
>to set.  What are the problems we see if directly use work_mem for
>this case?
>

IMHO it's similar to autovacuum_work_mem - we have an independent
setting, but most people use it as -1 so we use maintenance_work_mem as
a default value. I think it makes sense to do the same thing here.

It does add an extra knob anyway (I don't think we should just use
maintenance_work_mem directly, the user should have an option to
override it when needed). But most users will not notice.

FWIW I don't think we should use work_mem; maintenance_work_mem seems
somewhat more appropriate here (not related to queries, etc.).



I have the same concern about using maintenance_work_mem as Peter E.,
which is that the value of maintenance_work_mem will generally be
higher, which is suitable for its current purpose, but not for the
purpose this patch is using it for.  AFAIU, at this stage we want a
better memory accounting system for logical decoding and we are not
sure what a good value for this variable is.  So, I think using
work_mem or maintenance_work_mem should serve the purpose.  Later, if
we have requirements from people to have better control over the
memory required for this purpose then we can introduce a new variable.

I understand that currently work_mem is primarily tied to memory used
for query workspaces, but it might be okay to extend it for this
purpose.  Another point is that its default sounds more appealing for
this case.  I can see the argument against it, which is that having a
separate variable makes things look cleaner and gives better control.
So, if we can't convince ourselves to use work_mem, we can introduce a
new GUC variable and keep the default as 4MB or work_mem.

I feel it is always tempting to introduce a new GUC for a different
task unless there is an exact match, but OTOH, having fewer GUCs has
its own advantage, which is that people don't have to bother about a
new setting which they need to tune, and especially one for which they
can't easily decide on a value.  I am not saying that we should not
introduce a new GUC when it is required, just that we should give it
more thought before doing so.



I do think having a separate GUC is a must, irrespective of what other
GUC (if any) is used as a default. You're right the maintenance_work_mem
value might be too high (e.g. in cases with many subscriptions), but the
same issue applies to work_mem - there's no guarantee work_mem is lower
than maintenance_work_mem, and in analytics databases it may be set very
high. So work_mem does not really solve the issue.

IMHO we can't really do without a new GUC. It's not difficult to create
examples that would benefit from small/large memory limit, depending on
the number of subscriptions etc.

I do however agree the GUC does not have to be tied to any existing one,
it was just an attempt to use a more sensible default value. I do think
m_w_m would be fine, but I can live with using an explicit value.

So that's what I did in the attached patch - I've renamed the GUC to
logical_decoding_work_mem, detached it from m_w_m and set the default to
64MB (i.e. the same default as m_w_m). It should also fix all the issues
from the recent reviews (at least I believe so).

I've realized that one of the subsequent patches allows overriding the
limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
I think it'd be good to move this bit forward, but I think it can be
done in a separate patch.


regards

--
Tomas Vondra  http://www.2ndQuadr

Re: Standby accepts recovery_target_timeline setting?

2019-09-28 Thread David Steele
On 9/28/19 1:26 PM, Fujii Masao wrote:
> On Sun, Sep 29, 2019 at 12:51 AM David Steele  wrote:
> 
> Yeah, more checks would be necessary. IMO an easy fix is to forbid not only
> recovery target parameters but also any recovery parameters (specified
> in recovery.conf in previous versions) in crash recovery.
> 
> In v11 or before, any parameters in recovery.conf cannot take effect in
> crash recovery because crash recovery always starts without recovery.conf.
> But in v12, those parameters are specified in postgresql.conf,
> so they may take effect even in crash recovery (i.e., when both
> recovery.signal and standby.signal are missing). This would be the root
> cause of the problems that we are discussing, I think.
> 
> There might be some recovery parameters that we can safely use
> even in crash recovery, e.g., maybe recovery_end_command
> (now, you can see that recovery_end_command is executed in
> crash recovery in v12). But at this stage of v12, it's worth considering
> just causing crash recovery to exit with an error when any recovery
> parameter is set. Thoughts?

I dislike the idea of crash recovery throwing fatal errors because there
are recovery settings in postgresql.auto.conf.  Since there is no
defined mechanism for cleaning out old recovery settings we have to
assume that they will persist (and accumulate) more or less forever.

> Or if that change is overkill, alternatively we can make crash recovery
> "ignore" any recovery parameters, e.g., by forcibly disabling
> the parameters.

I'd rather load recovery settings *only* if recovery.signal or
standby.signal is present and do this only after crash recovery is
complete, i.e. in the absence of backup_label.

I think blindly loading recovery settings then trying to ignore them
later is pretty much why we are having these issues in the first place.
 I'd rather not extend that pattern if possible.

Regards,
-- 
-David
da...@pgmasters.net




Re: Standby accepts recovery_target_timeline setting?

2019-09-28 Thread Tom Lane
Fujii Masao  writes:
>> Agreed.  Seems like that could be added to the patch above easily
>> enough.  More checks would be needed to prevent the behaviors I've been
>> seeing in the other thread, but it should be possible to more or less
>> mimic the old behavior with sufficient checks.

> Yeah, more checks would be necessary. IMO easy fix is to forbid not only
> recovery target parameters but also any recovery parameters (specified
> in recovery.conf in previous versions) in crash recovery.

> In v11 or before, any parameters in recovery.conf cannot take effect in
> crash recovery because crash recovery always starts without recovery.conf.
> But in v12, those parameters are specified in postgresql.conf,
> so they may take effect even in crash recovery (i.e., when both
> recovery.signal and standby.signal are missing). This would be the root
> cause of the problems that we are discussing, I think.

So ... what I'm wondering about here is what happens during *actual* crash
recovery, eg a postmaster-driven restart of the startup process after
a backend crash in hot standby.  The direction you guys are going in
seems likely to cause the startup process to refuse to function until
those parameters are removed from postgresql.conf, which seems quite
user-unfriendly.

Maybe I'm misunderstanding, but I think that rather than adding error
checks that were not there before, the right path to fixing this is
to cause these settings to be ignored if we're doing crash recovery.
Not make the user take them out (and possibly later put them back).

regards, tom lane




Re: Unstable select_parallel regression output in 12rc1

2019-09-28 Thread Tom Lane
Christoph Berg  writes:
> Re: Tom Lane 2019-09-26 <12685.1569510...@sss.pgh.pa.us>
>> We haven't seen it in quite some time in HEAD, though I fear that's
>> just due to bad luck or change of timing of unrelated tests.

> The v13 package builds that are running every 6h here haven't seen a
> problem yet either, so the probability of triggering it seems very
> low. So it's not a pressing problem.

I've pushed some changes to try to ameliorate the issue.

> (There are some extension modules
> where the test suite fails at a much higher rate; getting all targets
> to pass at the same time is next to impossible there :(. )

I feel your pain, believe me.  Used to fight the same kind of problems
when I was at Red Hat.  Are any of those extension modules part of
Postgres?

regards, tom lane




Re: max_parallel_workers question

2019-09-28 Thread Jeff Davis
On Sat, 2019-09-28 at 00:10 -0400, Robert Haas wrote:
> I intended it to mean "the entire cluster." Basically, how many
> workers out of max_worker_processes are you willing to use for
> parallel query, as opposed to other things. I agree that PGC_USERSET
> doesn't make any sense.

In that case, PGC_SIGHUP seems most appropriate.

It also might make more sense to rename it to reserved_worker_processes
and invert the meaning. To me, that would make it clearer that it's
designed to prevent parallel query from interfering with other uses of
worker processes.
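
For context, a rough sketch of how the existing knobs interact; the
defaults shown are from memory, so treat them as an assumption:

SHOW max_worker_processes;        -- total background worker slots (default 8)
SHOW max_parallel_workers;        -- portion usable by parallel query (default 8)
ALTER SYSTEM SET max_parallel_workers = 4;  -- leave 4 slots for other worker uses
SELECT pg_reload_conf();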

Another option would be to make it two pools, one for parallel workers
and one for everything else, and each one would be controlled by a
PGC_POSTMASTER setting. But it seems like some thought went into trying
to share the pool of workers[1], so I assume there was a good reason
you wanted to do that.

Regards,
Jeff Davis

[1] If I'm reading correctly, it uses both lock-free code and
intentional overflow.






Re: Instability of partition_prune regression test results

2019-09-28 Thread Tom Lane
Amit Langote  writes:
> On Sat, Sep 28, 2019 at 12:59 AM Tom Lane  wrote:
>> Amit Langote  writes:
>>> Isn't the point of using ANALYZE here to show that the exec-param
>>> based run-time pruning is working (those "never executed" strings)?

>> Hm.  Well, if you want to see those, we could do it as attached.

> Perfect, thanks.

OK, pushed that way.

regards, tom lane




Re: Standby accepts recovery_target_timeline setting?

2019-09-28 Thread Fujii Masao
On Sun, Sep 29, 2019 at 12:51 AM David Steele  wrote:
>
> On 9/28/19 10:54 AM, Fujii Masao wrote:
> > On Sat, Sep 28, 2019 at 2:01 AM David Steele  wrote:
> >> On 9/27/19 11:58 AM, Fujii Masao wrote:
> >>>
> >>> Yes, recovery target settings are used even when neither backup_label
> >>> nor recovery.signal exist, i.e., just a crash recovery, in v12. This is
> >>> completely different behavior from prior versions.
> >>
> >> I'm not able to reproduce this.  I only see recovery settings being used
> >> if backup_label, recovery.signal, or standby.signal is present.
> >>
> >> Do you have an example?
> >
> > Yes, here is the example:
> >
> > initdb -D data
> > pg_ctl -D data start
> > psql -c "select pg_create_restore_point('hoge')"
> > psql -c "alter system set recovery_target_name to 'hoge'"
> > psql -c "create table test as select num from generate_series(1, 100) num"
> > pg_ctl -D data -m i stop
> > pg_ctl -D data start
> >
> > After restarting the server at the above final step, you will see
> > the following log messages indicating that recovery stops at
> > recovery_target_name.
> >
> > 2019-09-28 22:42:04.849 JST [16944] LOG:  recovery stopping at restore
> > point "hoge", time 2019-09-28 22:42:03.86558+09
> > 2019-09-28 22:42:04.849 JST [16944] FATAL:  requested recovery stop
> > point is before consistent recovery point
>
> That's definitely not good behavior.
>
> >>> IMO, since v12 is RC1 now, it's not a good idea to change to the new logic.
> >>> So at least for v12, we basically should change the recovery logic so that
> >>> it behaves in the same way as prior versions. That is,
> >>>
> >>> - Stop the recovery with an error if any recovery target is set in
> >>>crash recovery
> >>
> >> This seems reasonable.  I tried adding a recovery.signal and
> >> restore_command for crash recovery and I just got an error that it
> >> couldn't find 0002.history in the archive.
> >
> > You added recovery.signal, so it means that you started an archive recovery,
> > not crash recovery. Right?
>
> Correct.
>
> > Anyway I'm thinking to apply something like attached patch, to emit an error
> > if recovery target is set in crash recovery.
>
> The patch looks reasonable.
>
> >>> - Do not enter an archive recovery mode if recovery.signal is missing
> >>
> >> Agreed.  Perhaps it's OK to use restore_command if a backup_label is
> >> present
> >
> > Yeah, it's maybe OK, but it's different behavior from the current version.
> > So, at least for v12, I'm inclined to prevent crash recovery with backup_label
> > from using restore_command, i.e., only WAL files in pg_wal will be replayed
> > in this case.
>
> Agreed.  Seems like that could be added to the patch above easily
> enough.  More checks would be needed to prevent the behaviors I've been
> seeing in the other thread, but it should be possible to more or less
> mimic the old behavior with sufficient checks.

Yeah, more checks would be necessary. IMO an easy fix is to forbid not only
recovery target parameters but also any recovery parameters (specified
in recovery.conf in previous versions) in crash recovery.

In v11 or before, any parameters in recovery.conf cannot take effect in
crash recovery because crash recovery always starts without recovery.conf.
But in v12, those parameters are specified in postgresql.conf,
so they may take effect even in crash recovery (i.e., when both
recovery.signal and standby.signal are missing). This would be the root
cause of the problems that we are discussing, I think.

There might be some recovery parameters that we can safely use
even in crash recovery, e.g., maybe recovery_end_command
(now, you can see that recovery_end_command is executed in
crash recovery in v12). But at this stage of v12, it's worth considering
just causing crash recovery to exit with an error when any recovery
parameter is set. Thoughts?

Or if that change is overkill, alternatively we can make crash recovery
"ignore" any recovery parameters, e.g., by forcibly disabling
the parameters.

Regards,

-- 
Fujii Masao




Re: Document recovery_target_action behavior?

2019-09-28 Thread Jonathan S. Katz
On 9/28/19 12:00 PM, David Steele wrote:
> On 9/28/19 11:14 AM, Fujii Masao wrote:
>> On Sat, Sep 28, 2019 at 2:52 AM David Steele  wrote:
>>
>>> The question for the old versions: is this something that should be
>>> fixed in the code or in the documentation?
>>>
>>> My vote is to make this explicit in the documentation, since changing
>>> the recovery behavior in old versions could lead to nasty surprises.
>>
>> +1 to update the documentation.

FYI, documentation to compare, PG11:

https://www.postgresql.org/docs/11/recovery-target-settings.html#RECOVERY-TARGET-ACTION

PG12:

https://www.postgresql.org/docs/12/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY-TARGET

After reading through, yes, I agree that +1 we should modify the
documentation.

And +1 for not modifying the behavior in the supported PG < 12 versions,
that could certainly catch people by surprise.

> 
> OK, I'll put that on my list for after GA.  This has been the behavior
> since 9.1 so it hardly seems like an emergency.
> 
> The behavior change in 12 may be a surprise for users, though, perhaps
> we should add something to the Streaming Replication and Recovery
> changes section in the release notes?
> 
> Looping in Jonathan to see if he thinks that's a good idea.

I would suggest we add a bullet to the "E.1.2 Migration to Version
12"[1] section, as one could see this behavior change as being
"incompatible" with older versions. Setting aside the "recovery.conf"
file change, if you did not specify your "recovery_target_action" but
expected your instance to be available (albeit paused), you may be in
for a surprise, especially if you have things automated.

I don't know if I would put it in the "E.1.3.2" section though, but I
could be convinced either way.

Do you have some suggested wording? I could attempt to cobble together a
quick patch.

Thanks,

Jonathan

[1] https://www.postgresql.org/docs/12/release-12.html



signature.asc
Description: OpenPGP digital signature


Re: [DOC] Document concurrent index builds waiting on each other

2019-09-28 Thread Bruce Momjian
On Wed, Sep 18, 2019 at 01:51:00PM -0400, James Coleman wrote:
> In my experience it's not immediately obvious (even after reading the
> documentation) the implications of how concurrent index builds manage
> transactions with respect to multiple concurrent index builds in
> flight at the same time.
> 
> Specifically, as I understand it, multiple concurrent index builds
> running at the same time will all return at the same time as the
> longest running one.
> 
> I've attached a small patch to call this caveat out specifically in
> the documentation. I think the description in the patch is accurate,
> but please let me know if there's some intricacies around how the
> various stages might change the results.

The CREATE INDEX docs already say:

In a concurrent index build, the index is actually entered into
the system catalogs in one transaction, then two table scans occur in
two more transactions.  Before each table scan, the index build must
wait for existing transactions that have modified the table to terminate.
After the second scan, the index build must wait for any transactions
--> that have a snapshot (see ) predating the second
--> scan to terminate.  Then finally the index can be marked ready for use,

So, having multiple concurrent index scans is just a special case of
having to "wait for any transactions that have a snapshot", no?  I am
not sure adding a doc mention of other index builds really is helpful.

---

> commit 9e28e704820eebb81ff94c1c3cbfb7db087b2c45
> Author: James Coleman 
> Date:   Wed Sep 18 13:36:22 2019 -0400
> 
> Document concurrent indexes waiting on each other
> 
> It's not immediately obvious that because concurrent index building
> waits on previously running transactions to complete, running multiple
> concurrent index builds at the same time will result in each of them
> taking as long to return as the longest takes, so, document this caveat.
> 
> diff --git a/doc/src/sgml/ref/create_index.sgml 
> b/doc/src/sgml/ref/create_index.sgml
> index 629a31ef79..35f15abb0e 100644
> --- a/doc/src/sgml/ref/create_index.sgml
> +++ b/doc/src/sgml/ref/create_index.sgml
> @@ -616,6 +616,13 @@ Indexes:
>  cannot.
> 
>  
> +   
> +Because the second table scan must wait for any transactions having a
> +snapshot preceding the start of that scan to finish before completing the
> +scan, concurrent index builds on multiple tables at the same time will
> +not return on any one table until all have completed.
> +   
> +
> 
>  Concurrent builds for indexes on partitioned tables are currently not
>  supported.  However, you may concurrently build the index on each


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Document recovery_target_action behavior?

2019-09-28 Thread David Steele
On 9/28/19 11:14 AM, Fujii Masao wrote:
> On Sat, Sep 28, 2019 at 2:52 AM David Steele  wrote:
> 
>> The question for the old versions: is this something that should be
>> fixed in the code or in the documentation?
>>
>> My vote is to make this explicit in the documentation, since changing
>> the recovery behavior in old versions could lead to nasty surprises.
> 
> +1 to update the documentation.

OK, I'll put that on my list for after GA.  This has been the behavior
since 9.1 so it hardly seems like an emergency.

The behavior change in 12 may be a surprise for users, though, perhaps
we should add something to the Streaming Replication and Recovery
changes section in the release notes?

Looping in Jonathan to see if he thinks that's a good idea.

Regards,
-- 
-David
da...@pgmasters.net




Re: Standby accepts recovery_target_timeline setting?

2019-09-28 Thread David Steele
On 9/28/19 10:54 AM, Fujii Masao wrote:
> On Sat, Sep 28, 2019 at 2:01 AM David Steele  wrote:
>> On 9/27/19 11:58 AM, Fujii Masao wrote:
>>>
>>> Yes, recovery target settings are used even when neither backup_label
>>> nor recovery.signal exist, i.e., just a crash recovery, in v12. This is
>>> completely different behavior from prior versions.
>>
>> I'm not able to reproduce this.  I only see recovery settings being used
>> if backup_label, recovery.signal, or standby.signal is present.
>>
>> Do you have an example?
> 
> Yes, here is the example:
> 
> initdb -D data
> pg_ctl -D data start
> psql -c "select pg_create_restore_point('hoge')"
> psql -c "alter system set recovery_target_name to 'hoge'"
> psql -c "create table test as select num from generate_series(1, 100) num"
> pg_ctl -D data -m i stop
> pg_ctl -D data start
> 
> After restarting the server at the above final step, you will see
> the following log messages indicating that recovery stops at
> recovery_target_name.
> 
> 2019-09-28 22:42:04.849 JST [16944] LOG:  recovery stopping at restore
> point "hoge", time 2019-09-28 22:42:03.86558+09
> 2019-09-28 22:42:04.849 JST [16944] FATAL:  requested recovery stop
> point is before consistent recovery point

That's definitely not good behavior.

> >> IMO, since v12 is RC1 now, it's not a good idea to change to the new logic.
>>> So at least for v12, we basically should change the recovery logic so that
>>> it behaves in the same way as prior versions. That is,
>>>
>>> - Stop the recovery with an error if any recovery target is set in
>>>crash recovery
>>
>> This seems reasonable.  I tried adding a recovery.signal and
>> restore_command for crash recovery and I just got an error that it
>> couldn't find 0002.history in the archive.
> 
> You added recovery.signal, so it means that you started an archive recovery,
> not crash recovery. Right?

Correct.

> Anyway I'm thinking to apply something like attached patch, to emit an error
> if recovery target is set in crash recovery.

The patch looks reasonable.

>>> - Do not enter an archive recovery mode if recovery.signal is missing
>>
>> Agreed.  Perhaps it's OK to use restore_command if a backup_label is
>> present
> 
> > Yeah, it's maybe OK, but it's different behavior from the current version.
> So, at least for v12, I'm inclined to prevent crash recovery with backup_label
> from using restore_command, i.e., only WAL files in pg_wal will be replayed
> in this case.

Agreed.  Seems like that could be added to the patch above easily
enough.  More checks would be needed to prevent the behaviors I've been
seeing in the other thread, but it should be possible to more or less
mimic the old behavior with sufficient checks.

Regards,
-- 
-David
da...@pgmasters.net




default partitions can be partitioned and have default partitions?

2019-09-28 Thread Justin Pryzby
postgres=# CREATE TABLE t(i int)PARTITION BY RANGE(i);
CREATE TABLE
postgres=# CREATE TABLE t0 PARTITION OF t DEFAULT PARTITION BY RANGE(i);
CREATE TABLE
postgres=# CREATE TABLE t00 PARTITION OF t0 DEFAULT; -- oh yes
CREATE TABLE
...

Not sure how it could be useful to partition a default partition into
subpartitions by list, range, or hash.
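
For what it's worth, tuple routing does fall all the way through the
nested defaults; I'd expect (untested here) something like:

INSERT INTO t VALUES (1);
SELECT tableoid::regclass AS part, i FROM t;
--  part | i
--  t00  | 1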

Justin




Re: Document recovery_target_action behavior?

2019-09-28 Thread Fujii Masao
On Sat, Sep 28, 2019 at 2:52 AM David Steele  wrote:
>
> Hackers,
>
> In versions < PG12 recovery_target_action has a behavior that appears to
> be a bug, or is at least undocumented.  If hot_standby = off and
> recovery_target_action is not specified then the cluster will promote
> when the target is found rather than shutting down as the documentation
> seems to indicate.  If recovery_target_action is explicitly set to pause
> then the cluster will shutdown as expected.

Good catch!

> In PG12 the shutdown occurs even when recovery_target_action is not
> explicitly set.  This seems like good behavior and it matches the
> documentation as I read it.

Agreed.

> The question for the old versions: is this something that should be
> fixed in the code or in the documentation?
>
> My vote is to make this explicit in the documentation, since changing
> the recovery behavior in old versions could lead to nasty surprises.

+1 to update the documentation.

Regards,

-- 
Fujii Masao




Re: Standby accepts recovery_target_timeline setting?

2019-09-28 Thread Fujii Masao
On Sat, Sep 28, 2019 at 2:01 AM David Steele  wrote:
>
> On 9/27/19 11:58 AM, Fujii Masao wrote:
> > On Sat, Sep 28, 2019 at 12:14 AM David Steele  wrote:
> >>
> >> I think, at the very least, the fact that targeted recovery proceeds in
> >> the absence of recovery.signal represents a bug.
> >
> > Yes, recovery target settings are used even when neither backup_label
> > nor recovery.signal exist, i.e., just a crash recovery, in v12. This is
> > completely different behavior from prior versions.
>
> I'm not able to reproduce this.  I only see recovery settings being used
> if backup_label, recovery.signal, or standby.signal is present.
>
> Do you have an example?

Yes, here is the example:

initdb -D data
pg_ctl -D data start
psql -c "select pg_create_restore_point('hoge')"
psql -c "alter system set recovery_target_name to 'hoge'"
psql -c "create table test as select num from generate_series(1, 100) num"
pg_ctl -D data -m i stop
pg_ctl -D data start

After restarting the server at the above final step, you will see
the following log messages indicating that recovery stops at
recovery_target_name.

2019-09-28 22:42:04.849 JST [16944] LOG:  recovery stopping at restore
point "hoge", time 2019-09-28 22:42:03.86558+09
2019-09-28 22:42:04.849 JST [16944] FATAL:  requested recovery stop
point is before consistent recovery point

> > IMO, since v12 is RC1 now, it's not a good idea to switch to the new logic.
> > So, at least for v12, we should basically change the recovery logic so that
> > it behaves in the same way as prior versions. That is,
> >
> > - Stop the recovery with an error if any recovery target is set in
> >crash recovery
>
> This seems reasonable.  I tried adding a recovery.signal and
> restore_command for crash recovery and I just got an error that it
> couldn't find 0002.history in the archive.

You added recovery.signal, so it means that you started an archive recovery,
not crash recovery. Right?

Anyway, I'm thinking of applying something like the attached patch, to emit
an error if a recovery target is set during crash recovery.
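
To illustrate, the check could be roughly of this shape (just a sketch, not
the attached patch; the variable names follow xlog.c, and the real placement
and message wording may well differ):

	/*
	 * Sketch only: refuse a configured recovery target when neither
	 * recovery.signal nor standby.signal requested archive recovery,
	 * i.e. when we are doing plain crash recovery.
	 */
	if (!ArchiveRecoveryRequested &&
		recoveryTarget != RECOVERY_TARGET_UNSET)
		ereport(FATAL,
				(errmsg("recovery target is set but only crash recovery was requested"),
				 errhint("Create recovery.signal to perform archive recovery, or remove the recovery target settings.")));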

> > - Use recovery target settings if set even when standby mode
>
> Yes, this is weird, but it is present in current versions.

Yes, and some users might be using this current behavior.
If we keep this behavior as it is in v12, the documentation
needs to be corrected.

> > - Do not enter an archive recovery mode if recovery.signal is missing
>
> Agreed.  Perhaps it's OK to use restore_command if a backup_label is
> present

Yeah, it's maybe OK, but that's different behavior from the current version.
So, at least for v12, I'm inclined to prevent crash recovery with backup_label
from using restore_command, i.e., only WAL files in pg_wal will be replayed
in this case.

Regards,

-- 
Fujii Masao


error-if-recovery-taget-set-in-crash-recovery.patch
Description: Binary data


Re: Hooks for session start and end, take two

2019-09-28 Thread Fabrízio de Royes Mello
On Fri, Sep 27, 2019 at 4:26 PM legrand legrand 
wrote:
>
> OK I confirm:
> - "client backend" appears at session start and end hook,
> - "autovacuum worker" and "pg_background" only appears at session end hook
>   (backend_start can be retreived from pg_stat_activity),
> - "parallel workers" are not visible at all
>   (because extension cannot assign XIDs during a parallel operation)
>
> All seems fine to me.
>

Hi all,

First of all, thanks Michael for bringing this to life again.

I poked at the patch a little and everything is OK. Your check for a normal
backend in test_session_hooks is much simpler than what I did before:

+/* just consider normal backends */
+if (MyBackendId == InvalidBackendId)
+return;

But one thing came to mind: why not, in this first version, hook just
normal backends?
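
For example, assuming the patch exposes a hook pointer named
session_start_hook of type void (*)(void) (the name is taken from the test
module and may differ in the final patch), restricting it to regular client
backends could look roughly like this:

/* Sketch only; the hook name and the filtering condition are assumptions. */
#include "postgres.h"
#include "libpq/libpq-be.h"
#include "miscadmin.h"
#include "storage/backendid.h"

static void
my_session_start_hook(void)
{
	/* Skip background processes: no backend id or no client connection. */
	if (MyBackendId == InvalidBackendId || MyProcPort == NULL)
		return;

	elog(LOG, "session started for database \"%s\"",
		 MyProcPort->database_name);
}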

Regards,

--
   Fabrízio de Royes Mello   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support and Training


Re: Attempt to consolidate reading of XLOG page

2019-09-28 Thread Antonin Houska
Antonin Houska  wrote:

> Alvaro Herrera  wrote:
> 
> > BTW that tli_p business to the openSegment callback is horribly
> > inconsistent.  Some callers accept a NULL tli_p, others will outright
> > crash, even though the API docs say that the callback must determine the
> > timeline.  This is made more complicated by us having the TLI in "seg"
> > also.  Unless I misread, the problem is again that the walsender code is
> > doing nasty stuff with globals (endSegNo).  As a very minor stylistic
> > point, we prefer to have out params at the end of the signature.
> 
> XLogRead() tests for NULL so it should not crash but I don't insist on doing
> it this way. XLogRead() actually does not have to care whether the "open
> segment callback" determines the TLI or not, so it (XLogRead) can always
> receive a valid pointer to seg.ws_tli.

This is actually wrong: seg.ws_tli is not always the correct value to
pass. If seg.ws_tli refers to the segment from which data was read last time,
then XLogRead() still needs a separate argument to specify which TLI the
current call should read from. If the two differ, a new file needs to be opened.

The problem with walsender.c is that its implementation of XLogRead() does not
care about the TLI of the previous read. If the behavior of the new, generic
implementation is to be exactly the same, we need to tell XLogRead() that in
some cases it should not compare the current TLI to the previous one. That's
why I tried to use the NULL pointer, or InvalidTimeLineID, earlier.
Another approach is to add a boolean argument "check_tli", but that still
forces the caller to pass some (arbitrary) TLI value. The InvalidTimeLineID
concept seems less disturbing to me than that.
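
To make that concrete, the decision inside XLogRead() would be roughly of
this shape (a fragment-style sketch assuming a check_tli flag; the field
names follow the WALOpenSegment/WALSegmentContext structs from the patch,
and the callback invocation is only illustrative):

	/* Sketch: can the currently open segment be reused for this read? */
	if (seg->ws_file < 0 ||
		!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
		(check_tli && tli != seg->ws_tli))
	{
		XLogSegNo	nextSegNo;

		if (seg->ws_file >= 0)
			close(seg->ws_file);

		XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);

		/*
		 * The caller-supplied callback opens the segment and, when it is
		 * responsible for it, resolves the timeline; the exact signature
		 * here is illustrative only.
		 */
		openSegment(nextSegNo, segcxt, &tli);
	}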

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Modest proposal for making bpchar less inconsistent

2019-09-28 Thread Bruce Momjian
On Fri, Sep 13, 2019 at 09:50:10PM +0200, Pavel Stehule wrote:
> 
> 
> Dne pá 13. 9. 2019 16:43 uživatel Tom Lane  napsal:
> 
> It struck me that the real reason that we keep getting gripes about
> the weird behavior of CHAR(n) is that these functions (and, hence,
> their corresponding operators) fail to obey the "trailing blanks
> aren't significant" rule:
> 
>                regprocedure                |        prosrc       
> ---+--
>  bpcharlike(character,text)                | textlike
>  bpcharnlike(character,text)               | textnlike
>  bpcharicregexeq(character,text)           | texticregexeq
>  bpcharicregexne(character,text)           | texticregexne
>  bpcharregexeq(character,text)             | textregexeq
>  bpcharregexne(character,text)             | textregexne
>  bpchariclike(character,text)              | texticlike
>  bpcharicnlike(character,text)             | texticnlike
> 
> They're just relying on binary compatibility of bpchar to text ...
> but of course textlike etc. think trailing blanks are significant.
> 
> Every other primitive operation we have for bpchar correctly ignores
> the trailing spaces.
> 
> We could fix this, and save some catalog space too, if we simply
> deleted these functions/operators and let such calls devolve
> into implicit casts to text.
> 
> This might annoy people who are actually writing trailing spaces
> in their patterns to make such cases work.  But I think there
> are probably not too many such people, and having real consistency
> here is worth something.
> 
> 
> Makes sense.

Yes, I think this is a great idea!

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-28 Thread Amit Kapila
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
 wrote:
>
> Hi,
>
> Attached is an updated patch series, rebased on current master. It does
> fix one memory accounting bug in ReorderBufferToastReplace (the code was
> not properly updating the amount of memory).
>

A few comments on 0001:
1.
I am getting the linking error below in pgoutput when compiling the patch
on my Windows system:
pgoutput.obj : error LNK2001: unresolved external symbol _logical_work_mem

You need to use PGDLLIMPORT for logical_work_mem.
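
Presumably something along these lines in the header that declares it
(assuming it is a plain int GUC, as the patch suggests):

extern PGDLLIMPORT int logical_work_mem;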

2. After I fixed the above and tried a basic test, it failed with the
call stack below:
  postgres.exe!ExceptionalCondition(const char *
conditionName=0x00d92854, const char * errorType=0x00d928bc, const
char * fileName=0x00d92e60,
int lineNumber=2148)  Line 55
  postgres.exe!ReorderBufferChangeMemoryUpdate(ReorderBuffer *
rb=0x02693390, ReorderBufferChange * change=0x0269dd38, bool
addition=true)  Line 2148
  postgres.exe!ReorderBufferQueueChange(ReorderBuffer * rb=0x02693390,
unsigned int xid=525, unsigned __int64 lsn=36083720,
ReorderBufferChange
* change=0x0269dd38)  Line 635
  postgres.exe!DecodeInsert(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718)  Line 716 + 0x24 bytes C
  postgres.exe!DecodeHeapOp(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718)  Line 437 + 0xd bytes C
  postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext *
ctx=0x0268ef80, XLogReaderState * record=0x0268f228)  Line 129
  postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData
* fcinfo=0x02688680, bool confirm=true, bool binary=false)  Line 307
  postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData *
fcinfo=0x02688680)  Line 376

Basically, the assert you added in ReorderBufferChangeMemoryUpdate
failed.  I explored a bit, and it seems that you have missed
assigning a value to txn, a new field added by this patch to the
ReorderBufferChange structure:
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
  /* The type of change. */
  enum ReorderBufferChangeType action;

+ /* Transaction this change belongs to. */
+ struct ReorderBufferTXN *txn;
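
Something along these lines in ReorderBufferQueueChange() would presumably
fix it (a sketch of the suggestion, not a tested change; the surrounding
lines follow current master and may differ slightly in the patched code):

	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

	change->lsn = lsn;
	change->txn = txn;			/* the missing back-pointer assignment */

	Assert(InvalidXLogRecPtr != lsn);
	dlist_push_tail(&txn->changes, &change->node);
	txn->nentries++;
	txn->nentries_mem++;

	ReorderBufferChangeMemoryUpdate(rb, change, true);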


3.
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION subscription_name
 

+
+   <varlistentry>
+    <term><literal>work_mem</literal> (<type>integer</type>)</term>
+    <listitem>
+     <para>
+      Limits the amount of memory used to decode changes on the
+      publisher.  If not specified, the publisher will use the default
+      specified by <varname>logical_work_mem</varname>.
+     </para>
+    </listitem>
+   </varlistentry>

I don't see any explanation of how this will be useful.  How can a
subscriber predict the amount of memory required by the publisher for
decoding?  This is all the more unpredictable because, when changes are
initially recorded in the ReorderBuffer, they aren't yet filtered for any
particular publication.  Do we really need this?  Giving the user more
knobs is helpful when they have some way of knowing how to use them, and
in this case it is not clear that the user ever could.

4. Can we somehow expose the memory consumed by the ReorderBuffer?  If
so, we might be able to write some tests covering the new functionality.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pglz performance

2019-09-28 Thread Andrey Borodin
Oleg, Peter, thanks for looking into this!

I hope to benchmark decompression on the Silesia corpus soon.

PFA v2 with better comments.

> On 27 Sept 2019, at 14:41, Peter Eisentraut
> wrote:
> 
> After reviewing this thread and more testing, I think
> 0001-Use-memcpy-in-pglz-decompression.patch appears to be a solid win
> and we should move ahead with it.
> 
> I don't, however, fully understand the code changes, and I think this
> could use more and better comments.  In particular, I wonder about
> 
> off *= 2;
I've changed this to
off += off;

> 
> This is new logic that isn't explained anywhere.
> 
> This whole function is commented a bit strangely.  It begins with
> "Otherwise", but there is nothing before it.  And what does "from OUTPUT
> to OUTPUT" mean?  There is no "output" variable.  We should make this
> match the code better.


I've added a small example to illustrate what is going on.
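
For the archives, here is also a standalone illustration of the idea (my own
sketch, not the patch code): with an overlapping back-reference the
destination overlaps its source, so we copy "off" bytes at a time and then
double the chunk, because each copy makes twice as much valid source data
available.

#include <string.h>

/*
 * Expand an overlapping back-reference: write "len" bytes at "dst", where
 * each output byte equals the byte "off" positions earlier.  Sketch only;
 * the real pglz code also handles length encoding and buffer boundaries.
 */
static void
copy_overlapping(unsigned char *dst, int off, int len)
{
	unsigned char *src = dst - off;	/* points into already-written output */

	while (off < len)
	{
		memcpy(dst, src, off);	/* source and destination never overlap here */
		dst += off;
		len -= off;
		off += off;				/* the copy doubled the valid source region */
	}
	memcpy(dst, src, len);		/* tail: len <= off, still non-overlapping */
}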

Thanks!

--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud


v2-0001-Use-memcpy-in-pglz-decompression.patch
Description: Binary data


Re: Attempt to consolidate reading of XLOG page

2019-09-28 Thread Antonin Houska
Tom Lane  wrote:

> Alvaro Herrera  writes:
> > On 2019-Sep-27, Antonin Houska wrote:
> >>> You placed the errinfo in XLogRead's stack rather than its callers' ...
> >>> I don't think that works, because as soon as XLogRead returns that
> >>> memory is no longer guaranteed to exist.
> 
> >> I was aware of this problem, therefore I defined the field as static:
> >> 
> >> +XLogReadError *
> >> +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
> >> +WALOpenSegment *seg, WALSegmentContext *segcxt,
> >> +WALSegmentOpen openSegment)
> >> +{
> >> +   char   *p;
> >> +   XLogRecPtr  recptr;
> >> +   Sizenbytes;
> >> +   static XLogReadError errinfo;
> 
> > I see.
> 
> That seems like an absolutely terrible "fix".  We don't really want
> XLogRead to be defined in a way that forces it to be non-reentrant do we?

Good point. I forgot that the XLOG reader can be used by frontends, so thread
safety is important here.
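
The reentrant alternative would be to let the caller own the error
information, roughly like this (purely illustrative; the field names below
are invented, not taken from the patch):

typedef struct XLogReadError
{
	int			read_errno;		/* errno captured at the failure */
	XLogRecPtr	failed_ptr;		/* WAL position of the failed read */
} XLogReadError;

/* Caller allocates errinfo (on its own stack); no static state needed. */
extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
					 XLogReadError *errinfo);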

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Attempt to consolidate reading of XLOG page

2019-09-28 Thread Antonin Houska
Alvaro Herrera  wrote:

> BTW that tli_p business to the openSegment callback is horribly
> inconsistent.  Some callers accept a NULL tli_p, others will outright
> crash, even though the API docs say that the callback must determine the
> timeline.  This is made more complicated by us having the TLI in "seg"
> also.  Unless I misread, the problem is again that the walsender code is
> doing nasty stuff with globals (endSegNo).  As a very minor stylistic
> point, we prefer to have out params at the end of the signature.

XLogRead() tests for NULL so it should not crash but I don't insist on doing
it this way. XLogRead() actually does not have to care whether the "open
segment callback" determines the TLI or not, so it (XLogRead) can always
receive a valid pointer to seg.ws_tli. However that in turn implies that
XLogRead() does not need the "tli" argument at all.

> > > Why do we leave this responsibility to ReadPageInternal?  Wouldn't it
> > > make more sense to leave XLogRead be always responsible for setting
> > > these correctly, and remove those lines from ReadPageInternal?
> > 
> > I think there's no rule that ReadPageInternal() must use XLogRead(). If we
> > do what you suggest, we need to document this responsibility. I'll consider
> > that.

I now think we should not add any responsibility to XLogPageReadCB or its
subroutines, because some extensions might already have their own
implementation of XLogPageReadCB without XLogRead, and this change would
break them.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-28 Thread Amit Kapila
On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
 wrote:
>
> On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
> >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
> > wrote:
> >>
> >> On 1/3/18 14:53, Tomas Vondra wrote:
> >> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> >> maintenance_work_mem is often set to very large values, which could
> >> >> then have undesirable side effects on this use.
> >> >
> >> > Well, we need to pick some default value, and we can either use a fixed
> >> > value (not sure what would be a good default) or tie it to an existing
> >> > GUC. We only really have work_mem and maintenance_work_mem, and the
> >> > walsender process will never use more than one such buffer. Which seems
> >> > to be closer to maintenance_work_mem.
> >> >
> >> > Pretty much any default value can have undesirable side effects.
> >>
> >> Let's just make it an independent setting unless we know any better.  We
> >> don't have a lot of settings that depend on other settings, and the ones
> >> we do have a very specific relationship.
> >>
> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> >> version of work_mem.  Maybe we could think of another name.
> >> >
> >> > I won't object to a better name, of course. Any proposals?
> >>
> >> logical_decoding_[work_]mem?
> >>
> >
> >Having a separate variable for this can give more flexibility, but
> >OTOH it will add one more knob which user might not have a good idea
> >to set.  What are the problems we see if directly use work_mem for
> >this case?
> >
>
> IMHO it's similar to autovacuum_work_mem - we have an independent
> setting, but most people use it as -1 so we use maintenance_work_mem as
> a default value. I think it makes sense to do the same thing here.
>
> It does add an extra knob anyway (I don't think we should just use
> maintenance_work_mem directly, the user should have an option to
> override it when needed). But most users will not notice.
>
> FWIW I don't think we should use work_mem; maintenance_work_mem seems
> somewhat more appropriate here (not related to queries, etc.).
>

I have the same concern about using maintenance_work_mem as Peter E.,
which is that the value of maintenance_work_mem will generally be
higher; that is suitable for its current purpose, but not for the
purpose this patch uses it for.  AFAIU, at this stage we want a better
memory accounting system for logical decoding and we are not sure what
a good value for this variable is.  So, I think using work_mem or
maintenance_work_mem should serve the purpose.  Later, if people need
better control over the memory required for this purpose, we can
introduce a new variable.

I understand that currently work_mem is primarily tied to memory
used for query workspaces, but it might be okay to extend it for this
purpose.  Another point is that its default sounds more appealing for
this case.  I can see the argument against it, which is that having a
separate variable will make things look cleaner and give better
control.  So, if we can't convince ourselves to use work_mem, we can
introduce a new GUC variable and keep the default as 4MB or work_mem.

It is always tempting to introduce a new GUC for a different task
unless an existing one is an exact match, but OTOH having fewer GUCs
has its own advantage: people don't have to bother with yet another
setting that they need to tune, especially one they can't easily
decide on.  I am not saying that we should never introduce a new GUC
when it is required, just that we should give it more thought before
doing so.
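
Just to make the autovacuum_work_mem-style option concrete, the fallback
would be about this much code (a sketch assuming a logical_decoding_work_mem
GUC in kB where -1 means "use maintenance_work_mem"):

/* -1 (the default) means: fall back to maintenance_work_mem. */
int			logical_decoding_work_mem = -1;

static int
logical_decoding_memory_limit_kb(void)
{
	return (logical_decoding_work_mem < 0) ? maintenance_work_mem
										   : logical_decoding_work_mem;
}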

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Instability of partition_prune regression test results

2019-09-28 Thread Amit Langote
On Sat, Sep 28, 2019 at 12:59 AM Tom Lane  wrote:
> Amit Langote  writes:
> > On Fri, Sep 27, 2019 at 7:25 AM Tom Lane  wrote:
> >> I experimented with adjusting explain_parallel_append() to filter
> >> more fields, but soon realized that we'd have to filter out basically
> >> everything that makes it useful to run EXPLAIN ANALYZE at all.
> >> Therefore, I think it's time to give up this testing methodology
> >> as a bad idea, and fall back to the time-honored way of running a
> >> plain EXPLAIN and then the actual query, as per the attached patch.
>
> > Isn't the point of using ANALYZE here to show that the exec-param
> > based run-time pruning is working (those "never executed" strings)?
>
> Hm.  Well, if you want to see those, we could do it as attached.

Perfect, thanks.

Regards,
Amit