Re: pgbench - allow to create partitioned tables

2019-09-27 Thread Amit Kapila
On Fri, Sep 27, 2019 at 7:05 PM Alvaro Herrera  wrote:
>
> On 2019-Sep-27, Amit Kapila wrote:
>
> > The other thing is that the query used in patch to fetch partition
> > information seems correct to me, but maybe there is a better way to
> > get that information.
>
> I hadn't looked at that, but yeah it seems that it should be using
> pg_partition_tree().
>

I think we might also need to use pg_get_partkeydef along with
pg_partition_tree to fetch the partition method information.  However,
I think that to find the reloid of pgbench_accounts in the current
search path, we might need to use some part of the query constructed by
Fabien.
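
For example, something along these lines might do it (just a sketch of
the idea, not the patch's actual query; the column aliases are mine):

    SELECT pg_get_partkeydef(c.oid) AS partition_method,
           count(*) FILTER (WHERE p.isleaf) AS partition_count
    FROM pg_class c,
         pg_partition_tree(c.oid) p
    WHERE c.relname = 'pgbench_accounts'
      AND c.relkind = 'p'
      AND pg_table_is_visible(c.oid)
    GROUP BY c.oid;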

Fabien, what do you think about Alvaro's suggestion?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Consider low startup cost in add_partial_path

2019-09-27 Thread Robert Haas
On Fri, Sep 27, 2019 at 2:24 PM James Coleman  wrote:
> Over in the incremental sort patch discussion we found [1] a case
> where a higher cost plan ends up being chosen because a low startup
> cost partial path is ignored in favor of a lower total cost partial
> path and a limit is applied on top of that, which would normally favor
> the lower startup cost plan.
>
> 45be99f8cd5d606086e0a458c9c72910ba8a613d originally added
> `add_partial_path` with the comment:
>
> > Neither do we need to consider startup costs:
> > parallelism is only used for plans that will be run to completion.
> > Therefore, this routine is much simpler than add_path: it needs to
> > consider only pathkeys and total cost.
>
> I'm not entirely sure if that is still true or not--I can't easily
> come up with a scenario in which it's not, but I also can't come up
> with an inherent reason why such a scenario cannot exist.

I think I just didn't think carefully about the Limit case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: max_parallel_workers question

2019-09-27 Thread Robert Haas
On Fri, Sep 27, 2019 at 8:07 PM Jeff Davis  wrote:
> The current docs for max_parallel_workers start out:
>
> "Sets the maximum number of workers that the system can support for
> parallel operations..."
>
> In my interpretation, "the system" means the entire cluster, but the
> max_parallel_workers setting is PGC_USERSET. That's a bit confusing,
> because two different backends can have different settings for "the
> maximum number ... the system can support".

Oops.

I intended it to mean "the entire cluster." Basically, how many
workers out of max_worker_processes are you willing to use for
parallel query, as opposed to other things. I agree that PGC_USERSET
doesn't make any sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




contrib/bloom Valgrind error

2019-09-27 Thread Peter Geoghegan
My Valgrind test script reports the following error, triggered from
within contrib/bloom's regression test suite on master as of right
now:

""
2019-09-27 20:53:50.910 PDT 9740 DEBUG:  building index "bloomidx" on
table "tst" serially
2019-09-27 20:53:51.049 PDT 9740 DEBUG:  CommitTransaction(1) name:
unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid:
20721/1/2
2019-09-27 20:53:51.052 PDT 9740 DEBUG:  StartTransaction(1) name:
unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2019-09-27 20:53:51.054 PDT 9740 LOG:  statement: ALTER INDEX bloomidx
SET (length=80);
==9740== VALGRINDERROR-BEGIN
==9740== Conditional jump or move depends on uninitialised value(s)
==9740==at 0x26D400: RangeVarGetRelidExtended (namespace.c:349)
==9740==by 0x33D084: AlterTableLookupRelation (tablecmds.c:3445)
==9740==by 0x4D0BC1: ProcessUtilitySlow (utility.c:)
==9740==by 0x4D0802: standard_ProcessUtility (utility.c:927)
==9740==by 0x4D083B: ProcessUtility (utility.c:360)
==9740==by 0x4CD0A4: PortalRunUtility (pquery.c:1175)
==9740==by 0x4CDBC0: PortalRunMulti (pquery.c:1321)
==9740==by 0x4CE7F9: PortalRun (pquery.c:796)
==9740==by 0x4CA3D9: exec_simple_query (postgres.c:1231)
==9740==by 0x4CB3BD: PostgresMain (postgres.c:4256)
==9740==by 0x4547DE: BackendRun (postmaster.c:4459)
==9740==by 0x4547DE: BackendStartup (postmaster.c:4150)
==9740==by 0x4547DE: ServerLoop (postmaster.c:1718)
==9740==by 0x455E2C: PostmasterMain (postmaster.c:1391)
==9740==by 0x3B94AC: main (main.c:210)
==9740==  Uninitialised value was created by a stack allocation
==9740==at 0x402F202: _PG_init (blutils.c:54)
==9740==
==9740== VALGRINDERROR-END
{
   
   Memcheck:Cond
   fun:RangeVarGetRelidExtended
   fun:AlterTableLookupRelation
   fun:ProcessUtilitySlow
   fun:standard_ProcessUtility
   fun:ProcessUtility
   fun:PortalRunUtility
   fun:PortalRunMulti
   fun:PortalRun
   fun:exec_simple_query
   fun:PostgresMain
   fun:BackendRun
   fun:BackendStartup
   fun:ServerLoop
   fun:PostmasterMain
   fun:main
}
""

I suspect that the recent commit 69f94108 is involved here, but I
haven't confirmed that explanation myself.

-- 
Peter Geoghegan




Re: ECPG installcheck tests fail if PGDATABASE is set

2019-09-27 Thread Peter Geoghegan
On Fri, Sep 27, 2019 at 3:03 PM Tom Lane  wrote:
> > Please do.
>
> After looking closer at the code in pg_regress.c, I'm wondering a bit
> about PGSERVICE.  A setting for that might certainly bring in a value
> for the database name, but I don't think we can just summarily unset it.
> I don't plan to do anything about that right now, but conceivably it'd
> bite people someday.

At least there will be some breadcrumbs to follow in the archive.

> Another thing that looks a bit fishy here is that the set of variables
> that pg_regress.c unsets is very much smaller than the set that libpq
> reacts to --- we have added a ton of the latter without touching this
> list (much less the three or four other places that duplicate it).
> I wonder how problematic that is.

Only time will tell, I suspect.

-- 
Peter Geoghegan




Re: Change atoi to strtol in same place

2019-09-27 Thread Joe Nelson
Alvaro Herrera wrote:
> ... can we have a new patch?  Not only because there seems to have
> been some discussion points that have gone unaddressed (?)

Yes, I'll work on it over the weekend.

> but also because Appveyor complains really badly about this one.
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.58672

Note that it requires functions from str2int-10.patch, and will not
compile when applied to master by itself. I didn't want to duplicate
functionality from that other uncommitted patch in mine.




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-09-27 Thread Peter Geoghegan
On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova
 wrote:
> Attached is v19.

Cool.

> * By default deduplication is on for non-unique indexes and off for
> unique ones.

I think that it makes sense to enable deduplication by default -- even
with unique indexes. It looks like deduplication can be very helpful
with non-HOT updates. I have been benchmarking this using more or less
standard pgbench at scale 200, with one big difference -- I also
create an index on "pgbench_accounts (abalance)". This is a low
cardinality index, which ends up about 3x smaller with the patch, as
expected. It also makes all updates non-HOT updates, greatly
increasing index bloat in the primary key of the accounts table --
this is what I found really interesting about this workload.
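
Roughly, the setup looks like this (only a sketch; the client counts,
duration and rate below are placeholders, not the exact values from my
test script):

    -- after initializing with: pgbench -i -s 200 test
    CREATE INDEX ON pgbench_accounts (abalance);
    -- abalance changes on every pgbench UPDATE, so with this index in
    -- place none of those updates can be HOT updates any more
    -- unthrottled run:   pgbench -c 16 -j 16 -T 3600 test
    -- rate-limited run:  pgbench -c 16 -j 16 -T 3600 -R 15000 test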

The theory behind deduplication within unique indexes seems quite
different to the cases we've focussed on so far -- that's why my
working copy of the patch makes a few small changes to how
_bt_dedup_one_page() works with unique indexes specifically (more on
that later). With unique indexes, deduplication doesn't help by
creating space -- it helps by creating *time* for garbage collection
to run before the real "damage" is done -- it delays page splits. This
is only truly valuable when page splits caused by non-HOT updates are
delayed by so much that they're actually prevented entirely, typically
because the _bt_vacuum_one_page() stuff can now happen before pages
split, not after. In general, these page splits are bad because they
degrade the B-Tree structure, more or less permanently (it's certainly
permanent with this workload). Having a huge number of page splits
*purely* because of non-HOT updates is particularly bad -- it's just
awful. I believe that this is the single biggest problem with the
Postgres approach to versioned storage (we know that other DB systems
have no primary key page splits with this kind of workload).

Anyway, if you run this pgbench workload without rate-limiting, so
that a patched Postgres does as much work as physically possible, the
accounts table primary key (pgbench_accounts_pkey) at least grows at a
slower rate -- the patch clearly beats master at the start of the
benchmark/test (as measured by index size). As the clients are ramped
up by my testing script, and as time goes on, eventually the size of
the pgbench_accounts_pkey index "catches up" with master. The patch
delays page splits, but ultimately the system as a whole cannot
prevent the page splits altogether, since the server is being
absolutely hammered by pgbench. Actually, the index is *exactly* the
same size for both the master case and the patch case when we reach
this "bloat saturation point". We can delay the problem, but we cannot
prevent it. But what about a more realistic workload, with
rate-limiting?

When I add some rate limiting, so that the TPS/throughput is at about
50% of what I got the first time around (i.e. 50% of what is
possible), or 15k TPS, it's very different. _bt_dedup_one_page() can
now effectively cooperate with _bt_vacuum_one_page(). Now
deduplication is able to "soak up all the extra garbage tuples" for
long enough to delay and ultimately *prevent* almost all page splits.
pgbench_accounts_pkey starts off at 428 MB for both master and patch
(CREATE INDEX makes it that size). After about an hour, the index is
447 MB with the patch. The master case ends up with a
pgbench_accounts_pkey size of 854 MB, though (this is very close to
857 MB, the "saturation point" index size from before).
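
(For reference, the index sizes quoted here are the kind of thing you
can watch during the run with a simple query like the one below.)

    SELECT pg_size_pretty(pg_relation_size('pgbench_accounts_pkey'));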

This is a very significant improvement, obviously -- the patch has an
index that is ~52% of the size seen for the same index with the master
branch!

Here is how I changed _bt_dedup_one_page() for unique indexes to get
this result:

* We limit the size of posting lists to 5 heap TIDs in the
checkingunique case. Right now, we will actually accept a
checkingunique page split before we'll merge together items that
result in a posting list with more heap TIDs than that (not sure about
these details at all, though).

* Avoid creating a new posting list that caller will have to split
immediately anyway (this is based on details of _bt_dedup_one_page()
caller's newitem tuple).

(Not sure how much this customization contributes to this favorable
test result -- maybe it doesn't make that much difference.)

The goal here is for duplicates that are close together in both time
and space to get "clumped together" into their own distinct, small-ish
posting list tuples with no more than 5 TIDs. This is intended to help
_bt_vacuum_one_page(), which is known to be a very important mechanism
for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set
very frequently within _bt_check_unique()). The general idea is to
balance deduplication against LP_DEAD killing, and to increase
spatial/temporal locality within these smaller posting lists. If we
have one huge posting list for each value, then we can't set the
LP_DEAD bit on anything at all, which is very bad.




max_parallel_workers question

2019-09-27 Thread Jeff Davis
The current docs for max_parallel_workers start out:

"Sets the maximum number of workers that the system can support for
parallel operations..."

In my interpretation, "the system" means the entire cluster, but the
max_parallel_workers setting is PGC_USERSET. That's a bit confusing,
because two different backends can have different settings for "the
maximum number ... the system can support".

max_parallel_workers is compared against the total number of parallel
workers in the system, which appears to be why the docs are worded that
way. But it's still confusing to me.
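
To make the confusion concrete (a contrived sketch, values made up):

    -- session 1
    SET max_parallel_workers = 8;
    -- session 2, at the same time
    SET max_parallel_workers = 2;
    -- each session now has its own idea of "the maximum number of workers
    -- that the system can support", even though the limit is enforced
    -- against the single cluster-wide pool of max_worker_processes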

If the purpose is to make sure parallel queries don't take up all of
the worker processes, perhaps we should rename the setting
reserved_worker_processes, and make it PGC_SUPERUSER.

If the purpose is to control execution within a backend, perhaps we
should just compare it to the count of parallel processes that the
backend is already using.

If the purpose is just to be a more flexible version of
max_worker_processes, maybe we should change it to PGC_SIGHUP?

If it has multiple purposes, perhaps we should have multiple GUCs?

Regards,
Jeff Davis






Re: ECPG installcheck tests fail if PGDATABASE is set

2019-09-27 Thread Tom Lane
Peter Geoghegan  writes:
> On Fri, Sep 27, 2019 at 2:39 PM Tom Lane  wrote:
>> I think I just forgot about this thread.  Shall we change it in HEAD
>> and see what happens?  Maybe backpatch, but not till after 12.0 is out.

> Please do.

After looking closer at the code in pg_regress.c, I'm wondering a bit
about PGSERVICE.  A setting for that might certainly bring in a value
for the database name, but I don't think we can just summarily unset it.
I don't plan to do anything about that right now, but conceivably it'd
bite people someday.

Another thing that looks a bit fishy here is that the set of variables
that pg_regress.c unsets is very much smaller than the set that libpq
reacts to --- we have added a ton of the latter without touching this
list (much less the three or four other places that duplicate it).
I wonder how problematic that is.

regards, tom lane




Re: ECPG installcheck tests fail if PGDATABASE is set

2019-09-27 Thread Andres Freund
On 2019-09-27 14:43:00 -0700, Peter Geoghegan wrote:
> On Fri, Sep 27, 2019 at 2:39 PM Tom Lane  wrote:
> > I think I just forgot about this thread.  Shall we change it in HEAD
> > and see what happens?  Maybe backpatch, but not till after 12.0 is out.
> 
> Please do.

+1




Re: Batch insert in CTAS/MatView code

2019-09-27 Thread Andres Freund
Hi,

On 2019-09-09 18:31:54 +0800, Paul Guo wrote:
> diff --git a/src/backend/access/heap/heapam.c 
> b/src/backend/access/heap/heapam.c
> index e9544822bf..8a844b3b5f 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -2106,7 +2106,6 @@ heap_multi_insert(Relation relation, TupleTableSlot 
> **slots, int ntuples,
> CommandId cid, int options, BulkInsertState 
> bistate)
>  {
>   TransactionId xid = GetCurrentTransactionId();
> - HeapTuple  *heaptuples;
>   int i;
>   int ndone;
>   PGAlignedBlock scratch;
> @@ -2115,6 +2114,10 @@ heap_multi_insert(Relation relation, TupleTableSlot 
> **slots, int ntuples,
>   SizesaveFreeSpace;
>   boolneed_tuple_data = RelationIsLogicallyLogged(relation);
>   boolneed_cids = 
> RelationIsAccessibleInLogicalDecoding(relation);
> + /* Declare it as static to let this memory be not on stack. */
> + static HeapTupleheaptuples[MAX_MULTI_INSERT_TUPLES];
> +
> + Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);
>  
>   /* currently not needed (thus unsupported) for heap_multi_insert() */
>   AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
> @@ -2124,7 +2127,6 @@ heap_multi_insert(Relation relation, TupleTableSlot 
> **slots, int ntuples,
>   
>HEAP_DEFAULT_FILLFACTOR);
>  
>   /* Toast and set header data in all the slots */
> - heaptuples = palloc(ntuples * sizeof(HeapTuple));
>   for (i = 0; i < ntuples; i++)
>   {
>   HeapTuple   tuple;

I don't think this is a good idea. We shouldn't unnecessarily allocate
8KB on the stack. Is there any actual evidence this is a performance
benefit? To me this just seems like it'll reduce the flexibility of the
API, without any benefit.  I'll also note that you've apparently not
updated tableam.h to document this new restriction.


Greetings,

Andres Freund




Re: Batch insert in CTAS/MatView code

2019-09-27 Thread Andres Freund
Hi,

On 2019-09-26 10:43:27 -0300, Alvaro Herrera wrote:
> On 2019-Sep-25, Asim R P wrote:
> 
> > I reviewed your patch today.  It looks good overall.  My concern is that
> > the ExecFetchSlotHeapTuple call does not seem appropriate.  In a generic
> > place such as createas.c, we should be using generic tableam API
> > only.

Indeed.


> > However, I can also see that there is no better alternative.  We need to
> > compute the size of accumulated tuples so far, in order to decide whether
> > to stop accumulating tuples.  There is no convenient way to obtain the
> > length of the tuple, given a slot.  How about making that decision solely
> > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> > altogether?
> 
> ... maybe we should add a new operation to slots, that returns the
> (approximate?) size of a tuple?

Hm, I'm not convinced that it's worth adding that as a dedicated
operation. It's not that clear what it'd exactly mean anyway - what
would it measure? As referenced in the slot? As if it were stored on
disk? etc?

I wonder if the right answer wouldn't be to just measure the size of a
memory context containing the batch slots, or something like that.


> That would make this easy.  (I'm not sure however what to do about
> TOAST considerations -- is it size in memory that we're worried
> about?)

The in-memory size is probably fine, because in all likelihood the
toasted cols are just going to point to on-disk datums, no?


> Also:
> 
> +   myState->mi_slots_size >= 65535)
> 
> This magic number should be placed in a define next to the other one,
> but I'm not sure that heapam.h is a good location, since surely this
> applies to matviews in other table AMs too.

Right. I think it'd be better to move this into an AM independent place.


Greetings,

Andres Freund




Re: ECPG installcheck tests fail if PGDATABASE is set

2019-09-27 Thread Peter Geoghegan
On Fri, Sep 27, 2019 at 2:39 PM Tom Lane  wrote:
> I think I just forgot about this thread.  Shall we change it in HEAD
> and see what happens?  Maybe backpatch, but not till after 12.0 is out.

Please do.

-- 
Peter Geoghegan




Re: ECPG installcheck tests fail if PGDATABASE is set

2019-09-27 Thread Tom Lane
Peter Geoghegan  writes:
> On Sun, Mar 18, 2018 at 5:28 PM Tom Lane  wrote:
>> Don't think I like ecpg's tests behaving differently in this respect
>> than the rest of them do; that seems like a recipe for unrecognized
>> security issues.
>> 
>> If nobody can think of a positive reason for pg_regress not to
>> unset PGDATABASE unconditionally, let's try that and see how it
>> goes.

> It would be nice to get this fixed. Several people have been confused
> by it at this point.

I think I just forgot about this thread.  Shall we change it in HEAD
and see what happens?  Maybe backpatch, but not till after 12.0 is out.

regards, tom lane




Re: ECPG installcheck tests fail if PGDATABASE is set

2019-09-27 Thread Peter Geoghegan
On Sun, Mar 18, 2018 at 5:28 PM Tom Lane  wrote:
> Don't think I like ecpg's tests behaving differently in this respect
> than the rest of them do; that seems like a recipe for unrecognized
> security issues.
>
> If nobody can think of a positive reason for pg_regress not to
> unset PGDATABASE unconditionally, let's try that and see how it
> goes.

It would be nice to get this fixed. Several people have been confused
by it at this point.


-- 
Peter Geoghegan




Re: [HACKERS] "may be unused" warnings for gcc

2019-09-27 Thread Andres Freund
Hi,

On 2017-02-22 09:26:10 -0500, Peter Eisentraut wrote:
> On 2/21/17 22:17, Andres Freund wrote:
> > I've not run comparisons this year, but late last year I was seeing > 5%
> > < 10% benefits - that seems plenty enough to care.
> 
> You mean the 5-minute benchmarks on my laptop are not representative? ;-)
> 
> Here is a patch that I had lying around that clears the compiler
> warnings under -O3 for me.  It seems that they are a subset of what you
> are seeing.  Plausibly, as compilers are doing more analysis in larger
> scopes, we can expect to see more of these.

I pushed the subset that I still see locally with gcc -O3.

Greetings,

Andres Freund




Re: recovery starting when backup_label exists, but not recovery.signal

2019-09-27 Thread David Steele
Hi Peter,

On 9/27/19 3:35 PM, Peter Eisentraut wrote:
> On 2019-09-27 10:34, Fujii Masao wrote:
>>> Also, are you sure this is a new behavior?
>> In v11 or before, if backup_label exists but not recovery.conf,
>> the startup process doesn't enter an archive recovery mode. It starts
>> crash recovery in that case. So the behavior is somewhat different
>> between versions.
> 
> Can you bisect this?  I have traced through xlog.c in both versions and
> I don't see how this logic is any different in any obvious way.

What I've been seeing is that the underlying logic isn't different but
there are more ways to get into it.

Previously, there was no archive/targeted recovery without
recovery.conf, but now there are several ways to get to archive/targeted
recovery, i.e., making the recovery settings GUCs has bypassed controls
that previously had limited how they could be used and when.

The issues on the other thread [1], at least, were all introduced in
2dedf4d9.

Regards,
-- 
-David
da...@pgmasters.net

[1]
https://www.postgresql.org/message-id/flat/e445616d-023e-a268-8aa1-67b8b335340c%40pgmasters.net




Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-09-27 Thread Alvaro Herrera
On 2019-Sep-27, Paul Guo wrote:

> I'm using --no-ensure-shutdown in the new version unless there are better
> suggestions.

That sounds sufficiently good.  I pushed this patch, after fixing a few
smallish problems, such as an assertion failure because of the
terminating \n in the error message when "postgres --single" fails
(which I tested by introducing a typo in the command).  I also removed
the short option, because I doubt that this option is useful enough to
warrant using up such an important shorthand (Maybe if it had been
-\ or -% or -& I would have let it through, since I doubt anybody would
have wanted to use those for anything else).  But if somebody disagrees,
they can send a patch to restore it, and we can then discuss the merits
of individual chars to use.

I also added quotes to DEVNULL, because we do that everywhere.  Maybe
there exists a system somewhere that requires this ... !!??

Finally, I split out the command in the error message in case it fails.

Thanks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovery starting when backup_label exists, but not recovery.signal

2019-09-27 Thread Peter Eisentraut
On 2019-09-27 10:34, Fujii Masao wrote:
>> Also, are you sure this is a new behavior?
> In v11 or before, if backup_label exists but not recovery.conf,
> the startup process doesn't enter an archive recovery mode. It starts
> crash recovery in that case. So the behavior is somewhat different
> between versions.

Can you bisect this?  I have traced through xlog.c in both versions and
I don't see how this logic is any different in any obvious way.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Attempt to consolidate reading of XLOG page

2019-09-27 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Sep-27, Antonin Houska wrote:
>>> You placed the errinfo in XLogRead's stack rather than its callers' ...
>>> I don't think that works, because as soon as XLogRead returns that
>>> memory is no longer guaranteed to exist.

>> I was aware of this problem, therefore I defined the field as static:
>> 
>> +XLogReadError *
>> +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
>> +WALOpenSegment *seg, WALSegmentContext *segcxt,
>> +WALSegmentOpen openSegment)
>> +{
>> +   char   *p;
>> +   XLogRecPtr  recptr;
>> +   Sizenbytes;
>> +   static XLogReadError errinfo;

> I see.

That seems like an absolutely terrible "fix".  We don't really want
XLogRead to be defined in a way that forces it to be non-reentrant do we?

regards, tom lane




Re: Hooks for session start and end, take two

2019-09-27 Thread legrand legrand
Michael Paquier-2 wrote
> On Thu, Sep 26, 2019 at 09:57:57AM -0700, legrand legrand wrote:
>> Does that mean that all processes seen in pg_stat_activity like
>> - autovacuum launcher
>> - logical replication launcher
>> - background writer
>> - checkpointer
>> - walwriter
>> ... 
>> - Parallel worker
>> are available with that hook (it seems not) ?
> 
> All processes using PostgresMain() for their startup take this code
> path like WAL senders and normal backend sessions, but not things
> going through StartChildProcess() (WAL receiver, bgwriter, etc.) or
> other processes like autovacuum processes which use a different start
> code path.

OK I confirm:
- "client backend" appears at session start and end hook,
- "autovacuum worker" and "pg_background" only appears at session end hook
  (backend_start can be retreived from pg_stat_activity),
- "parallel workers" are not visible at all
  (because extension cannot assign XIDs during a parallel operation)

All seems fine to me.
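
For reference, the backend types mentioned above are the ones reported
by pg_stat_activity, e.g.:

    SELECT DISTINCT backend_type FROM pg_stat_activity;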

Regards
PAscal






--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: Improving on MAX_CONVERSION_GROWTH

2019-09-27 Thread Tom Lane
Robert Haas  writes:
> On Fri, Sep 27, 2019 at 11:40 AM Andres Freund  wrote:
>> Note that one of the additional reasons for the 1GB limit is that it
>> protects against int overflows. I'm somewhat unconvinced that that's a
>> sensible approach, but ...

> It's not crazy. People use 'int' rather casually, just as they use
> 'palloc' rather casually, without necessarily thinking about what
> could go wrong at the edges. I don't have any beef with that as a
> general strategy; I just think we should be trying to do better in the
> cases where it negatively affects the user experience.

A small problem with doing anything very interesting here is that the
int-is-enough-for-a-string-length approach is baked into the wire
protocol (read the DataRow message format spec and weep).

We could probably bend the COPY protocol enough to support multi-gig row
values --- dropping the rule that the backend doesn't split rows across
CopyData messages wouldn't break too many clients, hopefully.  That would
at least dodge some problems in dump/restore scenarios.

In the meantime, I still think we should commit what I proposed in the
other thread (<974.1569356...@sss.pgh.pa.us>), or something close to it.
Andres' proposal would perhaps be an improvement on that, but I don't
think it'll be ready anytime soon; and for sure we wouldn't risk
back-patching it, while I think we could back-patch what I suggested.
In any case, that patch is small enough that dropping it would be no big
loss if a better solution comes along.

Also, as far as the immediate subject of this thread is concerned,
I'm inclined to get rid of MAX_CONVERSION_GROWTH in favor of using
the target encoding's max char length.  The one use (in printtup.c)
where we don't know the target encoding could use MAX_MULTIBYTE_CHAR_LEN
instead.  Being smarter than that could help in some cases (mostly,
conversion of ISO encodings to UTF8), but it's not that big a win.
(I did some checks and found that some ISO encodings could provide a
max growth of 2x, but many are max 3x, so 4x isn't that far out of
line.)  If Andres' ideas don't pan out we could come back and work
harder on this, but for now something simple and back-patchable
seems like a useful stopgap improvement.

regards, tom lane




Re: Attempt to consolidate reading of XLOG page

2019-09-27 Thread Alvaro Herrera
On 2019-Sep-27, Antonin Houska wrote:

> > You placed the errinfo in XLogRead's stack rather than its callers' ...
> > I don't think that works, because as soon as XLogRead returns that
> > memory is no longer guaranteed to exist.
> 
> I was aware of this problem, therefore I defined the field as static:
> 
> +XLogReadError *
> +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
> +WALOpenSegment *seg, WALSegmentContext *segcxt,
> +WALSegmentOpen openSegment)
> +{
> +   char   *p;
> +   XLogRecPtr  recptr;
> +   Sizenbytes;
> +   static XLogReadError errinfo;

I see.

> > You need to allocate the struct in the callers stacks and pass its address
> > to XLogRead.  XLogRead can return NULL if everything's okay or the pointer
> > to the errinfo struct.
> 
> I didn't choose this approach because that would add one more argument to the
> function.

Yeah, the signature does seem a bit unwieldy.  But I wonder if that's
too terrible a problem, considering that this code is incurring a bunch
of syscalls in the best case anyway.

BTW that tli_p business to the openSegment callback is horribly
inconsistent.  Some callers accept a NULL tli_p, others will outright
crash, even though the API docs say that the callback must determine the
timeline.  This is made more complicated by us having the TLI in "seg"
also.  Unless I misread, the problem is again that the walsender code is
doing nasty stuff with globals (endSegNo).  As a very minor stylistic
point, we prefer to have out params at the end of the signature.

> > Why do we leave this responsibility to ReadPageInternal?  Wouldn't it
> > make more sense to leave XLogRead be always responsible for setting
> > these correctly, and remove those lines from ReadPageInternal?
> 
> I think there's no rule that ReadPageInternal() must use XLogRead(). If we do
> what you suggest, we need to make this responsibility documented. I'll consider
> that.

Hmm.  Thanks.

> > (BTW "is called by the XLOG reader" is a bit strange in code that appears in
> > xlogreader.c).
> 
> ok, "called by XLogPageReadCB callback" would be more accurate. Not sure if
> we'll eventually need this phrase in the comment at all.

I think that would be slightly clearer.  But if we can force this code
into actually making sense, that would be much better.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Hooks for session start and end, take two

2019-09-27 Thread Mike Palmiotto
On Fri, Sep 27, 2019 at 12:54 AM Michael Paquier  wrote:
>
> On Thu, Sep 26, 2019 at 09:57:57AM -0700, legrand legrand wrote:
> > Does that mean that all processes seen in pg_stat_activity like
> > - autovacuum launcher
> > - logical replication launcher
> > - background writer
> > - checkpointer
> > - walwriter
> > ...
> > - Parallel worker
> > are available with that hook (it seems not) ?
>
> All processes using PostgresMain() for their startup take this code
> path like WAL senders and normal backend sessions, but not things
> going through StartChildProcess() (WAL receiver, bgwriter, etc.) or
> other processes like autovacuum processes which use a different start
> code path.

There is another patchset submitted to this CF
(https://commitfest.postgresql.org/24/2259/) that attempts to
centralize all process startup to make use of the StartChildProcess
entrypoint. The patchset then introduces a fork_process hook that
would be hit by every backend type immediately after forking. Looking
at your patch, it seems like there won't be too much conflict between
the two, but I wonder if we can collaborate on the hook portion to
achieve both of our goals?

As noted in the related thread, I'm currently breaking out the
centralization patch into separate pieces: one for the infrastructure
change (including MyAuxProc adoption of the centralized startup), one
for each process type, and one for the fork_process hook. The
substance of the patchset will remain mostly the same if you want to
take a look.

I look forward to hearing your thoughts on the matter.

Thanks,
-- 
Mike Palmiotto
https://crunchydata.com




Skip recovery/standby signal files in pg_basebackup

2019-09-27 Thread David Steele
Hackers,

Restoring these files could cause surprising behaviors so it seems best
to let the restore process create them when needed.

Patch is attached.

-- 
-David
da...@pgmasters.net
From 1b3d95607afeb4f1fa98c3ea006f310ab68252f7 Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Fri, 27 Sep 2019 14:49:07 -0400
Subject: [PATCH] Ignore recovery/standby signal files in pg_basebackup.

Restoring these files could cause surprising behaviors so it seems best
to let the restore process create them when needed.
---
 src/backend/replication/basebackup.c |  7 +++
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 14 +++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d0f210de8c..65075b1770 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -183,6 +183,13 @@ static const char *const excludeFiles[] =
/* Skip relation cache because it is rebuilt on startup */
RELCACHE_INIT_FILENAME,
 
+   /*
+* Skip recovery/standby signal files. These files should be created after
+* restore if they are required.
+*/
+   RECOVERY_SIGNAL_FILE,
+   STANDBY_SIGNAL_FILE,
+
/*
 * If there's a backup_label or tablespace_map file, it belongs to a
 * backup started by the user with pg_start_backup().  It is *not* correct
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b7d36b65dd..6bccaef15a 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 108;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -65,7 +65,8 @@ $node->restart;
 
 # Write some files to test that they are not copied.
 foreach my $filename (
-   qw(backup_label tablespace_map postgresql.auto.conf.tmp current_logfiles.tmp)
+   qw(backup_label tablespace_map postgresql.auto.conf.tmp current_logfiles.tmp
+   recovery.signal standby.signal)
   )
 {
open my $file, '>>', "$pgdata/$filename";
@@ -135,11 +136,18 @@ foreach my $dirname (
 # These files should not be copied.
 foreach my $filename (
qw(postgresql.auto.conf.tmp postmaster.opts postmaster.pid tablespace_map current_logfiles.tmp
-   global/pg_internal.init))
+   global/pg_internal.init recovery.signal standby.signal))
 {
ok(!-f "$tempdir/backup/$filename", "$filename not copied");
 }
 
+# Remove recovery/standby signal files so they don't break further testing.
+foreach my $filename (qw(recovery.signal standby.signal))
+{
+   unlink("$pgdata/$filename")
+ or BAIL_OUT("unable to unlink $pgdata/backup/$filename");
+}
+
 # Unlogged relation forks other than init should not be copied
 ok(-f "$tempdir/backup/${baseUnloggedPath}_init",
'unlogged init fork in backup');
-- 
2.20.1 (Apple Git-117)



Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-27 Thread James Coleman
On Mon, Sep 16, 2019 at 6:32 AM Tomas Vondra
 wrote:
>
> On Sun, Sep 15, 2019 at 09:33:33PM -0400, James Coleman wrote:
> >On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
> > wrote:
> >> >> ...
> >> >>
> >> >> I think this may be a thinko, as this plan demonstrates - but I'm not
> >> >> sure about it. I wonder if this might be penalizing some other types of
> >> >> plans (essentially anything with limit + gather).
> >> >>
> >> >> Attached is a WIP patch fixing this by considering both startup and
> >> >> total cost (by calling compare_path_costs_fuzzily).
> >> >
> >> >It seems to me that this is likely a bug, and not just a changed
> >> >needed for this. Do you think it's better addressed in a separate
> >> >thread? Or retain it as part of this patch for now (and possibly break
> >> >it out later)? On the other hand, it's entirely possible that someone
> >> >more familiar with parallel plan limitations could explain why the
> >> >above comment holds true. That makes me lean towards asking in a new
> >> >thread.
> >> >
> >>
> >> Maybe. I think creating a separate thread would be useful, provided we
> >> manage to demonstrate the issue without an incremental sort.
> >
> >I did some more thinking about this, and I can't currently come up
> >with a way to reproduce this issue outside of this patch. It doesn't
> >seem reasonable to me to assume that there's anything inherent about
> >this patch that means it's the only way we can end up with a partial
> >path with a low startup cost we'd want to prefer.
> >
> >Part of me wants to pull it over to a separate thread just to get
> >additional feedback, but I'm not sure how useful that is given we
> >don't currently have an example case outside of this patch.
> >
>
> Hmm, I see.
>
> While I initially suggested to start a separate thread only if we have
> example not involving an incremental sort, that's probably not a hard
> requirement. I think it's fine to start a thead briefly explaining the
> issue, and pointing to incremental sort thread for actual example.
>
> >
> >One thing to note though: the current patch does not also modify
> >add_partial_path_precheck which also does not take into account
> >startup cost, so we probably need to update that for completeness's
> >sake.
> >
>
> Good point. It does indeed seem to make the same assumption about only
> comparing total cost before calling add_path_precheck.

I've started a new thread to discuss:
https://www.postgresql.org/message-id/CAAaqYe-5HmM4ih6FWp2RNV9rruunfrFrLhqFXF_nrrNCPy1Zhg%40mail.gmail.com

James




v12 relnotes: alter system tables

2019-09-27 Thread Justin Pryzby
https://www.postgresql.org/docs/12/release-12.html

|Allow modifications of system catalogs' options using ALTER TABLE (Peter 
Eisentraut)
|Modifications of catalogs' reloptions and autovacuum settings are now 
supported.

I wonder if that should say: "... WHEN ALLOW_SYSTEM_TABLE_MODS IS ENABLED."

Justin




Consider low startup cost in add_partial_path

2019-09-27 Thread James Coleman
Over in the incremental sort patch discussion we found [1] a case
where a higher cost plan ends up being chosen because a low startup
cost partial path is ignored in favor of a lower total cost partial
path and a limit is applied on top of that, which would normally favor
the lower startup cost plan.

45be99f8cd5d606086e0a458c9c72910ba8a613d originally added
`add_partial_path` with the comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily
come up with a scenario in which it's not, but I also can't come up
with an inherent reason why such a scenario cannot exist.

We could just continue to include this change as part of the
incremental sort patch itself, but it seemed worth it to me to break
it out for some more targeted discussion, and also include Robert as
the initial author of add_partial_path in the hopes that maybe we
could retrieve some almost 4-year-old memories on why this was
inherently true then, and maybe that would shed some light on whether
it's still inherently true.

I've attached a patch (by Tomas Vondra, also cc'd) to consider startup
cost in add_partial_path, but should we apply the patch we'll also
likely need to apply the same kind of change to
add_partial_path_precheck.

James Coleman

[1]: 
https://www.postgresql.org/message-id/20190720132244.3vgg2uynfpxh3me5%40development


consider-startup-cost-in-add-partial-path_v1.patch
Description: Binary data


Re: A problem presentaion about ECPG, DECLARE STATEMENT

2019-09-27 Thread Bruce Momjian
On Mon, Sep 16, 2019 at 01:12:17PM +0200, Peter Eisentraut wrote:
> Moreover, I've been wondering about the behavior detail given in the
> table at
> .
>  In scenario 3, the declare statement says con1 but the subsequent
> dynamic statement says con2, and as a result of that, con1 is used.
> This is not intuitive, I'd say, but given that there is no indication of
> where this statement came from or whose idea it follows, it's unclear
> how to evaluate that.

FYI, I was totally confused by this also when researching this for the
PG 12 release notes.  I am glad we are going to redo it for PG 13.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: recovery starting when backup_label exists, but not recovery.signal

2019-09-27 Thread David Steele
On 9/27/19 4:34 AM, Fujii Masao wrote:
> On Fri, Sep 27, 2019 at 3:36 AM David Steele  wrote:
>>
>> On 9/24/19 1:25 AM, Fujii Masao wrote:
>>>
>>> When backup_label exists, the startup process enters archive recovery mode
>>> even if recovery.signal file doesn't exist. In this case, the startup 
>>> process
>>> tries to retrieve WAL files by using restore_command. Then, at the beginning
>>> of the archive recovery, the contents of backup_label are copied to 
>>> pg_control
>>> and backup_label file is removed. This would be an intentional behavior.
>>
>>> But I think the problem is that, if the server shuts down during that
>>> archive recovery, the restart of the server may cause the recovery to fail
>>> because neither backup_label nor recovery.signal exist and the server
>>> doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
>>>
>>> So the problematic scenario is;
>>>
>>> 1. the server starts with backup_label, but not recovery.signal.
>>> 2. the startup process enters an archive recovery mode because
>>> backup_label exists.
>>> 3. the contents of backup_label are copied to pg_control and
>>> backup_label is deleted.
>>
>> Do you mean deleted or renamed to backup_label.old?
> 
> Sorry for the confusing wording..
> I meant the following code that renames backup_label to .old, in 
> StartupXLOG().

Right, that makes sense.

>>
>> I assume you have a repro?  Can you give more details?
> 
> What I did is:
> 
> 1. Start PostgreSQL server with WAL archiving enabled.
> 2. Take an online backup by using pg_basebackup, for example,
>  $ pg_basebackup -D backup
> 3. Execute many write SQL statements to generate lots of WAL files. During that 
> execution,
> perform CHECKPOINT to remove some WAL files from pg_wal directory.
> You need to repeat these until you confirm that there are many WAL files
> that have already been removed from pg_wal but exist only in the archive area.
>  4. Shutdown the server.
>  5. Remove PGDATA and restore it from backup.
>  6. Set up restore_command.
>  7. (Forget to put recovery.signal)
>  That is, in this scenario, you want to recover the database up to
>  the latest WAL records in the archive area. So you need to start archive
>  recovery by setting restore_command and putting recovery.signal.
>  But the problem happens when you forget to put recovery.signal.
>  8. Start PostgreSQL server.
>  9. Shutdown the server while it's restoring archived WAL files and replaying
>  them. At this point, you will notice that the archive recovery starts
>  even though recovery.signal doesn't exist. So even archived WAL files
>  are successfully restored at this step.
>  10. Restart PostgreSQL server. Since neither backup_label nor recovery.signal
> exists, crash recovery starts and fails to restore the archived WAL 
> files.
>    So you fail to recover the database up to the latest WAL record in the
>    archive directory. The recovery will finish at an early point.

Yes, I see it now.  I did not have enough WAL to make it work before, as
I suspected.

>>> One idea to fix this issue is to make the above step #3 remember that
>>> backup_label existed, in pg_control. Then we should make the subsequent
>>> recovery enter an archive recovery mode if pg_control indicates that
>>> even if neither backup_label nor recovery.signal exist. Thought?
>>
>> That seems pretty invasive to me at this stage.  I'd like to reproduce
>> it and see if there are alternatives.
>>
>> Also, are you sure this is a new behavior?
> 
> In v11 or before, if backup_label exists but not recovery.conf,
> the startup process doesn't enter an archive recovery mode. It starts
> crash recovery in that case. So the behavior is somewhat different
> between versions.

Agreed.  Since recovery options can be used in the presence of
backup_label *or* recovery.signal (or standby.signal for that matter)
this does represent a change in behavior.  And it doesn't appear to be a
beneficial change.

Regards,
-- 
-David
da...@pgmasters.net




Re: recovery starting when backup_label exists, but not recovery.signal

2019-09-27 Thread David Steele
On 9/27/19 4:41 AM, Fujii Masao wrote:
> On Fri, Sep 27, 2019 at 4:07 PM Masahiko Sawada  wrote:
>>
>> On Fri, Sep 27, 2019 at 3:36 AM David Steele  wrote:
>>>
>>> On 9/24/19 1:25 AM, Fujii Masao wrote:

 When backup_label exists, the startup process enters archive recovery mode
 even if recovery.signal file doesn't exist. In this case, the startup 
 process
 tries to retrieve WAL files by using restore_command. Then, at the 
 beginning
 of the archive recovery, the contents of backup_label are copied to 
 pg_control
 and backup_label file is removed. This would be an intentional behavior.
>>>
 But I think the problem is that, if the server shuts down during that
 archive recovery, the restart of the server may cause the recovery to fail
 because neither backup_label nor recovery.signal exist and the server
 doesn't enter an archive recovery mode. Is this intentional, too? Seems No.

 So the problematic scenario is;

 1. the server starts with backup_label, but not recovery.signal.
 2. the startup process enters an archive recovery mode because
 backup_label exists.
 3. the contents of backup_label are copied to pg_control and
 backup_label is deleted.
>>>
>>> Do you mean deleted or renamed to backup_label.old?
>>>
 4. the server shuts down..
>>>
>>> This happens after the cluster has reached consistency?
>>>
 5. the server is restarted. neither backup_label nor recovery.signal exist.
 6. the startup process starts just crash recovery because neither 
 backup_label
 nor recovery.signal exist. Since it cannot retrieve WAL files from 
 archival
 area, it may fail.
>>>
>>> I tried a few ways to reproduce this but was not successful without
>>> manually removing WAL.
>>
>> Hmm me too. I think that since we enter crash recovery at step #6 we
>> don't retrieve WAL files from archival area.
>>
>> But I reproduced the problem Fujii-san mentioned that the restart of
>> the server during archive recovery causes crash recovery to run
>> instead of resuming the archive recovery.
> 
> Yes, it's strange and unexpected to start crash recovery
> when restarting archive recovery. Archive recovery should
> start again in that case, I think.

+1

-- 
-David
da...@pgmasters.net




Re: Attempt to consolidate reading of XLOG page

2019-09-27 Thread Antonin Houska
Alvaro Herrera  wrote:

> On 2019-Sep-26, Antonin Houska wrote:

> You placed the errinfo in XLogRead's stack rather than its callers' ...
> I don't think that works, because as soon as XLogRead returns that
> memory is no longer guaranteed to exist.

I was aware of this problem, therefore I defined the field as static:

+XLogReadError *
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
+WALOpenSegment *seg, WALSegmentContext *segcxt,
+WALSegmentOpen openSegment)
+{
+   char   *p;
+   XLogRecPtr  recptr;
+   Sizenbytes;
+   static XLogReadError errinfo;

> You need to allocate the struct in the callers stacks and pass its address
> to XLogRead.  XLogRead can return NULL if everything's okay or the pointer
> to the errinfo struct.

I didn't choose this approach because that would add one more argument to the
function.

> I've been wondering if it's really necessary to pass 'seg' to the
> openSegment() callback.  Only walsender wants that, and it seems ...
> weird.  Maybe that's not something for this patch series to fix, but it
> would be good to find a more decent way to do the TLI switch at some
> point.

Good point. Since walsender.c already has the "sendSeg" global variable, maybe
we can let WalSndSegmentOpen() use this one, and remove the "seg" argument
from the callback.

> > +   /*
> > +* If the function is called by the XLOG reader, the reader will
> > +* eventually set both "ws_segno" and "ws_off", however the XLOG
> > +* reader is not necessarily involved. Furthermore, we need to set
> > +* the current values for this function to work.
> > +*/
> > +   seg->ws_segno = nextSegNo;
> > +   seg->ws_off = 0;
> 
> Why do we leave this responsibility to ReadPageInternal?  Wouldn't it
> make more sense to leave XLogRead be always responsible for setting
> these correctly, and remove those lines from ReadPageInternal?

I think there's no rule that ReadPageInternal() must use XLogRead(). If we do
what you suggest, we need to make this responsibility documented. I'll consider
that.

> (BTW "is called by the XLOG reader" is a bit strange in code that appears in
> xlogreader.c).

ok, "called by XLogPageReadCB callback" would be more accurate. Not sure if
we'll eventually need this phrase in the comment at all.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: pg_upgrade issues

2019-09-27 Thread Bruce Momjian
On Mon, Sep  9, 2019 at 10:58:02AM -0400, Dave Cramer wrote:
> pgjdbc has a bug report which is as follows:
> 
> The database has a table that has a description and a constraint.
> The constraint also has a description.
> 
> somehow the constraint and the table end up with the same OIDs after
> pg_upgrade. 
> 
> My understanding of pg_upgrade suggests that shouldn't happen? I realize OIDs
> are not guaranteed to be unique, but this seems to be quite a coincidence.

Uh, the table and the table constraint have the same pg_description oid?
pg_upgrade just restores the pg_description descriptions and doesn't
modify them.  Do you get an error on restore because of the duplicate
pg_description oids?
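
Something along these lines might show whether that's what happened
(just a sketch, assuming the duplicate really shows up in pg_description):

    SELECT objoid, classoid::regclass, objsubid, description
    FROM pg_description
    WHERE objoid IN (SELECT objoid
                     FROM pg_description
                     GROUP BY objoid
                     HAVING count(DISTINCT classoid) > 1);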

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Document recovery_target_action behavior?

2019-09-27 Thread David Steele
Hackers,

In versions < PG12 recovery_target_action has a behavior that appears to
be a bug, or is at least undocumented.  If hot_standby = off and
recovery_target_action is not specified then the cluster will promote
when the target is found rather than shutting down as the documentation
seems to indicate.  If recovery_target_action is explicitly set to pause
then the cluster will shut down as expected.

In PG12 the shutdown occurs even when recovery_target_action is not
explicitly set.  This seems like good behavior and it matches the
documentation as I read it.

The question for the old versions: is this something that should be
fixed in the code or in the documentation?

My vote is to make this explicit in the documentation, since changing
the recovery behavior in old versions could lead to nasty surprises.

Regards,
-- 
-David
da...@pgmasters.net




Re: Minimal logical decoding on standbys

2019-09-27 Thread Robert Haas
On Fri, Sep 27, 2019 at 12:41 PM Amit Khandekar  wrote:
> Preferably I want wait_for_xmins() to get rid of the $node parameter,
> because we can deduce it using slot name. But that requires having
> get_node_from_slotname(). Your suggestion was to remove
> get_node_from_slotname() and add back the $node param so as to reduce
> duplicate code. Instead, how about keeping  wait_for_xmins() in the
> PostgresNode.pm() ? This way, we won't have duplication, and also we
> can get rid of param $node. This is just my preference; if you are
> quite inclined to not have get_node_from_slotname(), I will go with
> your suggestion.

I'd be inclined not to have it.  I think having a lookup function to
go from slot name -> node is strange; it doesn't really simplify
things that much for the caller, and it makes the logic harder to
follow. It would break outright if you had the same slot name on
multiple nodes, which is a perfectly reasonable scenario.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-27 Thread James Coleman
On Mon, Sep 9, 2019 at 5:55 PM Tomas Vondra
 wrote:
> The "patched" column means all developer GUCs disabled, so it's expected
> to produce the same plan as master (and it is). And then there's one
> column for each developer GUC. If the column is just TRUE it means the
> GUC does not affect any of the synthetic queries. There are 4 of them:
>
> - devel_add_paths_to_grouping_rel_parallel
> - devel_create_partial_grouping_paths
> - devel_gather_grouping_paths
> - devel_standard_join_search
>
> The places controlled by those GUCs are either useless, or the query
> affected by them is not included in the list of queries.

I'd previously found (in my reverse engineering efforts) the query:

select *
from tenk1 t1
join tenk1 t2 on t1.hundred = t2.hundred
join tenk1 t3 on t1.hundred = t3.hundred
order by t1.hundred, t1.twenty
limit 50;

can change plans to use incremental sort when
generate_useful_gather_paths() is added to standard_join_search().
Specifically, we get a merge join between t1 and t3 as the top level
(besides limit) node where the driving side of the join is a gather
merge with incremental sort. This does rely on these gucs set in the
test harness:

set local max_parallel_workers_per_gather=4;
set local min_parallel_table_scan_size=0;
set local parallel_tuple_cost=0;
set local parallel_setup_cost=0;

So I think we can reduce the number of unused gucs to 3.

James




Re: patch: psql - enforce constant width of last column

2019-09-27 Thread Bruce Momjian
On Tue, Sep 17, 2019 at 05:15:42PM +0200, Pavel Stehule wrote:
> 
> 
> On Tue, Sep 17, 2019 at 5:06 PM Ahsan Hadi  wrote:
> 
> Hi Pavel,
> 
> I have been trying to reproduce the case of badly displaying last columns
> of a query result-set. I played around with the legal values for psql
> border variable but not able to find a case where last columns are badly
> displayed. Can you please share an example that I can use to reproduce 
> this
> problem. I will try out your patch once I am able to reproduce the 
> problem.
> 
> 
> you need to use pspg, and vertical cursor.
> 
> https://github.com/okbob/pspg
> vertical cursor should be active
> 
> \pset border 1
> \pset linestyle ascii
> \pset pager always
> 
> select * from generate_series(1,3);

I was able to reproduce the failure, but with a little more work:

$ export PSQL_PAGER='pspg --vertical-cursor'
$ psql test
\pset border 1
\pset linestyle ascii
\pset pager always
select * from generate_series(1,3);

Line '1' has highlighted trailing space.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Standby accepts recovery_target_timeline setting?

2019-09-27 Thread David Steele
On 9/27/19 11:58 AM, Fujii Masao wrote:
> On Sat, Sep 28, 2019 at 12:14 AM David Steele  wrote:
>>
>> I think, at the very least, the fact that targeted recovery proceeds in
>> the absence of recovery.signal represents a bug.
> 
> Yes, recovery target settings are used even when neither backup_label
> nor recovery.signal exist, i.e., just a crash recovery, in v12. This is
> completely different behavior from prior versions.

I'm not able to reproduce this.  I only see recovery settings being used
if backup_label, recovery.signal, or standby.signal is present.

Do you have an example?

> IMO, since v12 is RC1 now, it's not a good idea to change the logic to a new one.
> So at least for v12, we basically should change the recovery logic so that
> it behaves in the same way as prior versions. That is,
> 
> - Stop the recovery with an error if any recovery target is set in
>crash recovery

This seems reasonable.  I tried adding a recovery.signal and
restore_command for crash recovery and I just got an error that it
couldn't find 0002.history in the archive.

> - Use recovery target settings if set even when standby mode

Yes, this is weird, but it is present in current versions.

> - Do not enter an archive recovery mode if recovery.signal is missing

Agreed.  Perhaps it's OK to use restore_command if a backup_label is
present, but we certainly should not be doing targeted recovery.

> If we want new behavior in recovery, we can change the logic for v13.

Agreed, but it's not at all clear to me how invasive these changes would be.

Regards,
-- 
-David
da...@pgmasters.net




Re: log bind parameter values on error

2019-09-27 Thread Alexey Bashtanov

Hello Alvaro,


I didn't like abusing testlibpq3.c for your new stuff, so I moved it off
to a new file testlibpq5.c.  I cleaned up a few other cosmetics things
about this -- v10 attached.

Thanks for doing this.


I eventually noticed that this patch fails to initialize each param's
textValue to NULL.

I've added the changes to set textValue to NULL for each parameter after
calling makeParamList. I think it's the best option, as it's natural to
populate text representations (albeit with NULLs) in the same loop as we
write the other parameter attributes.

It still requires us to set hasTextValues afterwards, as it doesn't seem
practical to me to make two loops, first nulling them all and then
populating them: if there's a parsing or converting problem midway, it's
out of the feature's scope and is already being logged elsewhere.

Attaching v11, marking as waiting for review.

Best, Alex
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6612f95f9f..81dfee5fe7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6414,6 +6414,23 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
   
  
 
+ 
+  log_parameters_on_error (boolean)
+  
+   log_parameters_on_error configuration parameter
+  
+  
+  
+   
+Controls whether the statement is logged with bind parameter values. 
+It adds some overhead, as postgres will cache textual
+representations of parameter values in memory for all statements,
+even if they eventually do not get logged. The default is
+off. Only superusers can change this setting.
+   
+  
+ 
+
  
   log_statement (enum)
   
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 7e0a041fab..7692ea551b 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -406,7 +406,7 @@ EvaluateParams(PreparedStatement *pstmt, List *params,
 		prm->value = ExecEvalExprSwitchContext(n,
 			   GetPerTupleExprContext(estate),
 			   >isnull);
-
+		prm->textValue = NULL;
 		i++;
 	}
 
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 83337c2eed..92fcee8b19 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -925,6 +925,7 @@ postquel_sub_params(SQLFunctionCachePtr fcache,
 			prm->isnull = fcinfo->args[i].isnull;
 			prm->pflags = 0;
 			prm->ptype = fcache->pinfo->argtypes[i];
+			prm->textValue = NULL;
 		}
 	}
 	else
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 2c0ae395ba..f6b4246b47 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -2458,6 +2458,7 @@ _SPI_convert_params(int nargs, Oid *argtypes,
 			prm->isnull = (Nulls && Nulls[i] == 'n');
 			prm->pflags = PARAM_FLAG_CONST;
 			prm->ptype = argtypes[i];
+			prm->textValue = NULL;
 		}
 	}
 	else
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index cf4387e40f..bf8e3705af 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -45,6 +45,7 @@ makeParamList(int numParams)
 	retval->parserSetup = NULL;
 	retval->parserSetupArg = NULL;
 	retval->numParams = numParams;
+	retval->hasTextValues = false;
 
 	return retval;
 }
@@ -91,6 +92,9 @@ copyParamList(ParamListInfo from)
 			continue;
 		get_typlenbyval(nprm->ptype, , );
 		nprm->value = datumCopy(nprm->value, typByVal, typLen);
+
+		/* We don't copy text values, as no caller needs that at present. */
+		nprm->textValue = NULL;
 	}
 
 	return retval;
@@ -247,6 +251,9 @@ RestoreParamList(char **start_address)
 
 		/* Read datum/isnull. */
 		prm->value = datumRestore(start_address, >isnull);
+
+		/* We don't copy text values, as no caller needs that at present. */
+		prm->textValue = NULL;
 	}
 
 	return paramLI;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e8d8e6f828..ec1e4092f5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -86,6 +86,12 @@
  */
 const char *debug_query_string; /* client-supplied query string */
 
+/*
+ * The top-level portal that the client is immediately working with:
+ * creating, binding, executing, or all at one using simple protocol
+ */
+Portal current_top_portal = NULL;
+
 /* Note: whereToSendOutput is initialized for the bootstrap/standalone case */
 CommandDest whereToSendOutput = DestDebug;
 
@@ -1714,6 +1720,9 @@ exec_bind_message(StringInfo input_message)
 	else
 		portal = CreatePortal(portal_name, false, false);
 
+	Assert(current_top_portal == NULL);
+	current_top_portal = portal;
+
 	/*
 	 * Prepare to copy stuff into the portal's memory context.  We do all this
 	 * copying first, because it could possibly fail (out-of-memory) and we
@@ -1751,6 +1760,9 @@ exec_bind_message(StringInfo input_message)
 	 */
 	if (numParams > 0)
 	{
+		/* GUC value can change, so we remember its state to be consistent */
+		bool need_text_values = 

Re: Minimal logical decoding on standbys

2019-09-27 Thread Amit Khandekar
On Fri, 27 Sep 2019 at 01:57, Robert Haas  wrote:
>
> On Thu, Sep 26, 2019 at 5:14 AM Amit Khandekar  wrote:
> > Actually in some of the conflict-recovery testcases, I am still using
> > wait_for_xmins() so that we could test the xmin values back after we
> > drop the slots. So xmin-related testing is embedded in these recovery
> > tests as well. We can move the wait_for_xmins() function to some
> > common file and then do the split of this file, but then effectively
> > some of the xmin-testing would go into the recovery-related test file,
> > which did not sound sensible to me. What do you say ?
>
> I agree we don't want code duplication, but I think we could reduce
> the code duplication to a pretty small amount with a few cleanups.
>
> I don't think wait_for_xmins() looks very well-designed. It goes to
> trouble of returning a value, but only 2 of the 6 call sites pay
> attention to the returned value.  I think we should change the
> function so that it doesn't return anything and have the callers that
> want a return value call get_slot_xmins() after wait_for_xmins().
Yeah, that can be done.

>
> And then I think we should turn around and get rid of get_slot_xmins()
> altogether. Instead of:
>
> my ($xmin, $catalog_xmin) = get_slot_xmins($master_slot);
> is($xmin, '', "xmin null");
> is($catalog_xmin, '', "catalog_xmin null");
>
> We can write:
>
> my $slot = $node_master->slot($master_slot);
> is($slot->{'xmin'}, '', "xmin null");
> is($slot->{'catalog_xmin'}, '', "catalog xmin null");
>
> ...which is not really any longer or harder to read, but does
> eliminate the need for one function definition.
Agreed.

>
> Then I think we should change wait_for_xmins so that it takes three
> arguments rather than two: $node, $slotname, $check_expr.  With that
> and the previous change, we can get rid of get_node_from_slotname().
>
> At that point, the body of wait_for_xmins() would consist of a single
> call to $node->poll_query_until() or die(), which doesn't seem like
> too much code to duplicate into a new file.

Earlier it used to have 3 params, the same ones you mentioned. I
removed $node for caller convenience.

>
> Looking at it at a bit more, though, I wonder why the recovery
> conflict scenario is even using wait_for_xmins().  It's hard-coded to
> check the state of the master_physical slot, which isn't otherwise
> manipulated by the recovery conflict tests. What's the point of
> testing that a slot which had xmin and catalog_xmin NULL before the
> test started (line 414) and which we haven't changed since still has
> those values at two different points during the test (lines 432, 452)?
> Perhaps I'm missing something here, but it seems like this is just an
> inadvertent entangling of these scenarios with the previous scenarios,
> rather than anything that necessarily needs to be connected together.

In the "Drop slot" test scenario, we are testing that after we
manually drop the slot on standby, the master catalog_xmin should be
back to NULL. Hence, the call to wait_for_xmins().
And in the "Scenario 1 : hot_standby_feedback off", wait_for_xmins()
is called the first time only as a mechanism to ensure that
"hot_standby_feedback = off" has taken effect. At the end of this
test, wait_for_xmins() again is called only to ensure that
hot_standby_feedback = on has taken effect.

Preferably I want wait_for_xmins() to get rid of the $node parameter,
because we can deduce it using slot name. But that requires having
get_node_from_slotname(). Your suggestion was to remove
get_node_from_slotname() and add back the $node param so as to reduce
duplicate code. Instead, how about keeping wait_for_xmins() in
PostgresNode.pm? This way, we won't have duplication, and also we
can get rid of param $node. This is just my preference; if you are
quite inclined to not have get_node_from_slotname(), I will go with
your suggestion.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Removing unneeded self joins

2019-09-27 Thread Konstantin Knizhnik

Some more bug fixes in this patch.



--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b0dcd02..dc8cb9c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2274,7 +2274,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_OID_FIELD(userid);
 	WRITE_BOOL_FIELD(useridiscurrent);
 	/* we don't try to print fdwroutine or fdw_private */
-	/* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */
+	WRITE_NODE_FIELD(unique_for_rels);
+	/* can't print non_unique_for_rels; BMSes aren't Nodes */
 	WRITE_NODE_FIELD(baserestrictinfo);
 	WRITE_UINT_FIELD(baserestrict_min_security);
 	WRITE_NODE_FIELD(joininfo);
@@ -2347,6 +2348,19 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node)
 }
 
 static void
+_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node)
+{
+	WRITE_NODE_TYPE("UNIQUERELINFO");
+
+	WRITE_BITMAPSET_FIELD(outerrelids);
+	if (node->index)
+	{
+		WRITE_OID_FIELD(index->indexoid);
+		WRITE_NODE_FIELD(column_values);
+	}
+}
+
+static void
 _outEquivalenceClass(StringInfo str, const EquivalenceClass *node)
 {
 	/*
@@ -4122,6 +4136,9 @@ outNode(StringInfo str, const void *obj)
 			case T_StatisticExtInfo:
 _outStatisticExtInfo(str, obj);
 break;
+			case T_UniqueRelInfo:
+_outUniqueRelInfo(str, obj);
+break;
 			case T_ExtensibleNode:
 _outExtensibleNode(str, obj);
 break;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 37b257c..3d0d03b 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -3564,7 +3564,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
  * relation_has_unique_index_for
  *	  Determine whether the relation provably has at most one row satisfying
  *	  a set of equality conditions, because the conditions constrain all
- *	  columns of some unique index.
+ *	  columns of some unique index. If index_info is not null, it is set to
+ *	  point to a new UniqueRelInfo containing the index and conditions.
  *
  * The conditions can be represented in either or both of two ways:
  * 1. A list of RestrictInfo nodes, where the caller has already determined
@@ -3585,12 +3586,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
 bool
 relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 			  List *restrictlist,
-			  List *exprlist, List *oprlist)
+			  List *exprlist, List *oprlist,
+			  UniqueRelInfo **unique_info)
 {
 	ListCell   *ic;
 
 	Assert(list_length(exprlist) == list_length(oprlist));
 
+	if (unique_info)
+		*unique_info = NULL;
+
 	/* Short-circuit if no indexes... */
 	if (rel->indexlist == NIL)
 		return false;
@@ -3641,6 +3646,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
 		int			c;
+		List *column_values = NIL;
 
 		/*
 		 * If the index is not unique, or not immediately enforced, or if it's
@@ -3689,6 +3695,9 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 if (match_index_to_operand(rexpr, c, ind))
 {
 	matched = true; /* column is unique */
+	column_values = lappend(column_values, rinfo->outer_is_left
+			? get_leftop(rinfo->clause)
+			: get_rightop(rinfo->clause));
 	break;
 }
 			}
@@ -3731,7 +3740,22 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Matched all key columns of this index? */
 		if (c == ind->nkeycolumns)
+		{
+			if (unique_info)
+			{
+/* This may be called in GEQO memory context. */
+MemoryContext oldContext = MemoryContextSwitchTo(root->planner_cxt);
+*unique_info = makeNode(UniqueRelInfo);
+(*unique_info)->index = ind;
+(*unique_info)->column_values = list_copy(column_values);
+MemoryContextSwitchTo(oldContext);
+			}
+			if (column_values)
+list_free(column_values);
 			return true;
+		}
+		if (column_values)
+			list_free(column_values);
 	}
 
 	return false;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index dc28b56..450d2d4 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -176,7 +176,8 @@ add_paths_to_joinrel(PlannerInfo *root,
 	innerrel,
 	JOIN_INNER,
 	restrictlist,
-	false);
+	false,
+	NULL /*index_info*/);
 			break;
 		default:
 			extra.inner_unique = innerrel_is_unique(root,
@@ -185,7 +186,8 @@ add_paths_to_joinrel(PlannerInfo *root,
 	innerrel,
 	jointype,
 	restrictlist,
-	false);
+	false,
+	NULL /*index_info*/);
 			break;
 	}
 
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c

Re: Append with naive multiplexing of FDWs

2019-09-27 Thread Bruce Momjian
On Wed, Sep  4, 2019 at 06:18:31PM +1200, Thomas Munro wrote:
> Hello,
> 
> A few years back[1] I experimented with a simple readiness API that
> would allow Append to start emitting tuples from whichever Foreign
> Scan has data available, when working with FDW-based sharding.  I used
> that primarily as a way to test Andres's new WaitEventSet stuff and my
> kqueue implementation of that, but I didn't pursue it seriously
> because I knew we wanted a more ambitious async executor rewrite and
> many people had ideas about that, with schedulers capable of jumping
> all over the tree etc.
> 
> Anyway, Stephen Frost pinged me off-list to ask about that patch, and
> asked why we don't just do this naive thing until we have something
> better.  It's a very localised feature that works only between Append
> and its immediate children.  The patch makes it work for postgres_fdw,
> but it should work for any FDW that can get its hands on a socket.
> 
> Here's a quick rebase of that old POC patch, along with a demo.  Since
> 2016, Parallel Append landed, but I didn't have time to think about
> how to integrate with that so I did a quick "sledgehammer" rebase that
> disables itself if parallelism is in the picture.

Yes, sharding has been waiting on parallel FDW scans.  Would this work
for parallel partition scans if the partitions were FDWs?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: WIP/PoC for parallel backup

2019-09-27 Thread Asif Rehman
Hi Robert,

Thanks for the feedback. Please see the comments below:

On Tue, Sep 24, 2019 at 10:53 PM Robert Haas  wrote:

> On Wed, Aug 21, 2019 at 9:53 AM Asif Rehman 
> wrote:
> > - BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA
> > If the parallel option is there, then it will only do pg_start_backup,
> scans PGDATA and sends a list of file names.
>
> So IIUC, this would mean that BASE_BACKUP without PARALLEL returns
> tarfiles, and BASE_BACKUP with PARALLEL returns a result set with a
> list of file names. I don't think that's a good approach. It's too
> confusing to have one replication command that returns totally
> different things depending on whether some option is given.
>

Sure. I will add a separate command (START_BACKUP)  for parallel.


> > - SEND_FILES_CONTENTS (file1, file2,...) - returns the files in given
> list.
> > pg_basebackup will then send back a list of filenames in this command.
> This command will be sent by each worker and that worker will be getting
> the said files.
>
> Seems reasonable, but I think you should just pass one file name and
> use the command multiple times, once per file.
>

I considered this approach initially; however, I adopted the current
strategy to avoid multiple round trips between the server and clients and
to save on query processing time by issuing a single command rather than
multiple ones. Furthermore, fetching multiple files at once will also aid in
supporting the tar format by utilising the existing ReceiveTarFile()
function, making it possible to create a tarball per tablespace per worker.


>
> > - STOP_BACKUP
> > when all workers finish then, pg_basebackup will send STOP_BACKUP
> command.
>
> This also seems reasonable, but surely the matching command should
> then be called START_BACKUP, not BASEBACKUP PARALLEL.
>
> > I have done a basic proof of concept (POC), which is also attached. I
> would appreciate some input on this. So far, I am simply dividing the list
> equally and assigning them to worker processes. I intend to fine tune this
> by taking into consideration file sizes. Further to add tar format support,
> I am considering that each worker process, processes all files belonging to
> a tablespace in its list (i.e. creates and copies tar file), before it
> processes the next tablespace. As a result, this will create tar files that
> are disjointed with respect tablespace data. For example:
>
> Instead of doing this, I suggest that you should just maintain a list
> of all the files that need to be fetched and have each worker pull a
> file from the head of the list and fetch it when it finishes receiving
> the previous file.  That way, if some connections go faster or slower
> than others, the distribution of work ends up fairly even.  If you
> instead pre-distribute the work, you're guessing what's going to
> happen in the future instead of just waiting to see what actually does
> happen. Guessing isn't intrinsically bad, but guessing when you could
> be sure of doing the right thing *is* bad.
>
> If you want to be really fancy, you could start by sorting the files
> in descending order of size, so that big files are fetched before
> small ones.  Since the largest possible file is 1GB and any database
> where this feature is important is probably hundreds or thousands of
> GB, this may not be very important. I suggest not worrying about it
> for v1.
>

Ideally, I would like to support the tar format as well, which would be
much easier to implement when fetching multiple files at once, since that
would allow the existing functionality to be reused without much change.

Your idea of sorting the files in descending order of size seems very
appealing. I think we can do this and have the files divided among the
workers one by one, i.e. the first file in the list goes to worker 1, the
second to worker 2, and so on.
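
Just to illustrate the idea, a minimal sketch of that division could look
like the following (BackupFile, compare_size_desc and the per-worker List
layout are invented names for illustration, not code from the patch):

static void
assign_files_round_robin(BackupFile *files, int nfiles,
						 List **worker_files, int num_workers)
{
	int			i;

	/* largest files first, so the small ones get dealt out last */
	qsort(files, nfiles, sizeof(BackupFile), compare_size_desc);

	for (i = 0; i < nfiles; i++)
	{
		int			worker = i % num_workers;	/* 0 -> worker 0, 1 -> worker 1, ... */

		worker_files[worker] = lappend(worker_files[worker], &files[i]);
	}
}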


>
> > Say, tablespace t1 has 20 files and we have 5 worker processes and
> tablespace t2 has 10. Ignoring all other factors for the sake of this
> example, each worker process will get a group of 4 files of t1 and 2 files
> of t2. Each process will create 2 tar files, one for t1 containing 4 files
> and another for t2 containing 2 files.
>
> This is one of several possible approaches. If we're doing a
> plain-format backup in parallel, we can just write each file where it
> needs to go and call it good. But, with a tar-format backup, what
> should we do? I can see three options:
>
> 1. Error! Tar format parallel backups are not supported.
>
> 2. Write multiple tar files. The user might reasonably expect that
> they're going to end up with the same files at the end of the backup
> regardless of whether they do it in parallel. A user with this
> expectation will be disappointed.
>
> 3. Write one tar file. In this design, the workers have to take turns
> writing to the tar file, so you need some synchronization around that.
> Perhaps you'd have N threads that read and buffer a file, and N+1
> buffers.  Then 

Re: Instability of partition_prune regression test results

2019-09-27 Thread Tom Lane
Amit Langote  writes:
> On Fri, Sep 27, 2019 at 7:25 AM Tom Lane  wrote:
>> I experimented with adjusting explain_parallel_append() to filter
>> more fields, but soon realized that we'd have to filter out basically
>> everything that makes it useful to run EXPLAIN ANALYZE at all.
>> Therefore, I think it's time to give up this testing methodology
>> as a bad idea, and fall back to the time-honored way of running a
>> plain EXPLAIN and then the actual query, as per the attached patch.

> Isn't the point of using ANALYZE here to show that the exec-param
> based run-time pruning is working (those "never executed" strings)?

Hm.  Well, if you want to see those, we could do it as attached.

regards, tom lane

diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 82b68e7..12d6dfc 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -1990,9 +1990,14 @@ explain (analyze, costs off, summary off, timing off) select * from list_part wh
 rollback;
 drop table list_part;
 -- Parallel append
--- Suppress the number of loops each parallel node runs for.  This is because
--- more than one worker may run the same parallel node if timing conditions
--- are just right, which destabilizes the test.
+-- Parallel queries won't necessarily get as many workers as the planner
+-- asked for.  This affects not only the "Workers Launched:" field of EXPLAIN
+-- results, but also row counts and loop counts for parallel scans, Gathers,
+-- and everything in between.  This function filters out the values we can't
+-- rely on to be stable.
+-- This removes enough info that you might wonder why bother with EXPLAIN
+-- ANALYZE at all.  The answer is that we need to see '(never executed)'
+-- notations because that's the only way to verify runtime pruning.
 create function explain_parallel_append(text) returns setof text
 language plpgsql as
 $$
@@ -2003,9 +2008,8 @@ begin
 execute format('explain (analyze, costs off, summary off, timing off) %s',
 $1)
 loop
-if ln like '%Parallel%' then
-ln := regexp_replace(ln, 'loops=\d*',  'loops=N');
-end if;
+ln := regexp_replace(ln, 'Workers Launched: \d+', 'Workers Launched: N');
+ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N');
 return next ln;
 end loop;
 end;
@@ -2052,18 +2056,18 @@ execute ab_q4 (1, 8);
 select explain_parallel_append('execute ab_q4 (2, 2)');
 explain_parallel_append
 ---
- Finalize Aggregate (actual rows=1 loops=1)
-   ->  Gather (actual rows=3 loops=1)
+ Finalize Aggregate (actual rows=N loops=N)
+   ->  Gather (actual rows=N loops=N)
  Workers Planned: 2
- Workers Launched: 2
- ->  Partial Aggregate (actual rows=1 loops=3)
-   ->  Parallel Append (actual rows=0 loops=N)
+ Workers Launched: N
+ ->  Partial Aggregate (actual rows=N loops=N)
+   ->  Parallel Append (actual rows=N loops=N)
  Subplans Removed: 6
- ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=N)
+ ->  Parallel Seq Scan on ab_a2_b1 (actual rows=N loops=N)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
- ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=N)
+ ->  Parallel Seq Scan on ab_a2_b2 (actual rows=N loops=N)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
- ->  Parallel Seq Scan on ab_a2_b3 (actual rows=0 loops=N)
+ ->  Parallel Seq Scan on ab_a2_b3 (actual rows=N loops=N)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
 (13 rows)
 
@@ -2105,42 +2109,42 @@ execute ab_q5 (1, 2, 3);
 select explain_parallel_append('execute ab_q5 (1, 1, 1)');
 explain_parallel_append
 ---
- Finalize Aggregate (actual rows=1 loops=1)
-   ->  Gather (actual rows=3 loops=1)
+ Finalize Aggregate (actual rows=N loops=N)
+   ->  Gather (actual rows=N loops=N)
  Workers Planned: 2
- Workers Launched: 2
- ->  Partial Aggregate (actual rows=1 loops=3)
-   ->  Parallel Append (actual rows=0 loops=N)
+ Workers Launched: N
+ ->  Partial Aggregate (actual rows=N loops=N)
+   ->  Parallel Append (actual rows=N loops=N)
  Subplans Removed: 6
- ->  Parallel Seq Scan on ab_a1_b1 (actual rows=0 loops=N)
+ ->  Parallel Seq Scan on ab_a1_b1 (actual rows=N loops=N)
Filter: ((b < 4) AND (a = ANY 

Re: Standby accepts recovery_target_timeline setting?

2019-09-27 Thread Fujii Masao
On Sat, Sep 28, 2019 at 12:14 AM David Steele  wrote:
>
> Hi Peter,
>
> On 9/27/19 10:36 AM, Peter Eisentraut wrote:
> > On 2019-09-26 23:02, David Steele wrote:
> >> On 9/26/19 4:48 PM, Peter Eisentraut wrote:
> >>
> >>> I don't know if recovery_target_timeline is actually useful to change in
> >>> standby mode.
> >
> > OK, I have committed your original documentation patch.
>
> Thanks, that's a good start.
>
> As Fujii noticed and I have demonstrated upthread, just about any target
> setting can be used in a standby restore.  This matches the behavior of
> prior versions so it's not exactly a regression, but the old docs made
> no claim that standby_mode disabled targeted restore.
>
> In fact, for both PG12 and before, setting a recovery target in standby
> mode causes the cluster to drop out of standby mode.
>
> Also, the presence or absence of recovery.signal does not seem to have
> any effect on how targeted recovery proceeds, except as Fujii has
> demonstrated in [1].
>
> I'm not sure what the best thing to do is.  The docs are certainly
> incorrect, but fixing them would be weird.  What do we say, setting
> targets will exit standby mode?  That's certainly what happens, though.
>
> Also, the fact that target settings are being used when recovery.signal
> is missing is contrary to the docs, and this is a new behavior in PG12.
>  Prior to 12 you could not have target settings without recovery.conf
> being present by definition.
>
> I think, at the very least, the fact that targeted recovery proceeds in
> the absence of recovery.signal represents a bug.

Yes, recovery target settings are used even when neither backup_label
nor recovery.signal exist, i.e., just a crash recovery, in v12. This is
completely different behavior from prior versions.

IMO, since v12 is at RC1 now, it's not a good idea to change the logic to
something new. So at least for v12, we should basically change the recovery
logic so that it behaves in the same way as prior versions. That is,

- Stop the recovery with an error if any recovery target is set in
   crash recovery
- Use recovery target settings if set even when standby mode
- Do not enter an archive recovery mode if recovery.signal is missing

Thought?

If we want new behavior in recovery, we can change the logic for v13.

Regards,

-- 
Fujii Masao




Re: Improving on MAX_CONVERSION_GROWTH

2019-09-27 Thread Robert Haas
On Fri, Sep 27, 2019 at 11:40 AM Andres Freund  wrote:
> Note that one of the additional reasons for the 1GB limit is that it
> protects against int overflows. I'm somewhat unconvinced that that's a
> sensible approach, but ...

It's not crazy. People use 'int' rather casually, just as they use
'palloc' rather casually, without necessarily thinking about what
could go wrong at the edges. I don't have any beef with that as a
general strategy; I just think we should be trying to do better in the
cases where it negatively affects the user experience.

> It's worthwhile to note that additional passes over data are often quite
> expensive; memory latency hasn't shrunk that much in the last decade or
> so. I have frequently seen all the memcpys from one StringInfo/char*
> into another StringInfo show up in profiles.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: fix "Success" error messages

2019-09-27 Thread Bruce Momjian
On Tue, Sep  3, 2019 at 08:38:22AM +0200, Peter Eisentraut wrote:
> On 2019-08-27 08:27, Michael Paquier wrote:
> > Thanks for the new patch, and you are right that pg_checksums has been
> > slacking here.  There is the same issue with pg_verify_checksums in
> > 11.  Not sure that's worth a back-patch though.  Those parts could
> > find their way to v12 easily.
> 
> Committed to master and PG12 with your suggested changes.

This "Success" has happened so many times I think we should tell people
to report any such error message as a bug by emitting a special error
message line.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: errbacktrace

2019-09-27 Thread Alvaro Herrera
On 2019-Sep-13, Alvaro Herrera wrote:

> On 2019-Aug-20, Peter Eisentraut wrote:
> 
> > The memory management of that seems too complicated.  The "extra"
> > mechanism of the check/assign hooks only supports one level of malloc.
> > Using a List seems impossible.  I don't know if you can safely do a
> > malloc-ed array of malloc-ed strings either.
> 
> Here's an idea -- have the check/assign hooks create a different
> representation, which is a single guc_malloc'ed chunk that is made up of
> every function name listed in the GUC, separated by \0.  That can be
> scanned at error time comparing the function name with each piece.
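
To illustrate, the error-time scan over such a chunk can be a simple loop
like the sketch below (the variable name backtrace_function_list and the
extra terminating \0 are assumptions for illustration, not necessarily what
the patch ends up doing):

static bool
matches_backtrace_functions(const char *funcname)
{
	const char *p = backtrace_function_list;

	if (p == NULL || funcname == NULL || funcname[0] == '\0')
		return false;

	while (*p != '\0')
	{
		if (strcmp(funcname, p) == 0)
			return true;
		p += strlen(p) + 1;		/* skip over this name and its terminator */
	}

	return false;
}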

Peter, would you like me to clean this up for commit, or do you prefer
to keep authorship and get it done yourself?

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Improving on MAX_CONVERSION_GROWTH

2019-09-27 Thread Andres Freund
Hi,

On 2019-09-27 10:53:48 -0400, Robert Haas wrote:
> A lot of that is because they hit the 1GB allocation limit, and I
> wonder whether we shouldn't be trying harder to avoid imposing that
> limit in multiple places.

> It's reasonable - and necessary - to impose
> a limit on the size of an individual datum, but when that same limit
> is imposed on other things, like the worst-case size of the encoding
> conversion, the size of an individual message sent via the wire
> protocol, etc., you end up with a situation where users have trouble
> predicting what the behavior is going to be. >=1GB definitely won't
> work, but it'll probably break at some point before you even get that
> far depending on a bunch of complex factors that are hard to
> understand, not really documented, and mostly the result of applying
> 1GB limit to every single memory allocation across the whole backend
> without really thinking about what that does to the user-visible
> behavior.

+1 - that will be a long, piecemeal, project I think... But deciding
that we should do so is a good first step.

Note that one of the additional reasons for the 1GB limit is that it
protects against int overflows. I'm somewhat unconvinced that that's a
sensible approach, but ...

I wonder if we shouldn't make stringinfos use size_t lengths, btw. Only
supporting INT32_MAX (not even UINT32_MAX) seems weird these days. But
we'd presumably have to make it opt-in.


> One approach I think we should consider is, for larger strings,
> actually scan the string and figure out how much memory we're going to
> need for the conversion and then allocate exactly that amount (and
> fail if it's >=1GB). An extra scan over the string is somewhat costly,
> but allocating hundreds of megabytes of memory on the theory that we
> could hypothetically have needed it is costly in different way.

My proposal for this is something like
https://www.postgresql.org/message-id/20190924214204.mav4md77xg5u5wq6%40alap3.anarazel.de
which should avoid the overallocation without a second pass, and
hopefully without losing much efficiency.

It's worthwhile to note that additional passes over data are often quite
expensive; memory latency hasn't shrunk that much in the last decade or
so. I have frequently seen all the memcpys from one StringInfo/char*
into another StringInfo show up in profiles.

Greetings,

Andres Freund




Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-09-27 Thread Alvaro Herrera
On 2019-Sep-27, Alexey Kondratov wrote:

> 1) Maybe I've missed it somewhere in the thread above, but currently
> pg_rewind allows to run itself with -R and --source-pgdata. In that case -R
> option is just swallowed and neither standby.signal, nor
> postgresql.auto.conf is written, which is reasonable though. Should it be
> stated somehow in the docs that -R option always has to go altogether with
> --source-server? Or should pg_rewind notify user that options are
> incompatible and no recovery configuration will be written?

Hmm I think it should throw an error, yeah.  Ignoring options is not
good.
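
For example, a minimal sketch of such a check in pg_rewind's main() might
be (writerecoveryconf here is just a placeholder for whatever flag the
patch uses to track -R):

	if (writerecoveryconf && connstr_source == NULL)
		pg_fatal("no source server (--source-server) specified for -R");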

> + # Now, when pg_rewind apparently succeeded with minimal 
> permissions,
> + # add REPLICATION privilege.  So we could test that new standby
> + # is able to connect to the new master with generated config.
> + $node_standby->psql(
> + 'postgres', "ALTER ROLE rewind_user WITH REPLICATION;");

I think this better use safe_psql.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Standby accepts recovery_target_timeline setting?

2019-09-27 Thread David Steele
Hi Peter,

On 9/27/19 10:36 AM, Peter Eisentraut wrote:
> On 2019-09-26 23:02, David Steele wrote:
>> On 9/26/19 4:48 PM, Peter Eisentraut wrote:
>>
>>> I don't know if recovery_target_timeline is actually useful to change in
>>> standby mode.
> 
> OK, I have committed your original documentation patch.

Thanks, that's a good start.

As Fujii noticed and I have demonstrated upthread, just about any target
setting can be used in a standby restore.  This matches the behavior of
prior versions so it's not exactly a regression, but the old docs made
no claim that standby_mode disabled targeted restore.

In fact, for both PG12 and before, setting a recovery target in standby
mode causes the cluster to drop out of standby mode.

Also, the presence or absence of recovery.signal does not seem to have
any effect on how targeted recovery proceeds, except as Fujii has
demonstrated in [1].

I'm not sure what the best thing to do is.  The docs are certainly
incorrect, but fixing them would be weird.  What do we say, setting
targets will exit standby mode?  That's certainly what happens, though.

Also, the fact that target settings are being used when recovery.signal
is missing is contrary to the docs, and this is a new behavior in PG12.
 Prior to 12 you could not have target settings without recovery.conf
being present by definition.

I think, at the very least, the fact that targeted recovery proceeds in
the absence of recovery.signal represents a bug.

-- 
-David
da...@pgmasters.net

[1]
https://www.postgresql.org/message-id/CAHGQGwEYYg_Ng%2B03FtZczacCpYgJ2Pn%3DB_wPtWF%2BFFLYDgpa1g%40mail.gmail.com




Re: Improving on MAX_CONVERSION_GROWTH

2019-09-27 Thread Robert Haas
On Tue, Sep 24, 2019 at 5:15 PM Tom Lane  wrote:
> In any case, it seems likely that we could end up with a
> multiplier of 1, 2, or 3 rather than 4 in just about every
> case of practical interest.  That sure seems like a win
> when converting long strings.

+1. From what I've seen, I'd say this is a significant practical
problem for people who are trying to store large blobs of data in the
database.

A lot of that is because they hit the 1GB allocation limit, and I
wonder whether we shouldn't be trying harder to avoid imposing that
limit in multiple places. It's reasonable - and necessary - to impose
a limit on the size of an individual datum, but when that same limit
is imposed on other things, like the worst-case size of the encoding
conversion, the size of an individual message sent via the wire
protocol, etc., you end up with a situation where users have trouble
predicting what the behavior is going to be. >=1GB definitely won't
work, but it'll probably break at some point before you even get that
far depending on a bunch of complex factors that are hard to
understand, not really documented, and mostly the result of applying
1GB limit to every single memory allocation across the whole backend
without really thinking about what that does to the user-visible
behavior.

Now, that's not to say we should abandon MaxAllocSize, which I agree
serves as a useful backstop. But IMHO it would be smart to start with
the desired user-facing behavior -- we want to support datums up to X
size -- and then consider how we can get there while maintaining
MaxAllocSize as a general-purpose backstop. Our current strategy seems
to be mostly the reverse: write the code the way that feels natural,
enforce MaxAllocSize everywhere, and if that breaks things for a user,
well then that means - by definition - that the user was trying to do
something we don't support.

One approach I think we should consider is, for larger strings,
actually scan the string and figure out how much memory we're going to
need for the conversion and then allocate exactly that amount (and
fail if it's >=1GB). An extra scan over the string is somewhat costly,
but allocating hundreds of megabytes of memory on the theory that we
could hypothetically have needed it is costly in different way. Memory
is more abundant today than it's ever been, but there are still plenty
of systems where a couple of extra allocations in the multi-hundred-MB
range can make the whole thing fall over. And even if it doesn't make
the whole thing fall over, the CPU efficiency of avoiding an extra
pass over the string really ought to be compared with the memory
efficiency of allocating extra storage.  Getting down from a
worst-case multiple of 4 to 2 is a great idea, but it still means that
converting a 100MB string will allocate 200MB when what you need will
very often be between 100MB and 105MB.  That's not an insignificant
cost, even though it's much better than allocating 400MB.
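
To make the two-pass idea concrete, a sketch could look like the following
(conv_measure and conv_convert stand for hypothetical per-conversion
callbacks; this is not existing code):

static char *
convert_with_exact_allocation(const char *src, int srclen)
{
	Size		dstlen = conv_measure(src, srclen); /* pass 1: measure */
	char	   *dst;

	if (dstlen >= MaxAllocSize)
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("out of memory"),
				 errdetail("String of %d bytes is too long for encoding conversion.",
						   srclen)));

	dst = palloc(dstlen + 1);					/* pass 2: exact allocation */
	conv_convert(src, srclen, dst);
	dst[dstlen] = '\0';

	return dst;
}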

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Standby accepts recovery_target_timeline setting?

2019-09-27 Thread Peter Eisentraut
On 2019-09-26 23:02, David Steele wrote:
> On 9/26/19 4:48 PM, Peter Eisentraut wrote:
>> On 2019-09-25 22:21, David Steele wrote:
>>> While testing against PG12 I noticed the documentation states that
>>> recovery targets are not valid when standby.signal is present.
>>>
>>> But surely the exception is recovery_target_timeline?  My testing
>>> confirms that this works just as in prior versions with standy_mode=on.
>>
>> Or maybe we should move recovery_target_timeline to a different section?
>>   But which one?
> 
> Not sure.  I think just noting it as an exception is OK, if it is the 
> only exception.  But currently that does not seem to be the case.
> 
>> I don't know if recovery_target_timeline is actually useful to change in
>> standby mode.

OK, I have committed your original documentation patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: abort-time portal cleanup

2019-09-27 Thread Robert Haas
On Tue, Sep 24, 2019 at 4:45 PM Andres Freund  wrote:
> Hm. Doing that cleanup requires digging through all the portals etc. I'd
> rather rely on less state being correct than more during FATAL
> processing.

I agree with the principle, but the advantage of removing the hash
table entries is that it protects against any other code that might
try to access the portal machinery later; I thought that was
worthwhile enough to justify doing this here. I don't feel
super-strongly about it, but I do still like that approach.

> > The second wrinkle is that there might be an active portal.  Apart
> > from the FATAL case already mentioned, I think the only way this can
> > happen is some C code that purposefully calls AbortTransaction()
> > in the middle of executing a command.  It can't be an ERROR, because
> > then the portal would be marked as failed before we get here, and it
> > can't be an explicit ROLLBACK, because as noted above, that case was
> > changed 15 years ago.  It's got to be some other case where C code
> > calls AbortTransaction() voluntarily in the middle of a statement. For
> > over a decade, there were no cases at all of this type, but the code
> > in this function catered to hypothetical cases by marking the portal
> > failed. By 2016, Noah had figured out that this was bogus, and that
> > any future cases would likely require different handling, but Tom and
> > I shouted him down:
>
> Hm. But wouldn't doing so run into a ton of problems anyway? I mean we
> need to do a ton of checks and special case hangups to make
> CommitTransactionCommand();StartTransactionCommand(); work for VACUUM,
> CIC, ...

Right... those are the kinds of cases I'm talking about here, just for
abort rather than commit.

> The cases where one can use AbortTransaction() (via
> AbortCurrentTransaction() presumably) are ones where either there's no
> surrounding code relying on the transaction (e.g. autovacuum,
> postgres.c), or where special care has been taken with portals
> (e.g. _SPI_rollback()).  We didn't have the pin mechanism back then, so
> I think even if we accept your/Tom's reasoning from back then (I don't
> really), it's outdated now that the pin mechanism exists.

It isn't, actually.  To respond to this and also your question below
about why I'm looking at active portal rather than pinned portals, try
adding this debugging code to AtAbort_Portals:

+if (portal->status == PORTAL_ACTIVE)
+elog(NOTICE, "this portal is ACTIVE and %spinned",
+ portal->portalPinned ? "" : "NOT ");

Then run 'make -C src/pl/plpgsql check' and check
src/pl/plpgsql/src/regression.diffs and you'll see a whole lot of
this:

+NOTICE:  this portal is ACTIVE and NOT pinned

The PLs pin the portals they generate internally, but they don't force
the surrounding portal in which the toplevel query is executing to be
pinned.  AFAICT, pinning is mostly guarding against explicit
user-initiated drops of portals that were automatically generated by a
PL, whereas the portal's state is about tracking what the system is
doing with the portal.

(I think this could be a lot better documented than it is, but looking
at the commit history, I'm fairly sure that's what is happening here.)

> I'd be happy if we added some defenses against such bogus cases being
> introduced (i.e. erroring out if we encounter an active portal during
> abort processing).

Erroring out during error handling is probably a bad idea, but also, see above.

> > Stepping back a bit, stored procedures are a good example of a command
> > that uses multiple transactions internally. We have a few others, such
> > as VACUUM, but at present, that case only commits transactions
> > internally; it does not roll them back internally.  If it did, it
> > would want the same thing that procedures want, namely, to leave the
> > active portal alone. It doesn't quite work to leave the active portal
> > completely alone, because the portal has got a pointer to a
> > ResourceOwner which is about to be freed by end-of-transaction
> > cleanup;
>
> Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(),
> which, via HoldPortal(), sets portal->resowner to = NULL.

Right, but it's still necessary for AtAbort_Portals() to do the same
thing, again because the top-level portal isn't pinned.  If you apply
my patch, comment out the line that does portal->resowner = NULL; for
an active portal, and run make -C src/pl/plpgsql check, it will seg
fault inside exec_simple_query -> PortalDrop -> ResourceOwnerRelease
-> etc.
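
In other words, the active-portal case inside AtAbort_Portals()'s loop
needs to do roughly this (simplified sketch, not the exact patch code):

	if (portal->status == PORTAL_ACTIVE)
	{
		/*
		 * The transaction's ResourceOwner is about to be released, so
		 * the portal must stop pointing at it; otherwise leave the
		 * active portal alone for the procedure/SPI code to handle.
		 */
		portal->resowner = NULL;
		continue;
	}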

> > The current code also calls PortalReleaseCachedPlan in this case; I'm
> > not 100% certain whether that's appropriate or not.
>
> I think it's appropriate, because we cannot guarantee that the plan is
> still usable. Besides normal plan invalidation issues, the catalog
> contents the plan might rely on might have only existed in the aborted
> transaction - which seems like a fatal problem. That's why holding
> portals persists the 

Re: Support for jsonpath .datetime() method

2019-09-27 Thread Nikita Glukhov

On 25.09.2019 22:55, Alexander Korotkov wrote:


On Mon, Sep 23, 2019 at 10:05 PM Alexander Korotkov
 wrote:

I've reordered the patchset.  I moved the most debatable patch, which
introduces  and RR and changes parsing of YYY, YY and Y to the
end.  I think we have enough of time in this release cycle to decide
whether we want this.

Patches 0001-0005 looks quite mature for me.  I'm going to push this
if no objections.  After pushing them, I'm going to start discussion
related to RR, YY and friends in separate thread.

Pushed.  Remaining patch is attached.  I'm going to start the separate
thread with its detailed explanation.


Attached is a patch refactoring compareDatetime() to address Tom Lane's
complaints in [1]:
 * extracted four subroutines for type conversions
 * extracted subroutine for error reporting
 * added default cases to all switches
 * have_error flag is expected to be not-NULL always
 * fixed errhint() message style

[1] https://www.postgresql.org/message-id/32308.1569455803%40sss.pgh.pa.us

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>From 81d8de2f1d0e0d4ec44729d3d2976b1e63834b14 Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Thu, 26 Sep 2019 17:52:40 +0300
Subject: [PATCH] Refactor jsonpath's compareDatime()

---
 src/backend/utils/adt/jsonpath_exec.c| 181 ++-
 src/test/regress/expected/jsonb_jsonpath.out |  30 ++---
 2 files changed, 110 insertions(+), 101 deletions(-)

diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index a35f718..7e540e3 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -2298,7 +2298,7 @@ compareItems(int32 op, JsonbValue *jb1, JsonbValue *jb2, bool useTz)
 			break;
 		case jbvDatetime:
 			{
-bool		have_error = false;
+bool		have_error;
 
 cmp = compareDatetime(jb1->val.datetime.value,
 	  jb1->val.datetime.typid,
@@ -2571,15 +2571,72 @@ wrapItemsInArray(const JsonValueList *items)
 	return pushJsonbValue(, WJB_END_ARRAY, NULL);
 }
 
+/* Check if the timezone required for casting from type1 to type2 is used */
+static void
+checkTimezoneIsUsedForCast(bool useTz, const char *type1, const char *type2)
+{
+	if (!useTz)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot convert value from %s to %s without timezone usage",
+		type1, type2),
+ errhint("Use *_tz() function for timezone support.")));
+}
+
+/* Convert date datum to timestamp datum */
+static Datum
+castDateToTimestamp(Datum dt, bool *have_error)
+{
+	Timestamp	ts = date2timestamp_opt_error(DatumGetDateADT(dt), have_error);
+
+	return TimestampGetDatum(ts);
+}
+
+/* Convert date datum to timestamptz datum */
+static Datum
+castDateToTimestampTz(Datum date, bool useTz, bool *have_error)
+{
+	TimestampTz tstz;
+
+	checkTimezoneIsUsedForCast(useTz, "date", "timestamptz");
+	tstz = date2timestamptz_opt_error(DatumGetDateADT(date), have_error);
+
+	return TimestampTzGetDatum(tstz);
+}
+
+/* Convert time datum to timetz datum */
+static Datum
+castTimeToTimeTz(Datum time, bool useTz)
+{
+	checkTimezoneIsUsedForCast(useTz, "time", "timetz");
+
+	return DirectFunctionCall1(time_timetz, time);
+}
+
+/* Convert timestamp datum to timestamptz datum */
+static Datum
+castTimestampToTimestampTz(Timestamp ts, bool useTz, bool *have_error)
+{
+	TimestampTz tstz;
+
+	checkTimezoneIsUsedForCast(useTz, "timestamp", "timestamptz");
+	tstz = timestamp2timestamptz_opt_error(DatumGetTimestamp(ts), have_error);
+
+	return TimestampTzGetDatum(tstz);
+}
+
 /*
  * Cross-type comparison of two datetime SQL/JSON items.  If items are
- * uncomparable, 'error' flag is set.
+ * uncomparable or there is an error during casting, 'have_error' flag is set.
+ * If the cast requires timezone and it is not used, then hard error is thrown.
  */
 static int
 compareDatetime(Datum val1, Oid typid1, Datum val2, Oid typid2,
 bool useTz, bool *have_error)
 {
-	PGFunction cmpfunc = NULL;
+	PGFunction cmpfunc;
+
+	*have_error = false;
 
 	switch (typid1)
 	{
@@ -2588,35 +2645,26 @@ compareDatetime(Datum val1, Oid typid1, Datum val2, Oid typid2,
 			{
 case DATEOID:
 	cmpfunc = date_cmp;
-
 	break;
 
 case TIMESTAMPOID:
-	val1 = TimestampGetDatum(date2timestamp_opt_error(DatumGetDateADT(val1), have_error));
-	if (have_error && *have_error)
-		return 0;
+	val1 = castDateToTimestamp(val1, have_error);
 	cmpfunc = timestamp_cmp;
-
 	break;
 
 case TIMESTAMPTZOID:
-	if (!useTz)
-		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot convert value from %s to %s without timezone usage",
-		"date", "timestamptz"),
- errhint("use *_tz() function for timezone support")));
-	val1 = TimestampTzGetDatum(date2timestamptz_opt_error(DatumGetDateADT(val1), have_error));
-	if (have_error && *have_error)

Re: PostgreSQL12 and older versions of OpenSSL

2019-09-27 Thread Michael Paquier
On Fri, Sep 27, 2019 at 03:50:57PM +0200, Peter Eisentraut wrote:
> On 2019-09-27 03:51, Michael Paquier wrote:
>> Your patch does not issue a ereport(LOG/FATAL) in the event of a
>> failure with SSL_CTX_set_max_proto_version(), which is something done
>> when ssl_protocol_version_to_openssl()'s result is -1.  Wouldn't it be
>> better to report that properly to the user?
> 
> Our SSL_CTX_set_max_proto_version() is a reimplementation of a function
> that exists in newer versions of OpenSSL, so it has a specific error
> behavior.  Our implementation should probably not diverge from it too much.

I agree with this point.  Now my argument is about logging LOG or
FATAL within be_tls_init() after the two OpenSSL functions (or our
wrappers) SSL_CTX_set_min/max_proto_version are called.
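
Concretely, something along these lines in be_tls_init() is what I have in
mind (ssl_ver_min stands for the already-converted GUC value; the error
path is simplified):

	if (SSL_CTX_set_min_proto_version(context, ssl_ver_min) != 1)
	{
		ereport(isServerStart ? FATAL : LOG,
				(errmsg("could not set minimum SSL protocol version")));
		goto error;
	}
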
--
Michael




Re: PostgreSQL12 and older versions of OpenSSL

2019-09-27 Thread Peter Eisentraut
On 2019-09-27 03:51, Michael Paquier wrote:
> I have tested compilation of REL_12_STABLE with the top of OpenSSL
> 0.9.8, 1.0.0, 1.0.1, 1.0.2, 1.1.0 and 1.1.1.  Our SSL tests also pass
> in all the setups I have tested.

great

> Your patch does not issue a ereport(LOG/FATAL) in the event of a
> failure with SSL_CTX_set_max_proto_version(), which is something done
> when ssl_protocol_version_to_openssl()'s result is -1.  Wouldn't it be
> better to report that properly to the user?

Our SSL_CTX_set_max_proto_version() is a reimplementation of a function
that exists in newer versions of OpenSSL, so it has a specific error
behavior.  Our implementation should probably not diverge from it too much.

> Some more nits about the patch I have.  Would it be worth copying the
> comment from min_proto_version() to SSL_CTX_set_max_proto_version()?
> I would add a newline before the comment block as well.

ok

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Cleanup code related to OpenSSL <= 0.9.6 in fe/be-secure-openssl.c

2019-09-27 Thread Peter Eisentraut
On 2019-09-27 05:23, Michael Paquier wrote:
> While reviewing the area, I have bumped into the following bit in
> fe-secure-openssl.c and be-secure-openssl.c:
> -/* OpenSSL 0.96 does not support X509_V_FLAG_CRL_CHECK */
> -#ifdef X509_V_FLAG_CRL_CHECK
> [...  stuff ...]
> 
> I think that this did not get removed because of the incorrect version
> number in the comment, which should have been 0.9.6 from the start.
> 
> Anyway, let's clean up this code as per the attached.  This set of
> flags indeed exists since 0.9.7.  Any thoughts or objections?

Yes, it seems OK to clean this up in master.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Shared Memory: How to use SYSV rather than MMAP ?

2019-09-27 Thread Alvaro Herrera
Hi Tony,

On 2019-Sep-27, REIX, Tony wrote:

> Hello Thomas, Alvaro,
> 
> Sorry for the late answer, I missed your message of September 10. (I'm 
> working on several different projects in parallel.)
> Let me talk with Sylvie ASAP and see when I will be able to test it, probably 
> next week, Tuesday. Is that OK for you?

Sure -- I'm inclined to push this patch in state Needs Review to the
November commitfest in this case.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgbench - allow to create partitioned tables

2019-09-27 Thread Alvaro Herrera
On 2019-Sep-27, Amit Kapila wrote:

> Thanks, Alvaro, both seem like good suggestions to me.  However, there
> are a few more things where your feedback can help:
> a.  With new options, we will partition pgbench_accounts and the
> reason is that it is the largest table.  Do we need to be
> explicit about the reason in docs?

Hmm, I would document what it is that we do, and stop there without
explaining why.  Unless you have concrete reasons to want the reason
documented?

> b.  I am not comfortable with test modification in
> 001_pgbench_with_server.pl.  Basically, it doesn't seem like we should
> modify the existing test to use non-default tablespaces as part of
> this patch.  It might be a good idea in general, but I am not sure
> doing as part of this patch is a good idea as there is no big value
> addition with that modification as far as this patch is concerned.
> OTOH, as such there is no harm in testing with non-default
> tablespaces.

Yeah, this change certainly is out of place in this patch.

> The other thing is that the query used in patch to fetch partition
> information seems correct to me, but maybe there is a better way to
> get that information.

I hadn't looked at that, but yeah it seems that it should be using
pg_partition_tree().

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Attempt to consolidate reading of XLOG page

2019-09-27 Thread Alvaro Herrera
On 2019-Sep-26, Antonin Houska wrote:

> One comment on the remaining part of the series:
> 
> Before this refactoring, the walsender.c:XLogRead() function contained these
> lines
> 
>/*
> * After reading into the buffer, check that what we read was valid. 
> We do
> * this after reading, because even though the segment was present 
> when we
> * opened it, it might get recycled or removed while we read it. The
> * read() succeeds in that case, but the data we tried to read might
> * already have been overwritten with new WAL records.
> */
>XLByteToSeg(startptr, segno, segcxt->ws_segsize);
>CheckXLogRemoved(segno, ThisTimeLineID);
> 
> but they don't fit into the new, generic implementation, so I copied these
> lines to the two places right after the call of the new XLogRead(). However I
> was not sure if ThisTimeLineID was ever correct here. It seems the original
> walsender.c:XLogRead() implementation did not update ThisTimeLineID (and
> therefore neither the new callback WalSndSegmentOpen() does), so both
> logical_read_xlog_page() and XLogSendPhysical() could read the data from
> another (historic) timeline. I think we should check the segment we really
> read data from:
> 
>   CheckXLogRemoved(segno, sendSeg->ws_tli);

Hmm, okay.  I hope we can get rid of ThisTimeLineID one day.

You placed the errinfo in XLogRead's stack rather than its callers' ...
I don't think that works, because as soon as XLogRead returns that
memory is no longer guaranteed to exist.  You need to allocate the
struct in the callers stacks and pass its address to XLogRead.  XLogRead
can return NULL if everything's okay or the pointer to the errinfo
struct.
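
Something like this calling convention is what I mean (XLogReadError and
the exact XLogRead() argument list are placeholders, not the patch's real
names):

	XLogReadError	errinfo;	/* caller-owned, so it outlives XLogRead() */
	XLogReadError  *err;

	err = XLogRead(buf, segcxt, startptr, count, &errinfo);
	if (err != NULL)
		report_xlog_read_error(err);	/* hypothetical reporting helper */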

I've been wondering if it's really necessary to pass 'seg' to the
openSegment() callback.  Only walsender wants that, and it seems ...
weird.  Maybe that's not something for this patch series to fix, but it
would be good to find a more decent way to do the TLI switch at some
point.

> + /*
> +  * If the function is called by the XLOG reader, the 
> reader will
> +  * eventually set both "ws_segno" and "ws_off", however 
> the XLOG
> +  * reader is not necessarily involved. Furthermore, we 
> need to set
> +  * the current values for this function to work.
> +  */
> + seg->ws_segno = nextSegNo;
> + seg->ws_off = 0;

Why do we leave this responsibility to ReadPageInternal?  Wouldn't it
make more sense to leave XLogRead be always responsible for setting
these correctly, and remove those lines from ReadPageInternal?  (BTW "is
called by the XLOG reader" is a bit strange in code that appears in
xlogreader.c).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_wal/RECOVERYHISTORY file remains after archive recovery

2019-09-27 Thread Masahiko Sawada
On Fri, Sep 27, 2019 at 8:41 PM Fujii Masao  wrote:
>
> On Fri, Sep 27, 2019 at 7:16 PM Michael Paquier  wrote:
> >
> > On Fri, Sep 27, 2019 at 05:58:21PM +0900, Fujii Masao wrote:
> > > So you think that it's better to remove them just after 
> > > writeTimeLineHistory()?
> > > Per the following Sawada-san's comment, I was thinking that idea is 
> > > basically
> > > not good. And, RemoveNonParentXlogFiles() also removes garbage files from
> > > pg_wal. It's simpler if similar codes exist near. Thought?
> >
> > Sawada-san's argument of upthread is that it is not good to put
> > exitArchiveRecovery() after writeTimeLineHistory(), which is what
> > cbc55da has done per the reasons mentioned in the commit log, and we
> > should not change that.
> >
> > My argument is we know that RECOVERYXLOG and RECOVERYHISTORY are not
> > needed anymore at this stage of recovery, hence we had better remove
> > them as soon as possible.  I am not convinced that it is a good idea
> > to move the cleanup close to RemoveNonParentXlogFiles().  First, this
> > is an entirely different part of the logic where the startup process
> > has already switched to a new timeline.  Second, we add more steps
> > between the moment the two files are not needed and the moment they
> > are removed, so any failure in-between would cause those files to
> > still be there (we cannot say either that we will not manipulate this
> > code later on) and we don't want those files to lie around.  So,
> > mentioning that we do the cleanup just after writeTimeLineHistory()
> > because we don't need them anymore is more consistent with what has
> > been done for ages for the end of archive recovery, something that
> > cbc55da unfortunately broke.
>
> Ok, I have no objection to remove them just after writeTimeLineHistory().
>

I had abandoned moving the removal code to between
writeTimeLineHistory() and the timeline switch because it expands that
window, but since unlink itself completes within a very short time it
should not be much of a problem.

Attached the updated patch that just moves the removal code.

Regards,

--
Masahiko Sawada
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c69eb6dd7..dcfb481f38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5541,17 +5541,6 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo, wal_segment_size);
 	XLogArchiveCleanup(xlogfname);
 
-	/*
-	 * Since there might be a partial WAL segment named RECOVERYXLOG, get rid
-	 * of it.
-	 */
-	snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
-	unlink(recoveryPath);		/* ignore any error */
-
-	/* Get rid of any remaining recovered timeline-history file, too */
-	snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYHISTORY");
-	unlink(recoveryPath);		/* ignore any error */
-
 	/*
 	 * Remove the signal files out of the way, so that we don't accidentally
 	 * re-enter archive recovery mode in a subsequent crash.
@@ -7475,6 +7464,17 @@ StartupXLOG(void)
 		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 			 EndRecPtr, reason);
+
+		/*
+		 * Since there might be a partial WAL segment named RECOVERYXLOG, get rid
+		 * of it.
+		 */
+		snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYXLOG");
+		unlink(recoveryPath);		/* ignore any error */
+
+		/* Get rid of any remaining recovered timeline-history file, too */
+		snprintf(recoveryPath, MAXPGPATH, XLOGDIR "/RECOVERYHISTORY");
+		unlink(recoveryPath);		/* ignore any error */
 	}
 
 	/* Save the selected TimeLineID in shared memory, too */


Re: pg_upgrade fails with non-standard ACL

2019-09-27 Thread Bruce Momjian
On Fri, Sep 27, 2019 at 04:22:15PM +0900, Michael Paquier wrote:
> On Thu, Sep 26, 2019 at 04:19:38PM -0400, Bruce Momjian wrote:
> > On Thu, Sep 26, 2019 at 05:16:19PM -0300, Alvaro Herrera wrote:
> >> On 2019-Sep-26, Bruce Momjian wrote:
> >>> Well, right now, pg_upgrade --check succeeds, but the upgrade fails.  I
> >>> am proposing, at a minimum, that pg_upgrade --check fails in such cases,
> >> 
> >> Agreed, that should be a minimum fix.
> > 
> > Yes.
> 
> Agreed as well here.  At least the latest patch proposed has the merit
> to track automatically functions not existing anymore from the
> source's version to the target's version, so patching --check offers a
> good compromise.  Bruce, are you planning to look more at the patch
> posted at [1]?

I did look at it.  It has some TODO items listed in it still, and some
C++ comments, but if everyone likes it I can apply it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)

2019-09-27 Thread Alexey Kondratov

On 27.09.2019 6:27, Paul Guo wrote:



Secondarily, I see no reason to test connstr_source rather than just
"conn" in the other patch; doing it the other way is more natural,
since
it's that thing that's tested as an argument.

pg_rewind.c: Please put the new #include line keeping the alphabetical
order.


Agreed to the above suggestions. I attached the v9.



I went through the remaining two patches and they seem to be very clear 
and concise. However, there are two points I could complain about:


1) Maybe I've missed it somewhere in the thread above, but currently 
pg_rewind allows running with both -R and --source-pgdata. In that case 
the -R option is silently swallowed and neither standby.signal nor 
postgresql.auto.conf is written, which is reasonable in itself. Should it 
be stated in the docs that the -R option always has to go together with 
--source-server? Or should pg_rewind notify the user that the options are 
incompatible and that no recovery configuration will be written? (A minimal 
sketch of such a check appears below, after point 2.)


2) Are you going to leave the -R option completely without TAP tests? 
Attached is a small patch, which tests the -R option along with the existing 
'remote' case. If needed it may be split into two separate cases. First, 
it tests that pg_rewind is able to succeed with minimal permissions 
according to Michael's patch d9f543e [1]. Next, it checks for the presence 
of standby.signal and adds the REPLICATION permission to rewind_user to test 
that the new standby is able to start with the generated recovery configuration.


[1] 
https://github.com/postgres/postgres/commit/d9f543e9e9be15f92abdeaf870e57ef289020191
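
Regarding point 1), here is a minimal sketch of what an explicit
incompatibility check could look like. The variable names are hypothetical;
pg_rewind's real option handling may differ.

#include <stdio.h>
#include <stdlib.h>

static int   writerecoveryconf = 1;    /* -R was given */
static char *connstr_source = NULL;    /* --source-server value, if any */

static void
check_option_compatibility(void)
{
    if (writerecoveryconf && connstr_source == NULL)
    {
        fprintf(stderr,
                "pg_rewind: -R can only be used together with --source-server\n");
        exit(1);
    }
}

int
main(void)
{
    check_option_compatibility();   /* exits with an error in this setup */
    printf("options are compatible\n");
    return 0;
}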



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

>From 8c607794f259cd4dec0fa6172b69d62e6468bee3 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Fri, 27 Sep 2019 14:30:57 +0300
Subject: [PATCH v9 3/3] Test new standby start with generated config during
 pg_rewind remote

---
 src/bin/pg_rewind/t/001_basic.pl   |  2 +-
 src/bin/pg_rewind/t/002_databases.pl   |  2 +-
 src/bin/pg_rewind/t/003_extrafiles.pl  |  2 +-
 src/bin/pg_rewind/t/004_pg_xlog_symlink.pl |  2 +-
 src/bin/pg_rewind/t/RewindTest.pm  | 11 ++-
 5 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index 115192170e..c3293e93df 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -1,7 +1,7 @@
 use strict;
 use warnings;
 use TestLib;
-use Test::More tests => 10;
+use Test::More tests => 11;
 
 use FindBin;
 use lib $FindBin::RealBin;
diff --git a/src/bin/pg_rewind/t/002_databases.pl b/src/bin/pg_rewind/t/002_databases.pl
index f1eb4fe1d2..1db534c0dc 100644
--- a/src/bin/pg_rewind/t/002_databases.pl
+++ b/src/bin/pg_rewind/t/002_databases.pl
@@ -1,7 +1,7 @@
 use strict;
 use warnings;
 use TestLib;
-use Test::More tests => 6;
+use Test::More tests => 7;
 
 use FindBin;
 use lib $FindBin::RealBin;
diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl
index c4040bd562..f4710440fc 100644
--- a/src/bin/pg_rewind/t/003_extrafiles.pl
+++ b/src/bin/pg_rewind/t/003_extrafiles.pl
@@ -3,7 +3,7 @@
 use strict;
 use warnings;
 use TestLib;
-use Test::More tests => 4;
+use Test::More tests => 5;
 
 use File::Find;
 
diff --git a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
index ed1ddb6b60..639eeb9c91 100644
--- a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
+++ b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl
@@ -14,7 +14,7 @@ if ($windows_os)
 }
 else
 {
-	plan tests => 4;
+	plan tests => 5;
 }
 
 use FindBin;
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 68b6004e94..fcc48cb1d9 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -266,9 +266,18 @@ sub run_pg_rewind
 			[
 'pg_rewind',  "--debug",
 "--source-server",$standby_connstr,
-"--target-pgdata=$master_pgdata", "--no-sync"
+"--target-pgdata=$master_pgdata", "-R", "--no-sync"
 			],
 			'pg_rewind remote');
+
+		# Check that standby.signal has been created.
+		ok(-e "$master_pgdata/standby.signal");
+
+		# Now, when pg_rewind apparently succeeded with minimal permissions,
+		# add REPLICATION privilege.  So we could test that new standby
+		# is able to connect to the new master with generated config.
+		$node_standby->psql(
+			'postgres', "ALTER ROLE rewind_user WITH REPLICATION;");
 	}
 	else
 	{
-- 
2.17.1



Re: Built-in connection pooler

2019-09-27 Thread Konstantin Knizhnik
New version of the builtin connection pooler, fixing the handling of 
extended-protocol messages.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/spi/refint.c b/contrib/spi/refint.c
index adf0490..5c2095f 100644
--- a/contrib/spi/refint.c
+++ b/contrib/spi/refint.c
@@ -11,6 +11,7 @@
 
 #include "commands/trigger.h"
 #include "executor/spi.h"
+#include "storage/proc.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -93,6 +94,8 @@ check_primary_key(PG_FUNCTION_ARGS)
 	else
 		tuple = trigdata->tg_newtuple;
 
+	MyProc->is_tainted = true;
+
 	trigger = trigdata->tg_trigger;
 	nargs = trigger->tgnargs;
 	args = trigger->tgargs;
@@ -284,6 +287,8 @@ check_foreign_key(PG_FUNCTION_ARGS)
 		/* internal error */
 		elog(ERROR, "check_foreign_key: cannot process INSERT events");
 
+	MyProc->is_tainted = true;
+
 	/* Have to check tg_trigtuple - tuple being deleted */
 	trigtuple = trigdata->tg_trigtuple;
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c91e3e1..df0bcaf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -719,6 +719,169 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_sessions (integer)
+  
+   max_sessions configuration parameter
+  
+  
+  
+
+  The maximum number of client sessions that can be handled by
+  one connection proxy when session pooling is enabled.
+  This parameter does not add any memory or CPU overhead, so
+  specifying a large max_sessions value
+  does not affect performance.
+  If the max_sessions limit is reached new connections are not accepted.
+
+
+  The default value is 1000. This parameter can only be set at server start.
+   
+  
+ 
+
+ 
+  session_pool_size (integer)
+  
+   session_pool_size configuration parameter
+  
+  
+  
+
+  Enables session pooling and defines the maximum number of
+  backends that can be used by client sessions for each database/user combination.
+  Launched non-tainted backends are never terminated even if there are no active sessions.
+  Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements.
+  Tainted backend can server only one client.
+
+
+  The default value is 10, so up to 10 backends will serve each database,
+
+  
+ 
+
+ 
+  proxy_port (integer)
+  
+   proxy_port configuration parameter
+  
+  
+  
+
+  Sets the TCP port for the connection pooler.
+  Clients connected to main "port" will be assigned dedicated backends,
+  while client connected to proxy port will be connected to backends through proxy which
+  performs transaction level scheduling. 
+   
+
+  The default value is 6543.
+
+  
+ 
+
+ 
+  connection_proxies (integer)
+  
+   connection_proxies configuration parameter
+  
+  
+  
+
+  Sets number of connection proxies.
+  Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing).
+  Each proxy launches its own subset of backends.
+  So maximal number of non-tainted backends is  session_pool_size*connection_proxies*databases*roles.
+   
+
+  The default value is 0, so session pooling is disabled.
+
+  
+ 
+
+ 
+  session_schedule (enum)
+  
+   session_schedule configuration parameter
+  
+  
+  
+
+  Specifies scheduling policy for assigning session to proxies in case of
+  connection pooling. Default policy is round-robin.
+
+
+  With round-robin policy postmaster cyclicly scatter sessions between proxies.
+
+
+  With random policy postmaster randomly choose proxy for new session.
+
+
+  With load-balancing policy postmaster choose proxy with lowest load average.
+  Load average of proxy is estimated by number of clients connection assigned to this proxy with extra weight for SSL connections.
+   
+  
+ 
+
+ 
+  idle_pool_worker_timeout (integer)
+  
+   idle_pool_worker_timeout configuration parameter
+  
+  
+  
+
+ Terminate an idle connection pool worker after the specified number of milliseconds.
+ The default value is 0, so pool workers are never terminated.
+   
+  
+ 
+
+ 
+  restart_pooler_on_reload (boolean)
+  
+   restart_pooler_on_reload configuration parameter
+  
+  
+  
+
+  Restart session pool workers once pg_reload_conf() is called.

Re: pglz performance

2019-09-27 Thread Peter Eisentraut
On 2019-09-04 14:45, Andrey Borodin wrote:
>> On 2019-09-04 11:22, Andrey Borodin wrote:
 What about the two patches?  Which one is better?
>>> On our observations pglz_decompress_hacked.patch is best for most of tested 
>>> platforms.
>>> Difference is that pglz_decompress_hacked8.patch will not appply 
>>> optimization if decompressed match is not greater than 8 bytes. This 
>>> optimization was suggested by Tom, that's why we benchmarked it 
>>> specifically.
>>
>> The patches attached to the message I was replying to are named
>>
>> 0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch
>> 0001-Use-memcpy-in-pglz-decompression.patch
>>
>> Are those the same ones?
> 
> Yes. Sorry for this confusion.
> 
> The only difference of 
> 0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch is that it 
> fallbacks to byte-loop if len is <= 8.

After reviewing this thread and more testing, I think
0001-Use-memcpy-in-pglz-decompression.patch appears to be a solid win
and we should move ahead with it.

I don't, however, fully understand the code changes, and I think this
could use more and better comments.  In particular, I wonder about

off *= 2;

This is new logic that isn't explained anywhere.

This whole function is commented a bit strangely.  It begins with
"Otherwise", but there is nothing before it.  And what does "from OUTPUT
to OUTPUT" mean?  There is no "output" variable.  We should make this
match the code better.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_wal/RECOVERYHISTORY file remains after archive recovery

2019-09-27 Thread Fujii Masao
On Fri, Sep 27, 2019 at 7:16 PM Michael Paquier  wrote:
>
> On Fri, Sep 27, 2019 at 05:58:21PM +0900, Fujii Masao wrote:
> > So you think that it's better to remove them just after 
> > writeTimeLineHistory()?
> > Per the following Sawada-san's comment, I was thinking that idea is 
> > basically
> > not good. And, RemoveNonParentXlogFiles() also removes garbage files from
> > pg_wal. It's simpler if similar codes exist near. Thought?
>
> Sawada-san's argument of upthread is that it is not good to put
> exitArchiveRecovery() after writeTimeLineHistory(), which is what
> cbc55da has done per the reasons mentioned in the commit log, and we
> should not change that.
>
> My argument is we know that RECOVERYXLOG and RECOVERYHISTORY are not
> needed anymore at this stage of recovery, hence we had better remove
> them as soon as possible.  I am not convinced that it is a good idea
> to move the cleanup close to RemoveNonParentXlogFiles().  First, this
> is an entirely different part of the logic where the startup process
> has already switched to a new timeline.  Second, we add more steps
> between the moment the two files are not needed and the moment they
> are removed, so any failure in-between would cause those files to
> still be there (we cannot say either that we will not manipulate this
> code later on) and we don't want those files to lie around.  So,
> mentioning that we do the cleanup just after writeTimeLineHistory()
> because we don't need them anymore is more consistent with what has
> been done for ages for the end of archive recovery, something that
> cbc55da unfortunately broke.

Ok, I have no objection to remove them just after writeTimeLineHistory().

Regards,

-- 
Fujii Masao




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-27 Thread Tomas Vondra

On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:

On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
 wrote:


On 1/3/18 14:53, Tomas Vondra wrote:
>> I don't see the need to tie this setting to maintenance_work_mem.
>> maintenance_work_mem is often set to very large values, which could
>> then have undesirable side effects on this use.
>
> Well, we need to pick some default value, and we can either use a fixed
> value (not sure what would be a good default) or tie it to an existing
> GUC. We only really have work_mem and maintenance_work_mem, and the
> walsender process will never use more than one such buffer. Which seems
> to be closer to maintenance_work_mem.
>
> Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better.  We
don't have a lot of settings that depend on other settings, and the ones
we do have a very specific relationship.

>> Moreover, the name logical_work_mem makes it sound like it's a logical
>> version of work_mem.  Maybe we could think of another name.
>
> I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?



Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which the user might not have a good idea
how to set.  What problems do we see if we directly use work_mem for
this case?



IMHO it's similar to autovacuum_work_mem - we have an independent
setting, but most people use it as -1 so we use maintenance_work_mem as
a default value. I think it makes sense to do the same thing here.

It does add an extra knob anyway (I don't think we should just use
maintenance_work_mem directly, the user should have an option to
override it when needed). But most users will not notice.
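
Just to illustrate the autovacuum_work_mem-style behavior described above,
a small self-contained sketch (illustrative names only, not the actual GUC
machinery):

#include <stdio.h>

static int maintenance_work_mem = 65536;       /* kB, server-wide default */
static int logical_decoding_work_mem = -1;     /* -1 = fall back */

static int
effective_decoding_work_mem(void)
{
    if (logical_decoding_work_mem < 0)
        return maintenance_work_mem;           /* the autovacuum_work_mem style */
    return logical_decoding_work_mem;          /* explicit user override */
}

int
main(void)
{
    printf("effective limit: %d kB\n", effective_decoding_work_mem());
    logical_decoding_work_mem = 1024;          /* user sets it explicitly */
    printf("effective limit: %d kB\n", effective_decoding_work_mem());
    return 0;
}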

FWIW I don't think we should use work_mem, maintenace_work_mem seems
somewhat more appropriate here (not related to queries, etc.).


If we can't use work_mem, then I think the name proposed by you
(logical_decoding_work_mem) sounds good to me.



Yes, that name seems better.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_wal/RECOVERYHISTORY file remains after archive recovery

2019-09-27 Thread Michael Paquier
On Fri, Sep 27, 2019 at 05:58:21PM +0900, Fujii Masao wrote:
> So you think that it's better to remove them just after 
> writeTimeLineHistory()?
> Per the following Sawada-san's comment, I was thinking that idea is basically
> not good. And, RemoveNonParentXlogFiles() also removes garbage files from
> pg_wal. It's simpler if similar codes exist near. Thought?

Sawada-san's argument of upthread is that it is not good to put
exitArchiveRecovery() after writeTimeLineHistory(), which is what
cbc55da has done per the reasons mentioned in the commit log, and we
should not change that.

My argument is we know that RECOVERYXLOG and RECOVERYHISTORY are not
needed anymore at this stage of recovery, hence we had better remove
them as soon as possible.  I am not convinced that it is a good idea
to move the cleanup close to RemoveNonParentXlogFiles().  First, this
is an entirely different part of the logic where the startup process
has already switched to a new timeline.  Second, we add more steps
between the moment the two files are not needed and the moment they
are removed, so any failure in-between would cause those files to
still be there (we cannot say either that we will not manipulate this
code later on) and we don't want those files to lie around.  So,
mentioning that we do the cleanup just after writeTimeLineHistory()
because we don't need them anymore is more consistent with what has
been done for ages for the end of archive recovery, something that
cbc55da unfortunately broke.
--
Michael


signature.asc
Description: PGP signature


Re: let's kill AtSubStart_Notify

2019-09-27 Thread Jeevan Ladhe
Correction -

On Fri, Sep 27, 2019 at 3:11 PM Jeevan Ladhe 
wrote:

> I ran your testcase and on my VM I get numbers like 3593.801 ms
> without patch and 3593.801 with the patch, average of 5 runs each.
> The runs were quite consistent.
>

 3593.801 ms without patch and 3213.809 with the patch,
approx. 10% gain.

Regards,
Jeevan Ladhe


Re: let's kill AtSubStart_Notify

2019-09-27 Thread Jeevan Ladhe
>
> I did not read the patch but run the same case what you have given and
> I can see the similar improvement with the patch.
> With the patch 8832.988, without the patch 10252.701ms (median of three
> reading)
>

Possibly you had debug symbols enabled? With debug symbols enabled
I also get similar numbers: 10136.839 ms with the patch vs 12900.044 ms
without the patch.

Regards,
Jeevan Ladhe


Re: let's kill AtSubStart_Notify

2019-09-27 Thread Jeevan Ladhe
Hi Robert,

Generally, a subsystem can avoid needing a callback at subtransaction
> start (or transaction start) by detecting new levels of
> subtransactions at time of use.


Yes I agree with this argument.


> A typical practice is to maintain a
> stack which has entries only for those transaction nesting levels
> where the functionality was used. The attached patch implements this
> method for async.c.


I have reviewed your patch, and it seems to correctly implement the
per-subtransaction actions using a stack. At least I could not find
any flaw in your implementation here.
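
For readers following along, here is a schematic, self-contained
illustration of the lazy-stack idea being reviewed: push an entry only when
the subsystem is first used at a new nesting level, so subtransactions that
never touch it pay nothing. The names are invented; this is not the actual
async.c code.

#include <stdio.h>

#define MAX_DEPTH 64            /* bounds checks omitted for brevity */

typedef struct
{
    int nest_level;             /* nesting level this entry belongs to */
    int pending;                /* per-level state, e.g. queued notifies */
} StackEntry;

static StackEntry stack[MAX_DEPTH];
static int  stack_len = 0;

/* Called only when the subsystem is actually used. */
static void
note_usage(int current_nest_level)
{
    if (stack_len == 0 || stack[stack_len - 1].nest_level < current_nest_level)
    {
        stack[stack_len].nest_level = current_nest_level;
        stack[stack_len].pending = 0;
        stack_len++;
    }
    stack[stack_len - 1].pending++;
}

/* Called at subtransaction commit/abort; cheap when the level is absent. */
static void
at_subxact_end(int exiting_nest_level, int is_commit)
{
    if (stack_len == 0 || stack[stack_len - 1].nest_level < exiting_nest_level)
        return;                 /* this level never used the subsystem */

    if (!is_commit)
    {
        stack_len--;            /* abort: throw this level's state away */
        return;
    }

    /* commit: the pending work now belongs to the parent level */
    if (stack_len >= 2 && stack[stack_len - 2].nest_level == exiting_nest_level - 1)
    {
        stack[stack_len - 2].pending += stack[stack_len - 1].pending;
        stack_len--;
    }
    else
        stack[stack_len - 1].nest_level = exiting_nest_level - 1;   /* re-tag */
}

int
main(void)
{
    note_usage(2);              /* subsystem used at nesting level 2 */
    at_subxact_end(3, 1);       /* level 3 never used it: cheap no-op */
    at_subxact_end(2, 1);       /* commit of level 2: merged/re-tagged */
    printf("entries: %d, pending at top: %d\n",
           stack_len, stack_len ? stack[stack_len - 1].pending : 0);
    return 0;
}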


> I was a little surprised to find that it makes a
> pretty noticeable performance difference when starting and ending
> trivial subtransactions.  I used this test case:
>
> \timing
> do $$begin for i in 1 .. 1000 loop begin null; exception when
> others then null; end; end loop; end;$$;
>

I ran your testcase and on my VM I get numbers like 3593.801 ms
without patch and 3593.801 with the patch, average of 5 runs each.
The runs were quite consistent.

Further make check also passing well.

Regards,
Jeevan Ladhe


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-27 Thread Amit Kapila
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
 wrote:
>
> On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:
> >On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera  
> >wrote:
> >>
> >> In the interest of moving things forward, how far are we from making
> >> 0001 committable?  If I understand correctly, the rest of this patchset
> >> depends on https://commitfest.postgresql.org/24/944/ which seems to be
> >> moving at a glacial pace (or, actually, slower, because glaciers do
> >> move, which cannot be said of that other patch.)
> >>
> >
> >I am not sure if it is completely correct that the other part of the
> >patch is dependent on that CF entry.  I have studied both the threads
> >(not every detail) and it seems to me it is dependent on one of the
> >patches from that series which handles concurrent aborts.  It is patch
> >0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
> >from what the Nikhil has posted on that thread [1].  Am, I wrong?
> >
>
> You're right - the part handling aborts is the only part required. There
> are dependencies on some other changes from the 2PC patch, but those are
> mostly refactorings that can be undone (e.g. switch from independent
> flags to a single bitmap in reorderbuffer).
>
> >So IIUC, the problem of concurrent aborts is that if we allow catalog
> >scans for in-progress transactions, then we might get wrong answers in
> >cases where somebody has performed Alter-Abort-Alter which is clearly
> >explained with an example in email [2].  To solve that problem Nikhil
> >seems to have written a patch [1] which detects these concurrent
> >aborts during a system table scan and then aborts the decoding of such
> >a transaction.
> >
> >Now, the problem is that patch has written considering 2PC
> >transactions and might not deal with all cases for in-progress
> >transactions especially when sub-transactions are involved as alluded
> >by Arseny Sher [3].  So, the problem seems to be for cases when some
> >sub-transaction aborts, but the main transaction still continued and
> >we try to decode it.  Nikhil's patch won't be able to deal with it
> >because I think it just checks top-level xid whereas for this we need
> >to check all-subxids which I think is possible now as Tomas seems to
> >have written WAL for each xid-assignment.  It might or might not be
> >the best solution to check the status of all-subxids, but I think
> >first we need to agree that the problem is just for concurrent aborts
> >and that we can solve it by using some part of the technology being
> >developed as part of patch "Logical decoding of two-phase
> >transactions" (https://commitfest.postgresql.org/24/944/) rather than
> >the entire patchset.
> >
> >I hope I am not saying something very obvious here and it helps in
> >moving this patch forward.
> >
>
> No, that's a good question, and I'm not sure what the answer is at the
> moment. My understanding was that the infrastructure in the 2PC patch is
> enough even for subtransactions, but I might be wrong.
>

I also think the patch that handles concurrent aborts should be
sufficient, but that needs to be integrated with your patch.  Earlier,
I thought we needed to check whether any of the subtransactions was
aborted, as mentioned by Arseny Sher, but after thinking again
about that problem, it seems that checking only the status of the current
subtransaction should be sufficient.  Because, if the user does
Rollback to Savepoint concurrently which aborts multiple
subtransactions, the latest one must be aborted as well which is what
I think we want to detect.  Once we detect that, we have two options:
(a) restart the decode of that transaction by removing changes of all
subxacts or (b) somehow mark the transaction such that it gets decoded
only at the commit time.
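
To sketch option (a)'s control flow as I understand it -- around each
catalog access performed while decoding, check whether the subtransaction
whose changes are being decoded has aborted, and bail out of decoding that
transaction if so -- here is a rough, self-contained illustration. The
names and the xid-status lookup are stand-ins, not the real
reorderbuffer/transam code.

#include <stdio.h>
#include <setjmp.h>

typedef unsigned int TransactionId;

static jmp_buf decode_abort_jmp;

/* stand-in for a real transaction-status lookup such as clog access */
static int
xid_did_abort(TransactionId xid)
{
    return xid == 1002;         /* pretend subxact 1002 aborted concurrently */
}

/* called around every catalog scan performed while decoding */
static void
check_for_concurrent_abort(TransactionId current_subxid)
{
    if (xid_did_abort(current_subxid))
        longjmp(decode_abort_jmp, 1);   /* stop decoding this transaction */
}

static void
decode_change(TransactionId subxid)
{
    check_for_concurrent_abort(subxid); /* guard the catalog access */
    printf("decoded a change of subxact %u\n", subxid);
}

int
main(void)
{
    if (setjmp(decode_abort_jmp) != 0)
    {
        printf("concurrent abort detected; decoding of this xact stopped\n");
        return 0;
    }
    decode_change(1001);
    decode_change(1002);        /* triggers the bail-out path */
    return 0;
}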

>
> Maybe we should focus on the 0001 part for now - it can be committed
> indepently and does provide useful feature.
>

If that can be done sooner, then it is fine, but otherwise, preparing
the patches on top of HEAD can facilitate the review of those.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Zedstore - compressed in-core columnar storage

2019-09-27 Thread Alexandra Wang
Hi Ashutosh,

Sorry I indeed missed your question, thanks for the reminder!

On Wed, Sep 25, 2019 at 4:10 AM Ashutosh Sharma 
wrote:

> > Further, the UPDATE operation on zedstore table is very slow. I think
> > that's because in case of zedstore table we have to update all the
> > btree data structures even if one column is updated and that really
> > sucks. Please let me know if there is some other reason for it.
> >
>
> There was no answer for this in your previous reply. It seems like you
> missed it. As I said earlier, I tried performing UPDATE operation with
> optimised build and found that to update around 10 lacs record in
> zedstore table it takes around 24k ms whereas for normal heap table it
> takes 2k ms. Is that because in case of zedstore table we have to
> update all the Btree data structures even if one column is updated or
> there is some other reason for it. If yes, could you please let us
> know. FYI, I'm trying to update the table with just two columns.
>

Zedstore's UPDATE operation currently fetches the old rows, updates the
undo pointers stored in the tid btree, and inserts new rows into all
the attribute btrees with the new tids. So the performance of updating one
column is no different from updating all the columns. That said,
the wider the table is, the longer it takes to update, regardless of
whether you update one column or all of them.
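
A schematic way to see why the table width, rather than the number of
updated columns, drives the cost as described above (invented names, not
the real zedstore functions):

#include <stdio.h>

static void
zedstore_update_sketch(int natts)
{
    int work = 0;

    work++;                     /* fetch the old row version */
    work++;                     /* update the undo pointer in the tid btree */
    for (int attno = 1; attno <= natts; attno++)
        work++;                 /* insert the new datum into each attribute btree */

    printf("%d attributes -> %d btree operations per updated row\n",
           natts, work);
}

int
main(void)
{
    zedstore_update_sketch(1);  /* one-column table */
    zedstore_update_sketch(10); /* wider table: cost grows with width */
    return 0;
}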

However, since your test table only has two columns, and we also
tested the same on a one-column table and got similar results as
yours, there is definitely room for optimizations. Attached file
zedstore_update_flames_lz4_first_update.svg is the profiling results
for the update query on a one-column table with 1M records. It spent
most of the time in zedstoream_fetch_row() and zsbt_tid_update(). For
zedstoream_fetch_row(), Taylor and I had some interesting findings
which I'm going to talk about next, I haven't dived into
zsbt_tid_update() yet and need to think about it more.

To understand what slows down zedstore UDPATE, Taylor and I did the
following test and profiling on a zedstore table with only one column.

postgres=# create table onecol(a int) using zedstore;
postgres=# insert into onecol select i from generate_series(1, 100) i;

-- Create view to count zedstore pages group by page types
postgres=# CREATE VIEW pg_zs_page_counts AS
SELECT
c.relnamespace::regnamespace,
c.oid,
c.relname,
pg_zs_page_type(c.oid, generate_series(0, c.relpages - 1)),
count(*)
FROM pg_am am
JOIN pg_class c ON (c.relam = am.oid)
WHERE am.amname='zedstore'
GROUP BY 1,2,3,4;

postgres=# select * from pg_zs_page_counts;
 relnamespace |  oid  | relname | pg_zs_page_type | count
--+---+-+-+---
 public   | 32768 | onecol  | BTREE   |   640
 public   | 32768 | onecol  | FREE|90
 public   | 32768 | onecol  | META| 1
(3 rows)

-- Run update query the first time
postgres=# update onecol set a = 200; -- profiling attached in
zedstore_update_flames_lz4_first_update.svg
Time: 28760.199 ms (00:28.760)

postgres=# select * from pg_zs_page_counts;
 relnamespace |  oid  | relname | pg_zs_page_type | count
--+---+-+-+---
 public   | 32768 | onecol  | BTREE   |  6254
 public   | 32768 | onecol  | FREE| 26915
 public   | 32768 | onecol  | META| 1
(6 rows)

postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 0;
 count
---
  5740
(1 row)

postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 1;
 count
---
   514
(1 row)

postgres=# select * from pg_zs_btree_pages('onecol') where attno = 1 and
totalsz > 0;
 blkno |  nextblk   | attno | level |  lokey  |      hikey      | nitems | ncompressed | totalsz | uncompressedsz | freespace
-------+------------+-------+-------+---------+-----------------+--------+-------------+---------+----------------+-----------
   730 |       6580 |     1 |     0 |      01 |         1182451 |      1 |           1 |    3156 |         778480 |      4980
  6580 |      13030 |     1 |     0 | 1182451 |         1380771 |      2 |           1 |    8125 |         859104 |        11
 13030 |      19478 |     1 |     0 | 1380771 |         1579091 |      2 |           1 |    8125 |         859104 |        11
 19478 |      25931 |     1 |     0 | 1579091 |         1777411 |      2 |           1 |    8125 |         859104 |        11
 25931 |      32380 |     1 |     0 | 1777411 |         1975731 |      2 |           1 |    8125 |         859104 |        11
 32380 | 4294967295 |     1 |     0 | 1975731 | 281474976645120 |      2 |           1 |    2033 |         105016 |      6103
(6 rows)

-- Run update query the second time
postgres=# update onecol set a = 200; -- profiling attached in
zedstore_update_flames_lz4_second_update.svg
Time: 267135.703 ms (04:27.136)

As you can see, it 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-27 Thread Amit Kapila
On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
 wrote:
>
> On 1/3/18 14:53, Tomas Vondra wrote:
> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> maintenance_work_mem is often set to very large values, which could
> >> then have undesirable side effects on this use.
> >
> > Well, we need to pick some default value, and we can either use a fixed
> > value (not sure what would be a good default) or tie it to an existing
> > GUC. We only really have work_mem and maintenance_work_mem, and the
> > walsender process will never use more than one such buffer. Which seems
> > to be closer to maintenance_work_mem.
> >
> > Pretty much any default value can have undesirable side effects.
>
> Let's just make it an independent setting unless we know any better.  We
> don't have a lot of settings that depend on other settings, and the ones
> we do have a very specific relationship.
>
> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> version of work_mem.  Maybe we could think of another name.
> >
> > I won't object to a better name, of course. Any proposals?
>
> logical_decoding_[work_]mem?
>

Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which the user might not have a good idea
how to set.  What problems do we see if we directly use work_mem for
this case?

If we can't use work_mem, then I think the name proposed by you
(logical_decoding_work_mem) sounds good to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pg_wal/RECOVERYHISTORY file remains after archive recovery

2019-09-27 Thread Fujii Masao
On Fri, Sep 27, 2019 at 5:09 PM Michael Paquier  wrote:
>
> On Fri, Sep 27, 2019 at 01:51:25PM +0900, Masahiko Sawada wrote:
> > On Thu, Sep 26, 2019 at 6:23 PM Fujii Masao  wrote:
> >> What about moving the logic that removes RECOVERYXLOG and
> >> RECOVERYHISTORY from exitArchiveRecovery() and performing it
> >> just before/after RemoveNonParentXlogFiles()? Which looks simple.
> >
> > Agreed. Attached the updated patch.
>
> Mea culpa here, I have just noticed the thread.  Fujii-san, would you
> prefer if I take care of it?  And also, what's the issue with not
> doing the removal of both files just after writeTimeLineHistory()?
> That's actually what happened before cbc55da5.

So you think that it's better to remove them just after writeTimeLineHistory()?
Per the following Sawada-san's comment, I was thinking that idea is basically
not good. And, RemoveNonParentXlogFiles() also removes garbage files from
pg_wal. It's simpler if similar codes exist near. Thought?

Regards,

-- 
Fujii Masao




Re: recovery starting when backup_label exists, but not recovery.signal

2019-09-27 Thread Fujii Masao
On Fri, Sep 27, 2019 at 4:07 PM Masahiko Sawada  wrote:
>
> On Fri, Sep 27, 2019 at 3:36 AM David Steele  wrote:
> >
> > On 9/24/19 1:25 AM, Fujii Masao wrote:
> > >
> > > When backup_label exists, the startup process enters archive recovery mode
> > > even if recovery.signal file doesn't exist. In this case, the startup 
> > > process
> > > tries to retrieve WAL files by using restore_command. Then, at the 
> > > beginning
> > > of the archive recovery, the contents of backup_label are copied to 
> > > pg_control
> > > and backup_label file is removed. This would be an intentional behavior.
> >
> > > But I think the problem is that, if the server shuts down during that
> > > archive recovery, the restart of the server may cause the recovery to fail
> > > because neither backup_label nor recovery.signal exist and the server
> > > doesn't enter an archive recovery mode. Is this intentional, too? Seems 
> > > No.
> > >
> > > So the problematic scenario is;
> > >
> > > 1. the server starts with backup_label, but not recovery.signal.
> > > 2. the startup process enters an archive recovery mode because
> > > backup_label exists.
> > > 3. the contents of backup_label are copied to pg_control and
> > > backup_label is deleted.
> >
> > Do you mean deleted or renamed to backup_label.old?
> >
> > > 4. the server shuts down..
> >
> > This happens after the cluster has reached consistency?
> >
> > > 5. the server is restarted. neither backup_label nor recovery.signal 
> > > exist.
> > > 6. the startup process starts just crash recovery because neither 
> > > backup_label
> > > nor recovery.signal exist. Since it cannot retrieve WAL files from 
> > > archival
> > > area, it may fail.
> >
> > I tried a few ways to reproduce this but was not successful without
> > manually removing WAL.
>
> Hmm me too. I think that since we enter crash recovery at step #6 we
> don't retrieve WAL files from archival area.
>
> But I reproduced the problem Fujii-san mentioned: the restart of
> the server during archive recovery causes crash recovery
> instead of resuming the archive recovery.

Yes, it's strange and unexpected to start crash recovery
when restarting archive recovery. Archive recovery should
start again in that case, I think.

Regards,

-- 
Fujii Masao




Re: recovery starting when backup_label exists, but not recovery.signal

2019-09-27 Thread Fujii Masao
On Fri, Sep 27, 2019 at 3:36 AM David Steele  wrote:
>
> On 9/24/19 1:25 AM, Fujii Masao wrote:
> >
> > When backup_label exists, the startup process enters archive recovery mode
> > even if recovery.signal file doesn't exist. In this case, the startup 
> > process
> > tries to retrieve WAL files by using restore_command. Then, at the beginning
> > of the archive recovery, the contents of backup_label are copied to 
> > pg_control
> > and backup_label file is removed. This would be an intentional behavior.
>
> > But I think the problem is that, if the server shuts down during that
> > archive recovery, the restart of the server may cause the recovery to fail
> > because neither backup_label nor recovery.signal exist and the server
> > doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
> >
> > So the problematic scenario is;
> >
> > 1. the server starts with backup_label, but not recovery.signal.
> > 2. the startup process enters an archive recovery mode because
> > backup_label exists.
> > 3. the contents of backup_label are copied to pg_control and
> > backup_label is deleted.
>
> Do you mean deleted or renamed to backup_label.old?

Sorry for the confusing wording..
I meant the following code that renames backup_label to .old, in StartupXLOG().

/*
* If there was a backup label file, it's done its job and the info
* has now been propagated into pg_control.  We must get rid of the
* label file so that if we crash during recovery, we'll pick up at
* the latest recovery restartpoint instead of going all the way back
* to the backup start point.  It seems prudent though to just rename
* the file out of the way rather than delete it completely.
*/
if (haveBackupLabel)
{
unlink(BACKUP_LABEL_OLD);
durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
   }

> > 4. the server shuts down..
>
> This happens after the cluster has reached consistency?

You need to shut down the server before WAL replay finishes,
no matter whether it has reached the consistency point or not.

> > 5. the server is restarted. neither backup_label nor recovery.signal exist.
> > 6. the startup process starts just crash recovery because neither 
> > backup_label
> > nor recovery.signal exist. Since it cannot retrieve WAL files from 
> > archival
> > area, it may fail.
>
> I tried a few ways to reproduce this but was not successful without
> manually removing WAL.  Probably I just needed a much larger set of WAL.
>
> I assume you have a repro?  Can you give more details?

What I did is:

1. Start PostgreSQL server with WAL archiving enabled.
2. Take an online backup by using pg_basebackup, for example,
 $ pg_basebackup -D backup
3. Execute many write SQL to generate lots of WAL files. During that execution,
perform CHECKPOINT to remove some WAL files from pg_wal directory.
You need to repeat these until you confirm that there are many WAL files
that have already been removed from pg_wal but exist only in archive area.
 4. Shutdown the server.
 5. Remove PGDATA and restore it from backup.
 6. Set up restore_command.
 7. (Forget to put recovery.signal)
 That is, in this scenario, you want to recover the database up to
 the latest WAL records in the archive area. So you need to start archive
 recovery by setting restore_command and putting recovery.signal.
 But the problem happens when you forget to put recovery.signal.
 8. Start PostgreSQL server.
 9. Shutdown the server while it's restoring archived WAL files and replaying
 them. At this point, you will notice that the archive recovery starts
 even though recovery.signal doesn't exist. So even archived WAL files
 are successfully restored at this step.
 10. Restart PostgreSQL server. Since neither backup_label nor recovery.signal
 exists, crash recovery starts and fails to restore the archived WAL files.
 So you fail to recover the database up to the latest WAL record in the
 archive directory; the recovery will finish at an earlier point.

> > One idea to fix this issue is to make the above step #3 remember that
> > backup_label existed, in pg_control. Then we should make the subsequent
> > recovery enter an archive recovery mode if pg_control indicates that
> > even if neither backup_label nor recovery.signal exist. Thought?
>
> That seems pretty invasive to me at this stage.  I'd like to reproduce
> it and see if there are alternatives.
>
> Also, are you sure this is a new behavior?

In v11 or before, if backup_label exists but not recovery.conf,
the startup process doesn't enter archive recovery mode. It starts
crash recovery in that case. So the behavior is somewhat different
between versions.

Regards,

-- 
Fujii Masao




Re: [PATCH] psql: add tab completion for \df slash command suffixes

2019-09-27 Thread Michael Paquier
On Wed, Sep 04, 2019 at 03:19:57PM +0900, Ian Barwick wrote:
> Ah, good catch, I will update and resubmit.

Ping.  Ian, could you update and resubmit please?
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-09-27 Thread Michael Paquier
On Thu, Aug 01, 2019 at 09:39:53PM +1200, Thomas Munro wrote:
> Looks like some good actionable feedback.  I've moved this patch to
> September, and set it to "Waiting on Author".

The patch is in this state for two months now, so I have switched it
to "returned with feedback".  The latest patch does not apply, and it
would require an update for the new test module dummy_index_am.

As I have recently worked in this area of the code, I took a look at it
and the core of the idea is that it would be cleaner to move the
control of reloptions from reloptions.c to each AM, as well as have an
extra layer for toast.  This has additional costs in multiple ways:
- The parsing and filling of reloptions gets more duplicated, though
this could be solved with a wrapper routine (something HEAD already
does when filling in rd_options).
- Logic which was out of toast previously is not anymore.

Some of these have been mentioned by Tomas upthread, and I share
similar points.  I also doubt that the removal of
RelationGetTargetPageFreeSpace() is a good thing, and this further
complicates the checks around options and the handling of rd_options,
which may now optionally be NULL, making the whole structure a bit more
brittle.  We may be able to get useful pieces for it, though it is
not clear what would be the benefits.  It may help as well to group
that within the thread dedicated to the rework of the reloption APIs
as a preliminary, refactoring step.
--
Michael


signature.asc
Description: PGP signature


Re: Implementing Incremental View Maintenance

2019-09-27 Thread Yugo Nagata
On Tue, 17 Sep 2019 12:03:20 -0600
Paul Draper  wrote:

> Have you had any thoughts for more than two joined tables?
> 
> Either there needs to be an quadratic number of joins, or intermediate join
> results need to be stored and reused.

I don't think that we need to store intermediate join results.

Suppose that we have a view V joining tables R, S, and T, and that new tuples
dR, dS, and dT are inserted into each table respectively.

 V = R*S*T
 R_new = R + dR
 S_new = S + dS
 T_new = T + dT

In this situation, we can calculate the new view state as below.

V_new 
= R_new * S_new * T_new
= (R + dR) * (S + dS) * (T + dT)
= R*S*T + dR*(S + dS)*(T + dT) + R*dS*(T + dT) + R*S*dT
= V + dR*(S + dS)*(T + dT) + R*dS*(T + dT) + R*S*dT

Although the number of terms is 2^3 (= 8), if we can use both the pre-update
state (e.g. R) and the post-update state (e.g. R + dR), we need only three joins.
Actually, the post-update state is available in an AFTER trigger, and the
pre-update state can be calculated by using delta tables (transition tables)
and the cmin/xmin system columns (or a snapshot). This is the approach my
implementation adopts.


> 
> On Tue, Sep 17, 2019 at 8:50 AM Yugo Nagata  wrote:
> 
> > Hi Paul,
> >
> > Thank you for your suggestion.
> >
> > On Sun, 15 Sep 2019 11:52:22 -0600
> > Paul Draper  wrote:
> >
> > > As I understand it, the current patch performs immediate IVM using AFTER
> > > STATEMENT trigger transition tables.
> > >
> > > However, multiple tables can be modified *before* AFTER STATEMENT
> > triggers
> > > are fired.
> > >
> > > CREATE TABLE example1 (a int);
> > > CREATE TABLE example2 (a int);
> > >
> > > CREATE INCREMENTAL MATERIALIZED VIEW mv AS
> > > SELECT example1.a, example2.a
> > > FROM example1 JOIN example2 ON a;
> > >
> > > WITH
> > >   insert1 AS (INSERT INTO example1 VALUES (1)),
> > >   insert2 AS (INSERT INTO example2 VALUES (1))
> > > SELECT NULL;
> > >
> > > Changes to example1 are visible in an AFTER STATEMENT trigger on
> > example2,
> > > and vice versa. Would this not result in the (1, 1) tuple being
> > > "double-counted"?
> > >
> > > IVM needs to either:
> > >
> > > (1) Evaluate deltas "serially' (e.g. EACH ROW triggers)
> > >
> > > (2) Have simultaneous access to multiple deltas:
> > > delta_mv = example1 x delta_example2 + example2 x delta_example1 -
> > > delta_example1 x delta_example2
> > >
> > > This latter method is the "logged" approach that has been discussed for
> > > deferred evaluation.
> > >
> > > tl;dr It seems that AFTER STATEMENT triggers required a deferred-like
> > > implementation anyway.
> >
> > You are right,  the latest patch doesn't support the situation where
> > multiple tables are modified in a query. I noticed this when working
> > on self-join, which also virtually need to handle multiple table
> > modification.
> >
> > I am now working on this issue and the next patch will enable to handle
> > this situation. I plan to submit the patch during this month. Roughly
> > speaking, in the new implementation, AFTER STATEMENT triggers are used to
> > collect  information of modified table and its changes (= transition
> > tables),
> > and then the only last trigger updates the view. This will avoid the
> > double-counting. I think this implementation also would be a base of
> > deferred approach implementation in future where "logs" are used instead
> > of transition tables.
> >
> > Regards,
> > Yugo Nagata
> >
> > --
> > Yugo Nagata 
> >


-- 
Yugo Nagata 




Re: pg_wal/RECOVERYHISTORY file remains after archive recovery

2019-09-27 Thread Michael Paquier
On Fri, Sep 27, 2019 at 01:51:25PM +0900, Masahiko Sawada wrote:
> On Thu, Sep 26, 2019 at 6:23 PM Fujii Masao  wrote:
>> What about moving the logic that removes RECOVERYXLOG and
>> RECOVERYHISTORY from exitArchiveRecovery() and performing it
>> just before/after RemoveNonParentXlogFiles()? Which looks simple.
>
> Agreed. Attached the updated patch.

Mea culpa here, I have just noticed the thread.  Fujii-san, would you
prefer if I take care of it?  And also, what's the issue with not
doing the removal of both files just after writeTimeLineHistory()?
That's actually what happened before cbc55da5.
--
Michael


signature.asc
Description: PGP signature


Re: Keep compiler silence (clang 10, implicit conversion from 'long' to 'double' )

2019-09-27 Thread Yuya Watari
Hello,

Let me add further information. This issue also involves a problem with
*overflow checking*.

The original code is as follows.

src/backend/utils/adt/timestamp.c:3222
-
 if (result_double > PG_INT64_MAX || result_double < PG_INT64_MIN)
  ereport(ERROR,
(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
 errmsg("interval out of range")));
 result->time = (int64) result_double;
-

Here, the code checks whether "result_double" fits within the 64-bit integer
range before casting it.

However, as I have mentioned in the previous email, PG_INT64_MAX is
cast to double and the value becomes 9223372036854775808 due to lack
of precision.
Therefore, the above code is identical to "result_double >
9223372036854775808.0". This checking does not cover the case when
result_double is equal to 9223372036854775808. In this case, "(int64)
result_double" will be -9223372036854775808, which is wrong.

The following code confirms what I explained.

===
#include 
#include 
int main(void)
{
double value = (double) INT64_MAX;
printf("INT64_MAX = %ld\n", INT64_MAX);
printf("value = %lf\n", value);
printf("(value > (double) INT64_MAX) == %d\n", value > (double) INT64_MAX);
printf("(long int) value == %ld\n", (long int) value);
}
===
Output:
INT64_MAX = 9223372036854775807
value = 9223372036854775808.00
(value > (double) INT64_MAX) == 0
(long int) value == -9223372036854775808
===

I think the code should be "result_double >= (double) PG_INT64_MAX",
that is we have to use >= rather than >. I attached the modified
patch.
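
To make the proposed change concrete, here is a small standalone rendering
of the corrected guard; it mirrors the test program above rather than the
actual timestamp.c context.

#include <stdio.h>
#include <stdint.h>

static int
fits_in_int64(double result_double)
{
    /*
     * Use >= for the upper bound: (double) INT64_MAX rounds up to 2^63,
     * so the == case must be rejected too.  The lower bound keeps <,
     * because -2^63 is exactly representable and does fit in int64.
     */
    if (result_double >= (double) INT64_MAX ||
        result_double < (double) INT64_MIN)
        return 0;               /* would overflow: report "out of range" */
    return 1;
}

int
main(void)
{
    printf("%d\n", fits_in_int64(9223372036854775808.0));  /* 0: rejected */
    printf("%d\n", fits_in_int64(42.0));                    /* 1: fine */
    return 0;
}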

Thanks,
Yuya Watari
NTT Software Innovation Center
watari.y...@gmail.com

2019年9月27日(金) 12:00 Yuya Watari :
>
> Hello,
>
> I found the problem that clang compiler introduces warnings when building 
> PostgreSQL. Attached patch fixes it.
>
> ===
> Compiler version
> ===
> clang version 10.0.0-svn372772-1~exp1+0~20190924181208.2504~1.gbpb209ff 
> (trunk)
>
> Older versions of clang may not generate this warning.
>
> ===
> Warning
> ===
>
> timestamp.c:3236:22: warning: implicit conversion from 'long' to 'double' 
> changes value from 9223372036854775807 to 9223372036854775808 
> [-Wimplicit-int-float-conversion]
> if (result_double > PG_INT64_MAX || result_double < PG_INT64_MIN)
>   ~ ^~~~
> ../../../../src/include/c.h:444:22: note: expanded from macro 'PG_INT64_MAX'
> #define PG_INT64_MAXINT64CONST(0x7FFF)
> ^~
> ../../../../src/include/c.h:381:25: note: expanded from macro 'INT64CONST'
> #define INT64CONST(x)  (x##L)
> ^~~~
> :234:1: note: expanded from here
> 0x7FFFL
> ^~~
> 1 warning generated.
> pgbench.c:1657:30: warning: implicit conversion from 'long' to 'double' 
> changes value from 9223372036854775807 to 9223372036854775808 
> [-Wimplicit-int-float-conversion]
> if (dval < PG_INT64_MIN || PG_INT64_MAX < dval)
>^~~~ ~
> ../../../src/include/c.h:444:22: note: expanded from macro 'PG_INT64_MAX'
> #define PG_INT64_MAXINT64CONST(0x7FFF)
> ^~
> ../../../src/include/c.h:381:25: note: expanded from macro 'INT64CONST'
> #define INT64CONST(x)  (x##L)
> ^~~~
> :252:1: note: expanded from here
> 0x7FFFL
> ^~~
> 1 warning generated.
>
> ===
>
> This warning is due to implicit conversion from PG_INT64_MAX to double, which 
> drops the precision as described in the warning. This drop is not a problem 
> in this case, but we have to get rid of useless warnings. Attached patch 
> casts PG_INT64_MAX explicitly.
>
> Thanks,
> Yuya Watari
> NTT Software Innovation Center
> watari.y...@gmail.com


v2-keep-compiler-silence.patch
Description: Binary data


RE: Shared Memory: How to use SYSV rather than MMAP ?

2019-09-27 Thread REIX, Tony
Hello Thomas, Alvaro,

Sorry for the late answer, I missed your message of September 10. (I'm working 
on several different projects in parallel.)
Let me talk with Sylvie ASAP and see when I will be able to test it, probably 
next week, Tuesday. Is that OK for you?

Regards,

Tony

From: Alvaro Herrera 
Sent: Thursday, September 26, 2019 21:22
To: Thomas Munro 
Cc: REIX, Tony ; Andres Freund ; 
Robert Haas ; Pg Hackers ; 
EMPEREUR-MOT, SYLVIE 
Subject: Re: Shared Memory: How to use SYSV rather than MMAP ?

On 2019-Sep-10, Thomas Munro wrote:

> Here's a quick rebase.  It needs testing, review and (probably)
> adjustment from AIX users.  I'm not going to be able to do anything
> with it on my own due to lack of access, though I'm happy to help get
> this committed eventually.  If we don't get any traction in this CF,
> I'll withdraw this submission for now.  For consistency, I think we
> should eventually also do the same thing for Linux when using sysv
> (it's pretty similar, it just uses different flag names; it may also
> be necessary to query the page size and round up the requested size,
> on one or both of those OSes; I'm not sure).

Tony, Sylvie, any chance for some testing on this patch?  It seems that
without that, this patch is going to waste.

If I don't hear from anyone on September 30, I'm going to close this as
Returned with Feedback.

--
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Don't codegen deform code for virtual tuples in expr eval for scan fetch

2019-09-27 Thread Andres Freund
Hi,

On 2019-09-25 22:11:51 -0700, Soumyadeep Chakraborty wrote:
> Thank you very much for reviewing my patch!
> 
> On Wed, Sep 25, 2019 at 1:02 PM Andres Freund  wrote:
> > IOW, wherever ExecComputeSlotInfo() is called, we should only actually
> > push the expression step, when ExecComputeSlotInfo does not determine
> > that a) the slot is virtual, b) and fixed (i.e. guaranteed to always be
> > the same type of slot).
> >
> > Does that make sense?
> 
> That is a great suggestion and I totally agree. I have attached a patch
> that achieves this.

I think as done, this has the slight disadvantage of removing the
fast-path for small interpreted expressions, because that explicitly
matches for seeing the EEOP_*_FETCHSOME ops. Look at execExprInterp.c,
around:
/*
 * Select fast-path evalfuncs for very simple expressions.  "Starting 
up"
 * the full interpreter is a measurable overhead for these, and these
 * patterns occur often enough to be worth optimizing.
 */
if (state->steps_len == 3)
{

So I think we'd have to add a separate fastpath for virtual slots for
that.

What do you think about the attached?

Greetings,

Andres Freund
>From 26b963f2989cdeec963f30c5d9b6c0e9b942741f Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 26 Sep 2019 11:14:46 -0700
Subject: [PATCH v1 1/2] Reduce code duplication for ExecJust*Var operations.

This is mainly in preparation for introducing a few additional
fastpath implementations.

Also reorder ExecJust*Var functions to be consistent with the order in
which they're used.

Author: Andres Freund
Discussion: https://postgr.es/m/cae-ml+9oksn71+mhtfmd-l24odp8dgtfavjdu6u+j+fnaw5...@mail.gmail.com
---
 src/backend/executor/execExprInterp.c | 94 ++-
 1 file changed, 35 insertions(+), 59 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 293bfb61c36..e876160a0e7 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -155,11 +155,11 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
-static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustApplyFuncToCase(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull);
 
 
 /*
@@ -1966,13 +1966,12 @@ ShutdownTupleDescRef(Datum arg)
  * Fast-path functions, for very simple expressions
  */
 
-/* Simple reference to inner Var */
-static Datum
-ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull)
+/* implementation of ExecJust(Inner|Outer|Scan)Var */
+static pg_attribute_always_inline Datum
+ExecJustVarImpl(ExprState *state, TupleTableSlot *slot, bool *isnull)
 {
	ExprEvalStep *op = &state->steps[1];
 	int			attnum = op->d.var.attnum + 1;
-	TupleTableSlot *slot = econtext->ecxt_innertuple;
 
	CheckOpSlotCompatibility(&state->steps[0], slot);
 
@@ -1984,52 +1983,34 @@ ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return slot_getattr(slot, attnum, isnull);
 }
 
-/* Simple reference to outer Var */
+/* Simple reference to inner Var */
 static Datum
-ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull)
+ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull)
 {
-	ExprEvalStep *op = &state->steps[1];
-	int			attnum = op->d.var.attnum + 1;
-	TupleTableSlot *slot = econtext->ecxt_outertuple;
-
-	CheckOpSlotCompatibility(&state->steps[0], slot);
-
-	/* See comments in ExecJustInnerVar */
-	return slot_getattr(slot, attnum, isnull);
+	return ExecJustVarImpl(state, econtext->ecxt_innertuple, isnull);
 }
 
-/* Simple reference to scan Var */
+/* Simple reference to outer Var */
 static Datum
-ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
+ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull)
 {
-	ExprEvalStep *op = &state->steps[1];
-	int			attnum = op->d.var.attnum + 1;
-	TupleTableSlot *slot = econtext->ecxt_scantuple;
-
-	CheckOpSlotCompatibility(&state->steps[0], slot);
-
-	/* See comments in ExecJustInnerVar */
-	return slot_getattr(slot, attnum, isnull);
+	return ExecJustVarImpl(state, econtext->ecxt_outertuple, isnull);
 }
 
-/* Simple Const expression */
+/* Simple reference to scan Var */
 static Datum
-ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull)
+ExecJustScanVar(ExprState 

Re: pg_upgrade fails with non-standard ACL

2019-09-27 Thread Michael Paquier
On Thu, Sep 26, 2019 at 04:19:38PM -0400, Bruce Momjian wrote:
> On Thu, Sep 26, 2019 at 05:16:19PM -0300, Alvaro Herrera wrote:
>> On 2019-Sep-26, Bruce Momjian wrote:
>>> Well, right now, pg_upgrade --check succeeds, but the upgrade fails.  I
>>> am proposing, at a minimum, that pg_upgrade --check fails in such cases,
>> 
>> Agreed, that should be a minimum fix.
> 
> Yes.

Agreed as well here.  At least the latest patch proposed has the merit
of automatically tracking functions that no longer exist between the
source's version and the target's version, so patching --check offers a
good compromise.  Bruce, are you planning to look further at the patch
posted at [1]?

[1]: 
https://www.postgresql.org/message-id/392ca335-068d-7bd3-0ad8-fdf0a45d9...@postgrespro.ru
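
To make the discussion concrete, here is a rough sketch of the general
shape such a --check test could take (illustrative only, and much
simpler than the proposed patch at [1], which additionally has to work
out which of the flagged functions changed or disappeared in the target
version):

/*
 * Sketch only: flag built-in functions whose ACLs differ from their
 * initdb-time defaults, using the usual pg_upgrade check.c helpers.
 */
static void
check_for_nondefault_proc_acls(ClusterInfo *cluster)
{
	bool		found = false;
	int			dbnum;

	prep_status("Checking for system functions with non-default ACLs");

	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
	{
		DbInfo	   *db = &cluster->dbarr.dbs[dbnum];
		PGconn	   *conn = connectToServer(cluster, db->db_name);
		PGresult   *res;

		res = executeQueryOrDie(conn,
								"SELECT p.oid::pg_catalog.regprocedure "
								"FROM pg_catalog.pg_proc p "
								"LEFT JOIN pg_catalog.pg_init_privs ip "
								"  ON ip.objoid = p.oid "
								" AND ip.classoid = 'pg_catalog.pg_proc'::pg_catalog.regclass "
								"WHERE p.oid < 16384 "
								"  AND p.proacl IS DISTINCT FROM ip.initprivs");
		if (PQntuples(res) > 0)
			found = true;

		PQclear(res);
		PQfinish(conn);
	}

	if (found)
		pg_fatal("System functions with non-default ACLs were found.\n");
	else
		check_ok();
}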
--
Michael


signature.asc
Description: PGP signature


Re: Unstable select_parallel regression output in 12rc1

2019-09-27 Thread Christoph Berg
Re: Tom Lane 2019-09-26 <12685.1569510...@sss.pgh.pa.us>
> We haven't seen it in quite some time in HEAD, though I fear that's
> just due to bad luck or change of timing of unrelated tests.

The v13 package builds that are running every 6h here haven't seen a
problem yet either, so the probability of triggering it seems very
low; it's not a pressing problem.  (There are some extension modules
whose test suites fail at a much higher rate; getting all targets
to pass at the same time is next to impossible there :(. )

Christoph




Re: Refactoring of connection with password prompt loop for frontends

2019-09-27 Thread Michael Paquier
On Thu, Sep 26, 2019 at 10:06:27AM -0300, Alvaro Herrera wrote:
> Hmm, you have an XXX comment that appears to need addressing; and I'd
> add an explanation about the looping behavior to the comment atop the
> new function.
> 
> I see that vacuumlo and scripts/common retain their "have_password"
> variables.  That seems to be so that they can reuse one for other
> databases.  But how does that work with the new routine?

That's the second bullet point I mentioned at the top of the thread
(https://www.postgresql.org/message-id/20190822074558.gg1...@paquier.xyz),
and something I wanted to discuss for this patch:
"Some code paths spawn a password prompt before the first connection
attempt..."

Historically, vacuumlo, scripts/common.c and pg_backup_db.c ask for a
password before making the first connection attempt, which is different
from any other code path.  That's the reason why have_password is
still there in the case of the first one.  scripts/common.c and
pg_dump are different issues, as we don't want to prompt again for the
password for each database connection.  That point is also related to
the XXX comment, because if we make the logic more consistent with the
rest (i.e. ask for the password only after the first connection attempt
fails), then we could completely remove the extra handling with
saved_password (meaning no more XXX).  That would also be nicer as we
want to keep a fixed-size password buffer of 100 bytes.
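
To make that concrete, a minimal sketch (not the patch's code; the
function name is mine, and it assumes the current fixed-buffer
simple_prompt() is visible via postgres_fe.h) of a helper that connects
first and only prompts when libpq reports that a password is required:

#include "postgres_fe.h"

#include "libpq-fe.h"

static PGconn *
connect_prompting_on_demand(const char *dbname, const char *user)
{
	const char *keywords[] = {"dbname", "user", "password", NULL};
	const char *values[] = {dbname, user, NULL, NULL};
	char		password[100];	/* fixed-size buffer, as mentioned above */
	PGconn	   *conn;

	conn = PQconnectdbParams(keywords, values, false);

	if (PQstatus(conn) == CONNECTION_BAD && PQconnectionNeedsPassword(conn))
	{
		/* First attempt failed only for lack of a password: ask and retry. */
		PQfinish(conn);
		simple_prompt("Password: ", password, sizeof(password), false);
		values[2] = password;
		conn = PQconnectdbParams(keywords, values, false);
	}

	if (PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "could not connect: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return NULL;
	}

	return conn;
}

With this shape there is no need to remember a password before the
first attempt, which is what would let the saved_password/XXX handling
go away.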

Thinking more about it, keeping connect_with_password_prompt() as
simple as possible would be nicer for future maintenance, and it looks
like we may actually be able to remove allow_password_reuse from
connectDatabase() in common.c, but this needs a rather close look as
we should not create any password exposure hazards.

For now I am switching the patch to returned with feedback as there is
much more to consider, and it did not attract much attention either.
--
Michael


signature.asc
Description: PGP signature


Re: recovery starting when backup_label exists, but not recovery.signal

2019-09-27 Thread Masahiko Sawada
On Fri, Sep 27, 2019 at 3:36 AM David Steele  wrote:
>
> On 9/24/19 1:25 AM, Fujii Masao wrote:
> >
> > When backup_label exists, the startup process enters archive recovery mode
> > even if recovery.signal file doesn't exist. In this case, the startup 
> > process
> > tries to retrieve WAL files by using restore_command. Then, at the beginning
> > of the archive recovery, the contents of backup_label are copied to 
> > pg_control
> > and backup_label file is removed. This would be an intentional behavior.
>
> > But I think the problem is that, if the server shuts down during that
> > archive recovery, the restart of the server may cause the recovery to fail
> > because neither backup_label nor recovery.signal exist and the server
> > doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
> >
> > So the problematic scenario is;
> >
> > 1. the server starts with backup_label, but not recovery.signal.
> > 2. the startup process enters an archive recovery mode because
> > backup_label exists.
> > 3. the contents of backup_label are copied to pg_control and
> > backup_label is deleted.
>
> Do you mean deleted or renamed to backup_label.old?
>
> > 4. the server shuts down..
>
> This happens after the cluster has reached consistency?
>
> > 5. the server is restarted. neither backup_label nor recovery.signal exist.
> > 6. the startup process starts just crash recovery because neither 
> > backup_label
> > nor recovery.signal exist. Since it cannot retrieve WAL files from 
> > archival
> > area, it may fail.
>
> I tried a few ways to reproduce this but was not successful without
> manually removing WAL.

Hmm, me too.  I think that since we enter crash recovery at step #6,
we don't retrieve WAL files from the archive area.

But I reproduced the problem Fujii-san mentioned: restarting the
server during archive recovery leads to crash recovery instead of
resuming the archive recovery.  That is different behavior from
version 11 and earlier, and I personally think it makes things worse.

Regards,

--
Masahiko Sawada




Re: Batch insert in CTAS/MatView code

2019-09-27 Thread Asim R P
On Thu, Sep 26, 2019 at 7:13 PM Alvaro Herrera 
wrote:
>
> On 2019-Sep-25, Asim R P wrote:
>
> > I reviewed your patch today.  It looks good overall.  My concern is that
> > the ExecFetchSlotHeapTuple call does not seem appropriate.  In a generic
> > place such as createas.c, we should be using generic tableam API only.
> > However, I can also see that there is no better alternative.  We need to
> > compute the size of accumulated tuples so far, in order to decide
whether
> > to stop accumulating tuples.  There is no convenient way to obtain the
> > length of the tuple, given a slot.  How about making that decision
solely
> > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple
call
> > altogether?
>
> ... maybe we should add a new operation to slots, that returns the
> (approximate?) size of a tuple?  That would make this easy.  (I'm not
> sure however what to do about TOAST considerations -- is it size in
> memory that we're worried about?)

That will help.  For slots containing heap tuples, heap_compute_data_size()
is what we need.  An approximate size is better than nothing.
In the case of CTAS, we are dealing with slots returned by a scan node.
Wouldn't TOAST datums already be expanded in those slots?
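
For illustration only (not from any posted patch), one way such a helper
could look, using heap_compute_data_size() on the slot's deformed
columns; note that still-toasted datums only count their toast-pointer
size here, which seems acceptable for a flush heuristic:

/* Sketch: approximate data size of the tuple currently stored in a slot.
 * Needs access/htup_details.h and executor/tuptable.h. */
static Size
slot_approx_tuple_size(TupleTableSlot *slot)
{
	/* make sure tts_values/tts_isnull are populated for all attributes */
	slot_getallattrs(slot);

	/* data area only (no header/null bitmap), close enough for a heuristic */
	return heap_compute_data_size(slot->tts_tupleDescriptor,
								  slot->tts_values,
								  slot->tts_isnull);
}

The CTAS receiver could then flush its buffered tuples once the
accumulated approximate size crosses some threshold, in addition to a
cap on the number of buffered tuples.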

Asim


Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

2019-09-27 Thread Michael Paquier
On Thu, Sep 26, 2019 at 01:13:56AM +0900, Fujii Masao wrote:
> On Tue, Sep 24, 2019 at 10:41 AM Michael Paquier  wrote:
>> This also points out that there are other things to worry about than
>> interruptions, as for example DropRelFileNodeLocalBuffers() could lead
>> to an ERROR, and this happens before the physical truncation is done
>> but after the WAL record is replayed on the standby, so any failures
>> happening at the truncation phase before the work is done would be a
>> problem.  However we are talking about failures which should not
>> happen and these are elog() calls.  It would be tempting to add a
>> critical section here, but we could still have problems if we have a
>> failure after the WAL record has been flushed, which means that it
>> would be replayed on the standby, and the surrounding comments are
>> clear about that.
> 
> Could you elaborate what problem adding a critical section there occurs?

Wrapping the call to smgrtruncate() within RelationTruncate() in a
critical section would make things worse from the user's perspective on
the primary, no?  If the physical truncation fails, we would still
fail WAL replay on the standby, but instead of generating an ERROR in
the session of the user attempting the TRUNCATE, the whole primary
would be taken down, since any ERROR raised inside a critical section
is promoted to PANIC.
--
Michael


signature.asc
Description: PGP signature