Re: SegFault on 9.6.14
On Sat, Jul 27, 2019 at 8:29 AM Thomas Munro wrote: > > On Fri, Jul 26, 2019 at 4:13 PM Amit Kapila wrote: > > On Tue, Jul 23, 2019 at 5:28 PM Amit Kapila wrote: > > > Right, that will be lesser code churn and it can also work. However, > > > one thing that needs some thought is till now es_top_eflags is only > > > set in ExecutorStart and same is mentioned in comments where it is > > > declared and it seems we are going to change that with this idea. How > > > about having a separate function ExecBlahShutdown which will clean up > > > resources as parallel context and can be called only from ExecutePlan > > > where we are calling ExecShutdownNode? I think both these and the > > > other solution we have discussed are on similar lines and another idea > > > could be to relax the assert which again is not a superb idea. > > > > It seems we don't have a clear preference for any particular solution > > among these and neither there appears to be any better idea. I guess > > we can wait for a few days to see if Robert has any views on this, > > otherwise, pick one of the above and move ahead. > > I take the EXEC_FLAG_DONE idea back. It's ugly and too hard to verify > that every appropriate path sets it, and a flag that means the > opposite would be even more of a kluge, and generally I think I was > looking at this too myopically: I was looking for a way to shut down > processes ASAP without giving up the shared memory we'll need for > rescanning, but what I should have been looking at is the reason you > did that in the first place: to get the instrumentation data. Can you > explain why it's necessary to do that explicitly for Limit? Wouldn't > the right place to collect instrumentation be at the end of execution > when Shutdown will run in all cases anyway (and possibly also during > ExecParallelReinitialize() or something like that if it's being > clobbered by rescans, I didn't check)? What's special about Limit? 
I think here you are missing the point that to collect the instrumentation information one also needs to use InstrStartNode and InstrStopNode. So, for the Limit node, InstrStopNode would already have been done by the time we call the shutdown of workers at the end of execution. For a bit more detail, see [1][2][3].

> Today while poking at this and trying to answer those questions for
> myself, I realised that the repro I posted earlier[1] crashes exactly
> as Jerry reported on REL9_6_STABLE, but in later release branches it
> runs to completion. That's because the crashing code was removed in
> commit 41b0dd98 "Separate reinitialization of shared parallel-scan
> state from ExecReScan.".
>
> So newer branches get past that problem, but they all spit out tons of
> each of these three warnings:
>
> WARNING: buffer refcount leak: [172] (rel=base/12558/16390,
> blockNum=5, flags=0x9380, refcount=1 2998)
> ...
> WARNING: relcache reference leak: relation "join_bar" not closed
> ...
> WARNING: Snapshot reference leak: Snapshot 0x7ff20383bfb0 still referenced
> ...
>
> Oops.

This is due to exactly the same problem: before rescans, we have destroyed the shared memory. If you do the earlier trick of not cleaning up the shared memory till ExecEndNode, then you won't see this problem.

[1] - https://www.postgresql.org/message-id/CAA4eK1KZEbYKj9HHP-6WqqjAXuoB%2BWJu-w1s9uovj%3DeeBxC48Q%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CA%2BTgmoY3kcTcc5bFCZeY5NMFna-xaMPuTHA-z-z2Bmfg%2Bdb-XQ%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1L0KAZWgnRJz%3DVNVpyS3FFbVh8E5egyziaR0E10bC204Q%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: proposal: type info support functions for functions that use "any" type
On Fri, Jul 26, 2019 at 10:53 PM Tom Lane wrote:
> I wrote:
> > TBH, I don't like this proposal one bit. As far as I can see, the idea
> > is to let a function's support function redefine the function's declared
> > argument and result types on-the-fly according to no predetermined rules,
> > and that seems to me like it's a recipe for disaster. How will anyone
> > understand which function(s) are candidates to match a query, or why one
> > particular candidate got selected over others? It's already hard enough
> > to understand the behavior of polymorphic functions in complex cases,
> > and those are much more constrained than this would be.
>
> After thinking about this a bit more, it seems like you could avoid
> a lot of problems if you restricted what the support function call
> does to be potentially replacing the result type of a function
> declared to return ANY with some more-specific type (computed from
> examination of the actual arguments). That would make it act much
> more like a traditional polymorphic function. It'd remove the issues
> about interactions among multiple potentially-matching functions,
> since we'd only call a single support function for an already-identified
> target function.

I am not sure I understand completely, so let me restate it in my own words: calculating the result type (replacing ANY with some specific type) could be OK? I am able to implement that if there is agreement.

I proposed the possibility of specifying argument types as an optimization, as protection against repeated type identification and casting (which can be done at planning time and should not be repeated). This feature should be used only for functions of the form fx("any", "any", ..) returns "any", so it is very probable that at execution time you still have to do some work on parameter type identification. But if we find agreement just on working with the return type, that is a good enough solution. The practical overhead of a type cache inside the function should not be dramatic.
> You'd still need to touch everyplace that knows about polymorphic
> type resolution, since this would essentially be another form of
> polymorphic function. And I'm still very dubious that it's worth
> the trouble. But it would be a lot more controllable than the
> proposal as it stands.

ok

> regards, tom lane
Re: proposal: type info support functions for functions that use "any" type
On Fri, Jul 26, 2019 at 10:03 PM Tom Lane wrote:
> Pavel Stehule writes:
> > On Sat, Mar 9, 2019 at 7:22 AM Pavel Stehule wrote:
> >> Tom introduced support functions for calculating a function's selectivity.
> >> Still I have a similar idea: to use a support function for calculating a
> >> function's parameter types and function return type.
> >> Motivation: reduce the necessity of overloading functions. My motivation
> >> relates primarily to Orafce, but this feature should be helpful for anybody
> >> with similar goals. Function overloading is great functionality, but it is
> >> hard to maintain.
> > here is a patch
>
> TBH, I don't like this proposal one bit. As far as I can see, the idea
> is to let a function's support function redefine the function's declared
> argument and result types on-the-fly according to no predetermined rules,
> and that seems to me like it's a recipe for disaster. How will anyone
> understand which function(s) are candidates to match a query, or why one
> particular candidate got selected over others? It's already hard enough
> to understand the behavior of polymorphic functions in complex cases,
> and those are much more constrained than this would be.

I expect that this feature would be used without overloading. But the combination of a support function and overloading can be explicitly disabled (at runtime, for a simple implementation).

> Moreover, I don't think you've even provided a compelling example
> case. What's this doing that you couldn't do with existing polymorphic
> types or the anycompatibletype proposal?

There are three usage cases:

a) combinations of polymorphic types - fx(t1, t1, t2, t1, t2, t1, t2, ...)
b) forcing types - fx(t1, t2): t1 forces an explicit cast of t2 to t1
c) optimization of repeated calls of functions like fx("any", "any", "any", ...)

It is pretty hard to create a simple non-procedural language to describe signatures like @a.
But with procedural code it is easy. @c is a special case that we can already do, but we cannot push the casting outside the function, and inside the function there is casting overhead. Implementing the type cast inside the function increases startup time, which is overhead for functions started by the plpgsql runtime.

> I also strongly suspect that this would break pieces of the system
> that expect that the stored pg_proc.prorettype has something to do
> with reality. At minimum, you'd need to fix a number of places you
> haven't touched here that have their own knowledge of function type
> resolution, such as enforce_generic_type_consistency,
> resolve_polymorphic_argtypes, resolve_aggregate_transtype. Probably
> anyplace that treats polymorphics as being any sort of special case
> would have to be taught to re-call the support function to find out
> what it should think the relevant types are.
>
> (I don't even want to think about what happens if the support function's
> behavior changes between original parsing and these re-checking spots.)

The helper function should be immutable - as far as I know, it is not possible to change data types dynamically, so repeated calls may not be efficient, but they will produce the same result, so it should not be a problem.

> Another thing that's very much less than compelling about your example
> is that your support function seems to be happy to throw errors
> if the argument types don't match what it's expecting. That seems
> quite unacceptable, since it would prevent the parser from moving on
> to consider other possibly-matching functions. Maybe that's just
> because it's a quick hack not a polished example, but it doesn't
> seem like a good precedent.

In this case that is a deliberate decision, because I don't expect overloading. I understand your objections about mixing parser helper functions and overloading.
Currently it is pretty hard to understand what the expected behavior will be when somebody overloads a function with a polymorphic function. With a parser helper function, overloading is not necessary and can be disabled.

> In short, I think the added complexity and bug potential outweigh
> any possible gain from this.
>
> regards, tom lane
Re: LLVM compile failing in seawasp
On Fri, Jun 7, 2019 at 12:13 PM Tom Lane wrote: > didier writes: > > c.h defines a C Min macro conflicting with llvm new class > > llvm:ElementCount Min member > > Really? Well, we will hardly be the only code they broke with that. > I think we can just wait for them to reconsider. FYI This is now on LLVM's release_90 branch, due out on August 28. -- Thomas Munro https://enterprisedb.com
Re: SegFault on 9.6.14
On Fri, Jul 26, 2019 at 4:13 PM Amit Kapila wrote: > On Tue, Jul 23, 2019 at 5:28 PM Amit Kapila wrote: > > Right, that will be lesser code churn and it can also work. However, > > one thing that needs some thought is till now es_top_eflags is only > > set in ExecutorStart and same is mentioned in comments where it is > > declared and it seems we are going to change that with this idea. How > > about having a separate function ExecBlahShutdown which will clean up > > resources as parallel context and can be called only from ExecutePlan > > where we are calling ExecShutdownNode? I think both these and the > > other solution we have discussed are on similar lines and another idea > > could be to relax the assert which again is not a superb idea. > > It seems we don't have a clear preference for any particular solution > among these and neither there appears to be any better idea. I guess > we can wait for a few days to see if Robert has any views on this, > otherwise, pick one of the above and move ahead. I take the EXEC_FLAG_DONE idea back. It's ugly and too hard to verify that every appropriate path sets it, and a flag that means the opposite would be even more of a kluge, and generally I think I was looking at this too myopically: I was looking for a way to shut down processes ASAP without giving up the shared memory we'll need for rescanning, but what I should have been looking at is the reason you did that in the first place: to get the instrumentation data. Can you explain why it's necessary to do that explicitly for Limit? Wouldn't the right place to collect instrumentation be at the end of execution when Shutdown will run in all cases anyway (and possibly also during ExecParallelReinitialize() or something like that if it's being clobbered by rescans, I didn't check)? What's special about Limit? 
Today while poking at this and trying to answer those questions for myself, I realised that the repro I posted earlier[1] crashes exactly as Jerry reported on REL9_6_STABLE, but in later release branches it runs to completion. That's because the crashing code was removed in commit 41b0dd98 "Separate reinitialization of shared parallel-scan state from ExecReScan.".

So newer branches get past that problem, but they all spit out tons of each of these three warnings:

WARNING: buffer refcount leak: [172] (rel=base/12558/16390, blockNum=5, flags=0x9380, refcount=1 2998)
...
WARNING: relcache reference leak: relation "join_bar" not closed
...
WARNING: Snapshot reference leak: Snapshot 0x7ff20383bfb0 still referenced
...

Oops. I don't know exactly why yet, but the problem goes away if you just comment out the offending ExecShutdownNode() call in nodeLimit.c. I tried to understand whether the buffer stats were wrong with that code commented out (Adrien Nayrat's original complaint[2]), but I ran out of time for debugging adventures today.

[1] https://www.postgresql.org/message-id/CA%2BhUKGJyqDp9FZSHLTjiNMcz-c6%3DRdStB%2BUjVZsR8wfHnJXy8Q%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/86137f17-1dfb-42f9-7421-82fd786b04a1%40anayrat.info

--
Thomas Munro
https://enterprisedb.com
Re: POC: Cleaning up orphaned files using undo logs
Hi,

On 2019-07-25 17:51:33 +1200, Thomas Munro wrote:
> 1. WAL's use of fdatasync(): The reason we fill and then fsync()
> newly created WAL files up front is because we want to make sure the
> blocks are definitely on disk. The comment doesn't spell out exactly
> why the author considered later fdatasync() calls to be insufficient,
> but they were: it was many years after commit 33cc5d8a4d0d that Linux
> ext3/4 filesystems began flushing file size changes to disk in
> fdatasync()[1][2]. I don't know if its original behaviour was
> intentional or not. So, if you didn't use the bigger fsync() hammer
> on that OS, you might lose the end of a recently extended file in a
> power failure even though fdatasync() had returned success.
>
> By my reading of POSIX, that shouldn't be necessary on a conforming
> implementation of fdatasync(), and that was fixed years ago in Linux.
> I'm not proposing any changes there, and I'm not proposing to take
> advantage of that in the new code. I'm pointing out that we
> don't have to worry about that for these undo segments, because they
> are already flushed with fsync(), not fdatasync().
>
> (To understand POSIX's descriptions of fsync() and fdatasync() you
> have to find the meanings of "Synchronized I/O Data Integrity
> Completion" and "Synchronized I/O File Integrity Completion" elsewhere
> in the spec. TL;DR: fdatasync() is only allowed to skip flushing
> attributes like the modified time, it's not allowed to skip flushing a
> file size change since that would interfere with retrieving the data.)

Note that there are very good performance reasons for trying to avoid metadata changes at e.g. commit time. They're commonly journaled at the FS level, which can add a good chunk of IO and synchronization to an operation that we commonly want to be as fast as possible. Basically you often at least double the amount of synchronous writes.
And for the potential future where we use async direct IO - writes that change the file size take considerably slower codepaths, and add a lot of synchronization. I suspect that's much more likely to be the reason for the preallocation in 33cc5d8a4d0d than avoiding an ext* bug (I doubt the bug you reference existed back then; IIUC it didn't apply to ext2, and ext3 was introduced after 33cc5d8a4d0d).

> 2. Time of reservation: Although they don't call fsync(), regular
> relations and these new undo files still write zeroes up front
> (respectively, for a new block and for a new segment). One reason for
> that is that most popular filesystems reserve space at write time, so
> you'll get ENOSPC when trying to allocate undo space, and that's a
> non-fatal ERROR. If we deferred until writing back buffer contents,
> we might get file holes, and deferred ENOSPC is much harder to report
> to users and for users to deal with.

FWIW, the hole bit I don't quite buy - we could zero the hole at that time (and not be worse than today, except that it might be done by somebody that didn't cause the extension), or even better just look up the buffers between the FS end of the relation and the block currently written, and write them out in order. The whole thing with deferred ENOSPC being harder to report to users is obviously true regardless of that.

> BTW we could probably use posix_fallocate() instead of writing zeroes;
> I think Andres mentioned that recently. I see also that someone tried
> that for WAL and it got reverted back in 2013 (commit
> b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down
> the discussion).

IIRC the problem from back then was that while the space is reserved at the FS level, the actual blocks don't contain zeroes at that time. Which means that:

a) Small writes need to write more, because the surrounding data also needs to be zeroed (annoying but not terrible).
b) Writes into the fallocated but not written range IIRC effectively cause metadata writes, because while the "allocated file ending" doesn't change anymore, the new "non-zero written to" file ending does need to be journaled to disk before an f[data]sync - otherwise you could end up with the old value after a crash, and would read spurious zeroes. That's quite bad.

Those don't necessarily apply to e.g. extending relations, as we don't granularly fsync them. Although even there the performance picture is mixed - it helps a lot in certain workloads, but there are others where it mildly regresses performance on ext4. Not sure why yet; possibly it's due to more heavyweight locking needed when later changing the "non-zero size", or it's the additional metadata changes. I suspect those would be mostly gone if we didn't write back blocks in random order under memory pressure.

Note that neither of those means that it's not a good idea to posix_fallocate() and *then* write zeroes when initializing. For several filesystems that's more likely to result in more optimally sized filesystem extents, reducing fragmentation.
Re: Patch for SortSupport implementation on inet/cdir
On Fri, Jul 26, 2019 at 6:58 PM Peter Geoghegan wrote: > I found this part of your approach confusing: > > > + /* > > +* Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8. > > +* > > +* However, only some of the bits may have made it into the fixed sized > > +* datum, so take the smallest number between bits in the subnet and > > bits > > +* in the datum which are not part of the network. > > +*/ > > + datum_subnet_size = Min(ip_maxbits(authoritative) - > > ip_bits(authoritative), > > + SIZEOF_DATUM * BITS_PER_BYTE - > > ip_bits(authoritative)); > > The way that you put a Min() on the subnet size potentially constrains > the size of the bitmask used for the network component of the > abbreviated key (the component that comes immediately after the > ipfamily status bit). Why not just let the bitmask be a bitmask, > without bringing SIZEOF_DATUM into it? Doing it that way allowed for a > more streamlined approach, with significantly fewer special cases. I'm > not sure whether or not your approach had bugs, but I didn't like the > way you sometimes did a straight "network = ipaddr_datum" assignment > without masking. I guess that the idea here was to prevent masking on ipv6 addresses, though not on ipv4 addresses. Obviously we're only dealing with a prefix with ipv6 addresses, whereas we usually have the whole raw ipaddr with ipv4. Not sure if I'm doing the right thing there in v3, even though the tests pass. In any case, this will need to be a lot clearer in the final version. -- Peter Geoghegan
Re: Patch for SortSupport implementation on inet/cdir
On Fri, Feb 8, 2019 at 10:20 AM Brandur Leach wrote: > Attached a V2 patch: identical to V1 except rebased and > with a new OID selected. Attached is a revised version that I came up with, based on your v2. I found this part of your approach confusing: > + /* > +* Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8. > +* > +* However, only some of the bits may have made it into the fixed sized > +* datum, so take the smallest number between bits in the subnet and bits > +* in the datum which are not part of the network. > +*/ > + datum_subnet_size = Min(ip_maxbits(authoritative) - > ip_bits(authoritative), > + SIZEOF_DATUM * BITS_PER_BYTE - > ip_bits(authoritative)); The way that you put a Min() on the subnet size potentially constrains the size of the bitmask used for the network component of the abbreviated key (the component that comes immediately after the ipfamily status bit). Why not just let the bitmask be a bitmask, without bringing SIZEOF_DATUM into it? Doing it that way allowed for a more streamlined approach, with significantly fewer special cases. I'm not sure whether or not your approach had bugs, but I didn't like the way you sometimes did a straight "network = ipaddr_datum" assignment without masking. I really liked your diagrams, but much of the text that went with them either seemed redundant (it described established rules about how the underlying types sort), or seemed to discuss things that were better discussed next to the relevant network_abbrev_convert() code. Thoughts? -- Peter Geoghegan v3-0001-Add-sort-support-for-inet-cidr-opfamily.patch Description: Binary data
warning on reload for PGC_POSTMASTER, guc.c duplication, ...
Hi,

When specifying a PGC_POSTMASTER variable on the command line (i.e. -c something=other), the config processing emits a spurious warning on reload about not being able to change that value. E.g. when specifying shared_buffers via -c, I get:

2019-07-26 16:28:04.795 PDT [14464][] LOG: 0: received SIGHUP, reloading configuration files
2019-07-26 16:28:04.795 PDT [14464][] LOCATION: SIGHUP_handler, postmaster.c:2629
2019-07-26 16:28:04.798 PDT [14464][] LOG: 55P02: parameter "shared_buffers" cannot be changed without restarting the server
2019-07-26 16:28:04.798 PDT [14464][] LOCATION: set_config_option, guc.c:7107
2019-07-26 16:28:04.798 PDT [14464][] LOG: F: configuration file "/srv/dev/pgdev-dev/postgresql.conf" contains errors; unaffected changes were applied
2019-07-26 16:28:04.798 PDT [14464][] LOCATION: ProcessConfigFileInternal, guc-file.l:502

ISTM that the code blocks throwing these warnings:

	if (prohibitValueChange)
	{
		if (*conf->variable != newval)
		{
			record->status |= GUC_PENDING_RESTART;
			ereport(elevel,
					(errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
					 errmsg("parameter \"%s\" cannot be changed without restarting the server",
							name)));
			return 0;
		}
		record->status &= ~GUC_PENDING_RESTART;
		return -1;
	}

ought to only enter the error path if changeVal indicates that we're actually intending to apply the value. I.e. something roughly like the attached.

Two more things I noticed when looking at this code:

1) Aren't we leaking memory if prohibitValueChange is set, but newextra is present? The cleanup path for that:

	/* Perhaps we didn't install newextra anywhere */
	if (newextra && !extra_field_used(&conf->gen, newextra))
		free(newextra);

isn't reached in the prohibitValueChange path shown above. ISTM the return -1 in the prohibitValueChange path ought to be removed?

2) The amount of PGC_*-dependent code duplication in set_config_option() imo is over the top. ISTM that those branches should be merged, a call_*_check_hook wrapper should take care of invoking the check hooks, and another wrapper should take care of calling the assign hook, ->variable, and reset_val processing. Those wrappers could probably also reduce the amount of code in InitializeOneGUCOption(), parse_and_validate_value(), ResetAllOptions(), AtEOXact_GUC().

I'm also wondering whether we shouldn't just use config_var_value for at least config_*->{reset_val, boot_val}. It seems pretty clear that reset_extra ought to be moved? I'm even wondering whether the various hooks shouldn't actually just take config_var_value. But changing that would probably cause more pain to external users - in contrast to looking directly at reset_val, boot_val, reset_extra, they're much more likely to have hooks.

Greetings,

Andres Freund

diff --git i/src/backend/utils/misc/guc.c w/src/backend/utils/misc/guc.c
index fc463601ff3..0dde7d3bf40 100644
--- i/src/backend/utils/misc/guc.c
+++ w/src/backend/utils/misc/guc.c
@@ -7008,7 +7008,7 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
-			if (*conf->variable != newval)
+			if (changeVal && *conf->variable != newval)
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
@@ -7098,7 +7098,7 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
-			if (*conf->variable != newval)
+			if (changeVal && *conf->variable != newval)
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
@@ -7188,7 +7188,7 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
-			if (*conf->variable != newval)
+			if (changeVal && *conf->variable != newval)
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
@@ -7295,8 +7295,9 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
 			/* newval shouldn't be NULL, so we're a bit sloppy here */
-			if (*conf->variable == NULL || newval == NULL ||
-				strcmp(*conf->variable, newval) != 0)
+			if (changeVal && (*conf->variable == NULL ||
+							  newval == NULL ||
+							  strcmp(*conf->variable, newval) != 0))
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
@@ -7391,7 +7392,7 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
-			if (*conf->variable != newval)
+			if (changeVal && *conf->variable != newval)
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
Re: Duplicated LSN in ReorderBuffer
Hi,

Petr, Simon, see the potential issue related to fast forward at the bottom.

On 2019-07-26 18:46:35 -0400, Alvaro Herrera wrote:
> On 2019-Jul-09, Masahiko Sawada wrote:
> > I think the cause of this bug would be that a ReorderBufferTXN entry
> > of sub transaction is created as top-level transaction. And this
> > happens because we skip to decode ASSIGNMENT during the state of
> > snapshot builder < SNAPBUILD_FULL.
>
> That explanation seems to make sense.

Yea. The comment that "in the assignment case we'll not decode those xacts" is true, but it misses that we *do* currently process XLOG_HEAP2_NEW_CID records for transactions that started before reaching FULL_SNAPSHOT.

Thinking about it, it was not immediately clear to me that it is necessary to process XLOG_HEAP2_NEW_CID at that stage. We only need the cid mapping when decoding the contents of the transaction that the XLOG_HEAP2_NEW_CID record was about - which will not happen if it started before SNAPBUILD_FULL. Except that these records are also responsible for signalling that a transaction performed catalog modifications (cf the ReorderBufferXidSetCatalogChanges() call), which in turn is important for SnapBuildCommitTxn() to know whether that transaction needs to be included in historic snapshots.

So unless I am missing something - which is entirely possible, I've had this code thoroughly swapped out - that means that we only need to process XLOG_HEAP2_NEW_CID before SNAPBUILD_FULL if there can be transactions with relevant catalog changes that don't have any invalidation messages. After thinking about it for a bit, that's not guaranteed, however. For one, even for system catalog tables, looking at CacheInvalidateHeapTuple() et al there can be catalog modifications that create neither a snapshot invalidation message nor a catcache one. There's also the remote scenario that we might be modifying only a toast relation. But more importantly, the only modified table could be a user-defined catalog table (i.e.
WITH (user_catalog_table = true)), which in all likelihood won't cause invalidation messages.

So I think it is currently required to process NEW_CID records - although we don't need to actually execute the ReorderBufferAddNewTupleCids() etc calls. Perhaps the right fix for the future would actually be to not rely on NEW_CID for recognizing such transactions, but instead have an xact.c marker that signals whether a transaction performed catalog modifications. Hm, need to think more about this.

> > Instead, I wonder if we can decode ASSIGNMENT even when the state of
> > snapshot builder < SNAPBUILD_FULL_SNAPSHOT. That way, the
> > ReorderBufferTXN entries of both top transaction and sub transaction
> > are created properly before we decode NEW_CID.
>
> Yeah, that seems a sensible remediation to me.

That does seem like a reasonable approach. I can see two alternatives:

1) Change SnapBuildProcessNewCid()'s ReorderBufferXidSetCatalogChanges() call to reference the toplevel xid. That has the disadvantage that if the subtransaction that performed DDL rolls back, the main transaction will still be treated as a catalog transaction - I have a hard time seeing that being common, however. That'd then also require SnapBuildProcessNewCid() in SNAPBUILD_FULL_SNAPSHOT to return before processing any data assigned to subtransactions. Which would be good, because that's currently unnecessarily stored in memory.

2) We could simply assign the subtransaction to the parent using ReorderBufferAssignChild() in SnapBuildProcessNewCid() or its caller. That ought to also fix the bug. It also has the advantage that we can save some memory in transactions that have some, but fewer than the ASSIGNMENT limit, subtransactions, because it allows us to avoid having a separate base snapshot for them (c.f. ReorderBufferTransferSnapToParent()).
Like 1), that could be combined with adding an early return when < SNAPBUILD_FULL_SNAPSHOT, after ReorderBufferXidSetCatalogChanges(), but I don't think it'd be required for correctness, in contrast to 1).

Both of these would have the advantage that we would only track additional information for transactions that have modified the catalog, whereas the proposal to process ASSIGNMENT earlier would mean that we additionally track all transactions with more than 64 children.

So provided that I didn't mis-analyze here, I think both of my alternatives are preferable? I think 2) is simpler?

> /*
> - * No point in doing anything yet, data could not be decoded anyway. It's
> - * ok not to call ReorderBufferProcessXid() in that case, except in the
> - * assignment case there'll not be any later records with the same xid;
> - * and in the assignment case we'll not decode those xacts.
> + * If the snapshot isn't yet fully built, we cannot decode anything, so
> + * bail out.
> + *
> + * However, it's critical to process
RE: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
Hi Kyotaro,

Thank you so much for your valued feedback and suggestions!

> I assume that we are in a consensus about the problem we are to fix here.
>
> > 0a 0004`8080cc30 0004`80dcf917 postgres!PGSemaphoreLock+0x65 [d:\orcasqlagsea10\14\s\src\backend\port\win32_sema.c @ 158]
> > 0b 0004`8080cc90 0004`80db025c postgres!LWLockAcquire+0x137 [d:\orcasqlagsea10\14\s\src\backend\storage\lmgr\lwlock.c @ 1234]
> > 0c 0004`8080ccd0 0004`80db25db postgres!AbortBufferIO+0x2c [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 3995]
> > 0d 0004`8080cd20 0004`80dbce36 postgres!AtProcExit_Buffers+0xb [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 2479]
> > 0e 0004`8080cd50 0004`80dbd1bd postgres!shmem_exit+0xf6 [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 262]
> > 0f 0004`8080cd80 0004`80dbccfd postgres!proc_exit_prepare+0x4d [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 188]
> > 10 0004`8080cdb0 0004`80ef9e74 postgres!proc_exit+0xd [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 141]
> > 11 0004`8080cde0 0004`80ddb6ef postgres!errfinish+0x204 [d:\orcasqlagsea10\14\s\src\backend\utils\error\elog.c @ 624]
> > 12 0004`8080ce50 0004`80db0f59 postgres!mdread+0x12f [d:\orcasqlagsea10\14\s\src\backend\storage\smgr\md.c @ 806]

Yes, this is one of the two situations we want to fix. The other situation is a cascading exception case like the following.
#0 0x7f0fdb7cb6d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f0fd14c81b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f0fd14c81b8, abstime=0x0) at sem_waitcommon.c:111 #2 0x7f0fdb7cb7c8 in __new_sem_wait_slow (sem=0x7f0fd14c81b8, abstime=0x0) at sem_waitcommon.c:181 #3 0x5630d475658a in PGSemaphoreLock (sema=0x7f0fd14c81b8) at pg_sema.c:316 #4 0x5630d47f689e in LWLockAcquire (lock=0x7f0fd9ae9c00, mode=LW_EXCLUSIVE) at /path/to/postgres/source/build/../src/backend/storage/lmgr/lwlock.c:1243 #5 0x5630d47cd087 in AbortBufferIO () at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:3988 #6 0x5630d47cb3f9 in AtProcExit_Buffers (code=1, arg=0) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2473 #7 0x5630d47dbc32 in shmem_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272 #8 0x5630d47dba5e in proc_exit_prepare (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194 #9 0x5630d47db9c6 in proc_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107 #10 0x5630d49811bc in errfinish (dummy=0) at /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541 #11 0x5630d4801f1f in mdwrite (reln=0x5630d6588c68, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843 #12 0x5630d4804716 in smgrwrite (reln=0x5630d6588c68, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650 #13 0x5630d47cb824 in FlushBuffer (buf=0x7f0fd19e9c00, reln=0x5630d6588c68) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751 #14 0x5630d47cb219 in SyncOneBuffer (buf_id=0, skip_recently_used=false, wb_context=0x7ffccc371a70) at
/path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394 #15 0x5630d47cab00 in BufferSync (flags=6) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984 #16 0x5630d47cb57f in CheckPointBuffers (flags=6) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578 #17 0x5630d44a685b in CheckPointGuts (checkPointRedo=23612304, flags=6) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149 #18 0x5630d44a62cf in CreateCheckPoint (flags=6) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937 #19 0x5630d44a45e3 in StartupXLOG () at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:7723 #20 0x5630d4995f88 in InitPostgres (in_dbname=0x5630d65582b0 "postgres", dboid=0, username=0x5630d653d7d0 "chengyu", useroid=0, out_dbname=0x0, override_allow_connections=false) at /path/to/postgres/source/build/../src/backend/utils/init/postinit.c:636 #21 0x5630d480b68b in PostgresMain (argc=7, argv=0x5630d6534d20, dbname=0x5630d65582b0 "postgres", username=0x5630d653d7d0 "chengyu") at /path/to/postgres/source/build/../src/backend/tcop/postgres.c:3810 #22 0x5630d4695e8b in main (argc=7, argv=0x5630d6534d20) at /path/to/postgres/source/build/../src/backend/main/main.c:224 Though ENOSPC is avoided by reservation in PG,
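Both traces end the same way: an error raised while the buffer I/O lock is still held escalates to proc_exit(), and the AtProcExit_Buffers -> AbortBufferIO cleanup then tries to re-acquire that very lock. Since LWLocks are not re-entrant and in single user mode there is no other process to release the lock, the process sleeps on the semaphore forever. A minimal sketch of that failure pattern, using an invented toy lock type rather than the real LWLock code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a PostgreSQL LWLock in a single-process (single user mode)
 * server: it is not re-entrant, and no other process exists to release it. */
typedef struct { bool held; } ToyLWLock;

typedef enum { ACQ_OK, ACQ_SELF_DEADLOCK } AcquireResult;

/* Acquiring a lock we already hold can never succeed in single user mode:
 * nobody else will ever release it, so the real LWLockAcquire would sleep
 * on its semaphore forever, exactly as in the reported stacks. */
static AcquireResult toy_acquire(ToyLWLock *lk)
{
    if (lk->held)
        return ACQ_SELF_DEADLOCK;
    lk->held = true;
    return ACQ_OK;
}

/* StartBufferIO analogue: the I/O lock is taken, then the md.c read/write
 * fails and the ERROR escalates to proc_exit() with the lock still held. */
static void start_io_then_error(ToyLWLock *io_lock)
{
    assert(toy_acquire(io_lock) == ACQ_OK);
}

/* AbortBufferIO analogue, run from AtProcExit_Buffers during exit: it
 * re-acquires the same lock, which is where the process hangs. */
static AcquireResult abort_buffer_io(ToyLWLock *io_lock)
{
    return toy_acquire(io_lock);
}
```

This is only an illustration of the deadlock condition the fix has to break, not the proposed patch itself.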
Re: TopoSort() fix
Andres Freund writes: > On 2019-07-26 18:05:38 -0400, Alvaro Herrera wrote: >> Hello, is anybody looking into this issue? > I guess this is on Robert's docket otherwise. He's on vacation till > early next week... I think this is a sufficiently obvious bug, and a sufficiently obvious fix, that we should just fix it and not insist on getting a reproducible test case first. I think a test case would soon bit-rot anyway, and no longer exercise the problem. I certainly do *not* want to wait so long that we miss the upcoming minor releases. regards, tom lane
Re: proposal: make NOTIFY list de-duplication optional
Filip Rembiałkowski writes: > Still no hash table fallback is implemented, so this is *not* a > performance improvement. Only a little more flexibility. I think that we'd probably be better off fixing the root performance issue than adding semantic complexity to bypass it. Especially since, if you don't de-duplicate, that's going to cost you when it comes time to actually send the notifications, receive them, forward them to clients, and process them in the clients. Admittedly, if the app *knows* that it's generating non-duplicate events, maybe it'd be all right to skip checking that. But if we can make the check cheap, I'd just as soon keep it. Accordingly, I looked into making a hash table when there are more than a small number of notifications pending, and attached is a lightly-tested version of that. This seems to be more or less similar speed to the existing code for up to 100 or so distinct notifies, but it soon pulls away above that. A point that needs discussion is that this patch, unlike the existing code, *does* de-duplicate fully: any events generated by a subtransaction that duplicate events already emitted by a parent will get removed when the subxact is merged to its parent. I did this partly because we have to expend O(N) work to merge N subtransaction notifies in any case, now that we have to make new hashtable entries in the parent xact; so the old excuse that subxact-end processing is really cheap no longer applies. Also because the Assert(!found) assertions in the attached hash coding fall over if we cheat on this. If we really want to maintain exactly the old semantics here, we could relax the hashtable code to just ignore duplicate entries. But, per the argument above, de-duplication is a good thing so I'm inclined to keep it like this. I also noticed that as things stand, it costs us two or three pallocs to construct a Notification event.
It wouldn't be terribly hard to reduce that to one palloc, and maybe it'd be worthwhile if we're thinking that transactions with many many notifies are a case worth optimizing. But I didn't do that here; it seems like a separable patch. I also thought for awhile about not having the hashtable as an auxiliary data structure, but making it the main data structure. We could preserve the required notification ordering information by threading the live hashtable entries into an slist, say. However, this would greatly increase the overhead for transactions with just one or a few distinct NOTIFY events, since we'd have to set up the hashtable at the first one. I think that's a common enough case that we shouldn't de-optimize it. A smaller objection is that such a data structure would absolutely commit us to de-duplication semantics, whereas the list plus separate hashtable can cope with not de-duping if someone persuades us that's sane. Thoughts? regards, tom lane diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c index 6e9c580..c21daa5 100644 --- a/src/backend/commands/async.c +++ b/src/backend/commands/async.c @@ -135,6 +135,7 @@ #include "storage/sinval.h" #include "tcop/tcopprot.h" #include "utils/builtins.h" +#include "utils/hashutils.h" #include "utils/memutils.h" #include "utils/ps_status.h" #include "utils/snapmgr.h" @@ -323,14 +324,25 @@ static List *upperPendingActions = NIL; /* list of upper-xact lists */ /* * State for outbound notifies consists of a list of all channels+payloads - * NOTIFYed in the current transaction. We do not actually perform a NOTIFY - * until and unless the transaction commits. pendingNotifies is NIL if no - * NOTIFYs have been done in the current transaction. + * NOTIFYed in the current transaction. We do not actually perform a NOTIFY + * until and unless the transaction commits. pendingNotifies is NULL if no + * NOTIFYs have been done in the current (sub) transaction. 
+ * + * We discard duplicate notify events issued in the same transaction. + * Hence, in addition to the list proper (which we need to track the order + * of the events, since we guarantee to deliver them in order), we build a + * hash table which we can probe to detect duplicates. Since building the + * hash table is somewhat expensive, we do so only once we have at least + * MIN_HASHABLE_NOTIFIES events queued in the current (sub) transaction; + * before that we just scan the events linearly. * * The list is kept in CurTransactionContext. In subtransactions, each * subtransaction has its own list in its own CurTransactionContext, but - * successful subtransactions attach their lists to their parent's list. - * Failed subtransactions simply discard their lists. + * successful subtransactions add their entries to their parent's list. + * Failed subtransactions simply discard their lists. Since these lists + * are independent, there may be notify events in a subtransaction's list + * that duplicate events in some ancestor (sub) transaction; we get rid
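The scheme described above (keep the ordered pending list, and probe a hash table for duplicates only once the list grows past a threshold) can be sketched roughly as follows. This is a toy model for illustration, not the async.c patch: the names, the demo threshold of 4, and the fixed-size open-addressing table are all invented here.

```c
#include <assert.h>
#include <string.h>

#define MAX_EVENTS 64
#define HTAB_SIZE 256               /* power of two, open addressing */
#define MIN_HASHABLE_NOTIFIES 4     /* demo threshold; the real patch uses a larger one */

static const char *events[MAX_EVENTS];  /* pending list, preserves delivery order */
static int  n_events = 0;
static const char *htab[HTAB_SIZE];
static int  hashed = 0;             /* has the hash table been built yet? */

/* djb2 string hash */
static unsigned hash_str(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h;
}

/* Insert into the hash table; returns 1 if the event was already there. */
static int htab_insert(const char *ev)
{
    unsigned i = hash_str(ev) & (HTAB_SIZE - 1);

    while (htab[i])
    {
        if (strcmp(htab[i], ev) == 0)
            return 1;               /* duplicate */
        i = (i + 1) & (HTAB_SIZE - 1);
    }
    htab[i] = ev;
    return 0;
}

/* Queue one notify event, discarding duplicates.  Small transactions pay
 * only for a linear scan of the list; once the list reaches the threshold
 * we build the hash table from the existing entries and probe it from
 * then on. */
static void queue_notify(const char *ev)
{
    if (!hashed && n_events >= MIN_HASHABLE_NOTIFIES)
    {
        for (int i = 0; i < n_events; i++)
            htab_insert(events[i]);
        hashed = 1;
    }

    if (hashed)
    {
        if (htab_insert(ev))
            return;                 /* duplicate, drop it */
    }
    else
    {
        for (int i = 0; i < n_events; i++)
            if (strcmp(events[i], ev) == 0)
                return;             /* duplicate, drop it */
    }
    events[n_events++] = ev;        /* list keeps delivery order */
}
```

The point of the threshold is the one Tom makes about not de-optimizing transactions with only a handful of NOTIFYs: they never pay the cost of building the table at all.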
Re: TopoSort() fix
Hi, On 2019-07-26 18:05:38 -0400, Alvaro Herrera wrote: > On 2019-Jul-04, Rui Hai Jiang wrote: > > > I'll try to figure out some scenarios to do the test. A parallel process > > group is needed for the test. Rui, have you made any progress on this? > > Actually I was trying to do some testing against the locking mechanism. I > > happened to see this issue. > > Hello, is anybody looking into this issue? I guess this is on Robert's docket otherwise. He's on vacation till early next week... Greetings, Andres Freund
Re: Duplicated LSN in ReorderBuffer
On 2019-Jul-09, Masahiko Sawada wrote: > I think the cause of this bug would be that a ReorderBufferTXN entry > of sub transaction is created as top-level transaction. And this > happens because we skip to decode ASSIGNMENT during the state of > snapshot builder < SNAPBUILD_FULL. That explanation seems to make sense. > Instead, I wonder if we can decode ASSIGNMENT even when the state of > snapshot builder < SNAPBUILD_FULL_SNAPSHOT. That way, the > ReorderBufferTXN entries of both top transaction and sub transaction > are created properly before we decode NEW_CID. Yeah, that seems a sensible remediation to me. I would reduce the scope a little bit -- only create the assignment in the BUILDING state, and skip it in the START state. I'm not sure that it's possible to get assignments while in START state that are significant (I'm still trying to digest SnapBuildFindSnapshot). I would propose the attached. Andres, do you have an opinion on this? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 0d37a9a64bc20414742a8e9497dc8dffeeea16d1 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 26 Jul 2019 18:38:44 -0400 Subject: [PATCH v3] decode XACT_ASSIGNMENT while building snapshot --- src/backend/replication/logical/decode.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index 151c3ef882..a6f7dc2723 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -215,14 +215,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) ReorderBuffer *reorder = ctx->reorder; XLogReaderState *r = buf->record; uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; + SnapBuildState state; /* - * No point in doing anything yet, data could not be decoded anyway.
It's - * ok not to call ReorderBufferProcessXid() in that case, except in the - * assignment case there'll not be any later records with the same xid; - * and in the assignment case we'll not decode those xacts. + * If the snapshot isn't yet fully built, we cannot decode anything, so + * bail out. + * + * However, it's critical to process XLOG_XACT_ASSIGNMENT records even + * when the snapshot is being built: it is possible to get later records + * that require subxids to be properly assigned. */ - if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) + state = SnapBuildCurrentState(builder); + if (state < SNAPBUILD_FULL_SNAPSHOT || + (info == XLOG_XACT_ASSIGNMENT && state < SNAPBUILD_BUILDING_SNAPSHOT)) return; switch (info) -- 2.17.1
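To see why skipping ASSIGNMENT bites: one way the reorder buffer learns that a subtransaction XID belongs to a top-level XID is the XLOG_XACT_ASSIGNMENT record, and if that record is not decoded, a later record carrying the subxid gets a ReorderBufferTXN created for it as if it were a top-level transaction, which matches Sawada-san's diagnosis. A toy sketch of that bookkeeping (all names invented, not the ReorderBuffer code):

```c
#include <assert.h>

#define MAX_XID 1024

/* Toy assignment map: 0 means "no ASSIGNMENT seen for this xid", in which
 * case the xid is treated as a top-level transaction of its own -- the
 * mistaken state the bug report describes. */
static unsigned assigned_top[MAX_XID];

/* XLOG_XACT_ASSIGNMENT handling reduced to its essence: remember which
 * top-level xact owns the subxid. */
static void process_assignment(unsigned topxid, unsigned subxid)
{
    assigned_top[subxid] = topxid;
}

/* Which transaction a later record stamped with this xid belongs to. */
static unsigned owning_xact(unsigned xid)
{
    return assigned_top[xid] ? assigned_top[xid] : xid;
}
```

The patch's goal is simply to make sure the first call happens (while the snapshot is still being built) before any record forces the second one.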
Re: Attached partition not considering altered column properties of root partition.
On 2019-Jul-03, Amit Langote wrote: > Thanks for the report. This seems like a bug. Documentation claims > that the child tables inherit column storage options from the parent > table. That's actually enforced in only some cases. > To fix this, MergeAttributesIntoExisting() should check that the > attribute options of a child don't conflict with the parent, which the > attached patch implements. Note that partitioning uses the same code > as inheritance, so the fix applies to it too. After the patch: Thanks for the patch! I'm not completely sold on back-patching this. Should we consider changing it in 12beta and up only, or should we just backpatch it all the way back to 9.4? It's going to raise errors in cases that previously worked. On the patch itself: I think ERRCODE_DATATYPE_MISMATCH is not the correct one to use for this; maybe ERRCODE_INVALID_OBJECT_DEFINITION or, more likely, ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE. "FOO versus BAR" does not sound proper style. Maybe "Child table has FOO, parent table expects BAR." Or maybe put it all in errmsg, errmsg("child table \"%s\" has storage option \"%s\" for column \"%s\" mismatching \"%s\" on parent", -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: TopoSort() fix
On 2019-Jul-04, Rui Hai Jiang wrote: > I'll try to figure out some scenarios to do the test. A parallel process > group is needed for the test. > > Actually I was trying to do some testing against the locking mechanism. I > happened to see this issue. Hello, is anybody looking into this issue? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Multivariate MCV list vs. statistics target
On Fri, Jul 26, 2019 at 07:03:41AM +, Jamison, Kirk wrote: On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote: >+ /* XXX What if the target is set to 0? Reset the statistic? */ > >This also makes me wonder. I haven't looked deeply into the code, but >since 0 is a valid value, I believe it should reset the stats. I agree resetting the stats after setting the target to 0 seems quite reasonable. But that's not what we do for attribute stats, because in that case we simply skip the attribute during the future ANALYZE runs - we don't reset the stats, we keep the existing stats. So I've done the same thing here, and I've removed the XXX comment. If we want to change that, I'd do it in a separate patch for both the regular and extended stats. Hi, Tomas Sorry for my late reply. You're right. I have no strong opinion whether we'd want to change that behavior. I've also confirmed the change in the patch where setting statistics target 0 skips the statistics. OK, thanks. Maybe only some minor nitpicks in the source code comments below: 1. "it's" should be "its": +* Compute statistic target, based on what's set for the statistic +* object itself, and for it's attributes. 2. Consistency whether you'd use either "statistic " or "statisticS ". Ex. statistic target vs statisticS target, statistics object vs statistic object, etc. Attached is v4 of the patch, with a couple more improvements: 1) I've renamed the if_not_exists flag to missing_ok, because that's more consistent with the "IF EXISTS" clause in the grammar (the old flag was kinda the exact opposite), and I've added a NOTICE about the skip. + bool missing_ok; /* do nothing if statistics does not exist */ Confirmed. So we ignore if statistic does not exist, and skip the error. Maybe to make it consistent with other data structures in parsernodes.h, you can change the comment of missing_ok to: /* skip error if statistics object does not exist */ Thanks, I've fixed all those places in the attached v5.
2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because that's what the function was doing anyway (computing sample size). 3) I've added a couple of regression tests to stats_ext.sql Aside from that, I've cleaned up a couple of places and improved a bunch of comments. Nothing huge. I have a question though regarding ComputeExtStatisticsRows. I'm just curious with the value 300 when computing sample size. Where did this value come from? + /* compute sample size based on the statistic target */ + return (300 * result); Overall, the patch is almost already in good shape for commit. I'll wait for the next update. That's how we compute number of rows to sample, based on the statistics target. See std_typanalyze() in analyze.c, which also cites the paper where this comes from. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/ref/alter_statistics.sgml b/doc/src/sgml/ref/alter_statistics.sgml index 58c7ed020d..be4c3f1f05 100644 --- a/doc/src/sgml/ref/alter_statistics.sgml +++ b/doc/src/sgml/ref/alter_statistics.sgml @@ -26,6 +26,7 @@ PostgreSQL documentation ALTER STATISTICS name OWNER TO { new_owner | CURRENT_USER | SESSION_USER } ALTER STATISTICS name RENAME TO new_name ALTER STATISTICS name SET SCHEMA new_schema +ALTER STATISTICS name SET STATISTICS new_target @@ -93,6 +94,22 @@ ALTER STATISTICS name SET SCHEMA + + new_target + + +The statistic-gathering target for this statistics object for subsequent + ANALYZE operations. +The target can be set in the range 0 to 10000; alternatively, set it +to -1 to revert to using the system default statistics +target (). +For more information on the use of statistics by the +PostgreSQL query planner, refer to +.
+ + + + diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 8d633f2585..c27df32d92 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -307,7 +307,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params, VacAttrStats **vacattrstats; AnlIndexData *indexdata; int targrows, - numrows; + numrows, + minrows; double totalrows, totaldeadrows; HeapTuple *rows; @@ -491,6 +492,16 @@ do_analyze_rel(Relation onerel, VacuumParams *params, } } + /* +* Look at extended statistics objects too, as those may define custom +* statistics target. So we may need to sample more rows and then build +* the statistics with enough detail. +*/ +
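To make Tomas's answer concrete: the sample size is simply 300 times the statistics target, the same rule std_typanalyze() applies for per-attribute stats (so the default target of 100 samples 30000 rows). A trivial restatement, with a helper name of my own choosing:

```c
#include <assert.h>

/* Sample-size rule cited from std_typanalyze() in analyze.c: sample
 * 300 * target rows.  The constant 300 comes from the histogram-error
 * bound in the paper that function's comments cite. */
static int stats_sample_rows(int stats_target)
{
    return 300 * stats_target;
}
```

So raising the target on an extended statistics object linearly raises the number of rows ANALYZE must read for it.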
Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )
On 2019-Jul-25, Michael Paquier wrote: > On Wed, Jul 24, 2019 at 11:23:30AM -0400, Alvaro Herrera wrote: > > Heh, yesterday I revised the original patch as attached and was about to > > push when the bell rang. I like this one because it keeps the comment > > to one line and it mentions the function name in charge of the > > validation (useful for grepping later on). It's a bit laconic because > > of the long function name and the desire to keep it to one line, but it > > seems sufficient to me. > > Looks fine to me. A nit: addition of braces for the if block. Even > if that's a one-liner, there is a comment so I think that this makes the > code more readable. Yeah, I had removed those on purpose, but that was probably inconsistent with my own reviews of others' patches. I pushed it with them. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: double free in ExecHashJoin, 9.6.12
On Wed, Jul 24, 2019 at 11:01 PM Thomas Munro wrote: > > On Thu, Jul 25, 2019 at 2:39 AM Merlin Moncure wrote: > > Server is generally running pretty well, and is high volume. This > > query is not new and is also medium volume. Database rebooted in > > about 4 seconds with no damage; fast enough we didn't even trip alarms > > (I noticed this troubleshooting another issue). We are a couple of > > bug fixes releases behind but I didn't see anything obvious in the > > release notes suggesting a resolved issue. Anyone have any ideas? > > thanks in advance. > > > postgres: rms ysconfig 10.33.190.21(36788) > > SELECT(ExecHashJoin+0x5a2)[0x5e2d32] > > Hi Merlin, > > Where's the binary from (exact package name, if installed with a > package)? Not sure if this is going to help, but is there any chance > you could disassemble that function so we can try to see what it's > doing at that offset? For example on Debian if you have > postgresql-9.6 and postgresql-9.6-dbg installed you could run "gdb > /usr/lib/postgresql/9.6/bin/postgres" and then "disassemble > ExecHashJoin". The code at "<+1442>" (0x5a2) is presumably calling > free or some other libc thing (though I'm surprised not to see an > intervening palloc thing). Thanks -- great suggestion. I'll report back with any interesting findings. merlin
Re: proposal: type info support functions for functions that use "any" type
I wrote: > TBH, I don't like this proposal one bit. As far as I can see, the idea > is to let a function's support function redefine the function's declared > argument and result types on-the-fly according to no predetermined rules, > and that seems to me like it's a recipe for disaster. How will anyone > understand which function(s) are candidates to match a query, or why one > particular candidate got selected over others? It's already hard enough > to understand the behavior of polymorphic functions in complex cases, > and those are much more constrained than this would be. After thinking about this a bit more, it seems like you could avoid a lot of problems if you restricted what the support function call does to be potentially replacing the result type of a function declared to return ANY with some more-specific type (computed from examination of the actual arguments). That would make it act much more like a traditional polymorphic function. It'd remove the issues about interactions among multiple potentially-matching functions, since we'd only call a single support function for an already-identified target function. You'd still need to touch everyplace that knows about polymorphic type resolution, since this would essentially be another form of polymorphic function. And I'm still very dubious that it's worth the trouble. But it would be a lot more controllable than the proposal as it stands. regards, tom lane
Re: pgbench - allow to create partitioned tables
Attached v3 fixes strcasecmp non-portability on Windows, per postgresql patch tester. -- Fabien. diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 816f9cc4c7..3e8e292e39 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -306,6 +306,32 @@ pgbench options d + + --partitions=NUM + + +Create a partitioned pgbench_accounts table with +NUM partitions of nearly equal size for +the scaled number of accounts. +Default is 0, meaning no partitioning. + + + + + + --partition-method=NAME + + +Create a partitioned pgbench_accounts table with +NAME method. +Expected values are range or hash. +This option is only taken into account if +--partitions is non-zero. +Default is range. + + + + --tablespace=tablespace diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 570cf3306a..6fa8ed7f81 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -186,6 +186,11 @@ int64 latency_limit = 0; char *tablespace = NULL; char *index_tablespace = NULL; +/* partitioning for pgbench_accounts table, 0 for no partitioning */ +int partitions = 0; +enum { PART_RANGE, PART_HASH } + partition_method = PART_RANGE; + /* random seed used to initialize base_random_sequence */ int64 random_seed = -1; @@ -617,6 +622,9 @@ usage(void) " --foreign-keys create foreign key constraints between tables\n" " --index-tablespace=TABLESPACE\n" " create indexes in the specified tablespace\n" + " --partitions=NUM partition account table in NUM parts (defaults: 0)\n" + " --partition-method=(range|hash)\n" + " partition account table with this method (default: range)\n" " --tablespace=TABLESPACE create tables in the specified tablespace\n" " --unlogged-tables create tables as unlogged tables\n" "\nOptions to select what to run:\n" @@ -3601,6 +3609,17 @@ initDropTables(PGconn *con) "pgbench_tellers"); } +/* + * add fillfactor percent option if not 100.
+ */ +static void +append_fillfactor(char *opts, int len) +{ + if (fillfactor < 100) + snprintf(opts + strlen(opts), len - strlen(opts), + " with (fillfactor=%d)", fillfactor); +} + /* * Create pgbench's standard tables */ @@ -3625,6 +3644,7 @@ initCreateTables(PGconn *con) const char *bigcols; /* column decls if accountIDs are 64 bits */ int declare_fillfactor; }; + static const struct ddlinfo DDLs[] = { { "pgbench_history", @@ -3651,11 +3671,10 @@ initCreateTables(PGconn *con) 1 } }; - int i; fprintf(stderr, "creating tables...\n"); - for (i = 0; i < lengthof(DDLs); i++) + for (int i = 0; i < lengthof(DDLs); i++) { char opts[256]; char buffer[256]; @@ -3664,9 +3683,17 @@ initCreateTables(PGconn *con) /* Construct new create table statement. */ opts[0] = '\0'; - if (ddl->declare_fillfactor) + + /* Partition pgbench_accounts table */ + if (partitions >= 1 && strcmp(ddl->table, "pgbench_accounts") == 0) + { snprintf(opts + strlen(opts), sizeof(opts) - strlen(opts), - " with (fillfactor=%d)", fillfactor); + " partition by %s (aid)", + partition_method == PART_RANGE ? 
"range" : "hash"); + } + else if (ddl->declare_fillfactor) + append_fillfactor(opts, sizeof(opts)); + if (tablespace != NULL) { char *escape_tablespace; @@ -3686,6 +3713,54 @@ initCreateTables(PGconn *con) executeStatement(con, buffer); } + + if (partitions >= 1) + { + int64 part_size = (naccounts * (int64) scale + partitions - 1) / partitions; + char ff[64]; + ff[0] = '\0'; + append_fillfactor(ff, sizeof(ff)); + + fprintf(stderr, "creating %d partitions...\n", partitions); + + for (int p = 1; p <= partitions; p++) + { + char query[256]; + + if (partition_method == PART_RANGE) + { +char minvalue[32], maxvalue[32]; + +if (p == 1) + sprintf(minvalue, "MINVALUE"); +else + sprintf(minvalue, INT64_FORMAT, (p-1) * part_size + 1); + +if (p < partitions) + sprintf(maxvalue, INT64_FORMAT, p * part_size + 1); +else + sprintf(maxvalue, "MAXVALUE"); + +snprintf(query, sizeof(query), + "create%s table pgbench_accounts_%d\n" + " partition of pgbench_accounts\n" + " for values from (%s) to (%s)%s\n", + unlogged_tables ? " unlogged" : "", p, + minvalue, maxvalue, ff); + } + else if (partition_method == PART_HASH) +snprintf(query, sizeof(query), + "create%s table pgbench_accounts_%d\n" + " partition of pgbench_accounts\n" + " for values with (modulus %d, remainder %d)%s\n", + unlogged_tables ? " unlogged" : "", p, + partitions, p-1, ff); + else /* cannot get there */ +
Re: Built-in connection pooler
Hi Konstantin, I've started reviewing this patch and experimenting with it, so let me share some initial thoughts. 1) not handling session state (yet) I understand handling session state would mean additional complexity, so I'm OK with not having it in v1. That being said, I think this is the primary issue with connection pooling on PostgreSQL - configuring and running a separate pool is not free, of course, but when people complain to us it's when they can't actually use a connection pool because of this limitation. So what are your plans regarding this feature? I think you mentioned you already have the code in another product. Do you plan to submit it in the pg13 cycle, or what's the plan? I'm willing to put some effort into reviewing and testing that. FWIW it'd be nice to expose it as some sort of interface, so that other connection pools can leverage it too. There are use cases that don't work with a built-in connection pool (say, PAUSE/RESUME in pgbouncer allows restarting the database) so projects like pgbouncer or odyssey are unlikely to disappear anytime soon. I also wonder if we could make it more permissive even in v1, without implementing dump/restore of session state. Consider for example patterns like this: BEGIN; SET LOCAL enable_nestloop = off; ... COMMIT; or PREPARE x(int) AS SELECT ...; EXECUTE x(1); EXECUTE x(2); ... EXECUTE x(10); DEALLOCATE x; or perhaps even CREATE FUNCTION f() AS $$ ... $$ LANGUAGE sql SET enable_nestloop = off; In all those cases (and I'm sure there are other similar examples) the connection pool considers the session 'tainted': it marks it as such, and we never reset that. So even when an application tries to play nice, it can't use pooling. Would it be possible to maybe track this with more detail (number of prepared statements, ignore SET LOCAL, ...)? That should allow us to do pooling even without full support for restoring session state. 2) configuration I think we need to rethink how the pool is configured.
The options available at the moment are more a consequence of the implementation and are rather cumbersome to use in some cases. For example, we have session_pool_size, which is (essentially) the number of backends kept in the pool. Which seems fine at first, because it seems like you might say max_connections = 100 session_pool_size = 50 to say the connection pool will only ever use 50 connections, leaving the rest for "direct" connection. But that does not work at all, because the number of backends the pool can open is session_pool_size * connection_proxies * databases * roles which pretty much means there's no limit, because while we can specify the number of proxies, the number of databases and roles is arbitrary. And there's no way to restrict which dbs/roles can use the pool. So you can happily do this max_connections = 100 connection_proxies = 4 session_pool_size = 10 pgbench -c 24 -U user1 test1 pgbench -c 24 -U user2 test2 pgbench -c 24 -U user3 test3 pgbench -c 24 -U user4 test4 at which point it's pretty much game over, because each proxy has 4 pools, each with ~6 backends, 96 backends in total. And because non-tainted connections are never closed, no other users/dbs can use the pool (will just wait indefinitely). To allow practical configurations, I think we need to be able to define: * which users/dbs can use the connection pool * minimum/maximum pool size per user, per db and per user/db * maximum number of backend connections We need to be able to close connections when needed (when not assigned, and we need the connection for someone else). Plus those limits need to be global, not "per proxy" - it's just strange that increasing connection_proxies bumps up the effective pool size. I don't know what's the best way to specify this configuration - whether to store it in a separate file, in some system catalog, or what. 3) monitoring I think we need much better monitoring capabilities. 
At this point we have a single system catalog (well, a SRF) giving us proxy-level summary. But I think we need much more detailed overview - probably something like pgbouncer has - listing of client/backend sessions, with various details. Of course, that's difficult to do when those lists are stored in private memory of each proxy process - I think we need to move this to shared memory, which would also help to address some of the issues I mentioned in the previous section (particularly that the limits need to be global, not per proxy). 4) restart_pooler_on_reload I find it quite strange that restart_pooler_on_reload binds restart of the connection pool to reload of the configuration file. That seems like a rather surprising behavior, and I don't see why would you ever want that? Currently it seems like the only way to force the proxies to close the connections (the docs mention DROP DATABASE), but why shouldn't we have separate functions to do that? In particular, why would
Re: proposal: type info support functions for functions that use "any" type
Pavel Stehule writes: > On Sat, Mar 9, 2019 at 7:22 AM Pavel Stehule > wrote: >> Tom introduced supported functions for calculation function's selectivity. >> Still I have similar idea to use supported function for calculation >> function's parameter's types and function return type. >> Motivation: >> Reduce a necessity of overloading of functions. My motivation is related >> primary to Orafce, but this feature should be helpful for anybody with >> similar goals. The function's overloading is great functionality but it is >> hard for maintenance. > here is a patch TBH, I don't like this proposal one bit. As far as I can see, the idea is to let a function's support function redefine the function's declared argument and result types on-the-fly according to no predetermined rules, and that seems to me like it's a recipe for disaster. How will anyone understand which function(s) are candidates to match a query, or why one particular candidate got selected over others? It's already hard enough to understand the behavior of polymorphic functions in complex cases, and those are much more constrained than this would be. Moreover, I don't think you've even provided a compelling example case. What's this doing that you couldn't do with existing polymorphic types or the anycompatibletype proposal? I also strongly suspect that this would break pieces of the system that expect that the stored pg_proc.prorettype has something to do with reality. At minimum, you'd need to fix a number of places you haven't touched here that have their own knowledge of function type resolution, such as enforce_generic_type_consistency, resolve_polymorphic_argtypes, resolve_aggregate_transtype. Probably anyplace that treats polymorphics as being any sort of special case would have to be taught to re-call the support function to find out what it should think the relevant types are.
(I don't even want to think about what happens if the support function's behavior changes between original parsing and these re-checking spots.) Another thing that's very much less than compelling about your example is that your support function seems to be happy to throw errors if the argument types don't match what it's expecting. That seems quite unacceptable, since it would prevent the parser from moving on to consider other possibly-matching functions. Maybe that's just because it's a quick hack not a polished example, but it doesn't seem like a good precedent. In short, I think the added complexity and bug potential outweigh any possible gain from this. regards, tom lane
Re: Optimization of some jsonb functions
Thomas Munro wrote: > This doesn't apply -- to attract reviewers, could we please have a rebase? To help the review go forward, I have rebased the patch on 27cd521e6e. It passes `make check` for me, but that's as far as I've verified the correctness. I squashed the changes into a single patch, sorry if that makes it harder to review than the original set of five patch files... -- Joe Nelson https://begriffs.com diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 69f41ab455..8dced4ef6c 100644 --- a/src/backend/utils/adt/jsonb.c +++ b/src/backend/utils/adt/jsonb.c @@ -1873,9 +1873,7 @@ jsonb_object_agg_finalfn(PG_FUNCTION_ARGS) bool JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) { - JsonbIterator *it; - JsonbIteratorToken tok PG_USED_FOR_ASSERTS_ONLY; - JsonbValue tmp; + JsonbValue *scalar PG_USED_FOR_ASSERTS_ONLY; if (!JsonContainerIsArray(jbc) || !JsonContainerIsScalar(jbc)) { @@ -1884,25 +1882,8 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) return false; } - /* - * A root scalar is stored as an array of one element, so we get the array - * and then its first (and only) member. 
- */ - it = JsonbIteratorInit(jbc); - - tok = JsonbIteratorNext(&it, &tmp, true); - Assert(tok == WJB_BEGIN_ARRAY); - Assert(tmp.val.array.nElems == 1 && tmp.val.array.rawScalar); - - tok = JsonbIteratorNext(&it, res, true); - Assert(tok == WJB_ELEM); - Assert(IsAJsonbScalar(res)); - - tok = JsonbIteratorNext(&it, &tmp, true); - Assert(tok == WJB_END_ARRAY); - - tok = JsonbIteratorNext(&it, &tmp, true); - Assert(tok == WJB_DONE); + scalar = getIthJsonbValueFromContainer(jbc, 0, res); + Assert(scalar); return true; } diff --git a/src/backend/utils/adt/jsonb_op.c b/src/backend/utils/adt/jsonb_op.c index a64206eeb1..82c4b0b2cb 100644 --- a/src/backend/utils/adt/jsonb_op.c +++ b/src/backend/utils/adt/jsonb_op.c @@ -24,6 +24,7 @@ jsonb_exists(PG_FUNCTION_ARGS) Jsonb *jb = PG_GETARG_JSONB_P(0); text *key = PG_GETARG_TEXT_PP(1); JsonbValue kval; + JsonbValue vval; JsonbValue *v = NULL; /* @@ -38,7 +39,7 @@ jsonb_exists(PG_FUNCTION_ARGS) v = findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &kval); + &kval, &vval); PG_RETURN_BOOL(v != NULL); } @@ -59,6 +60,7 @@ jsonb_exists_any(PG_FUNCTION_ARGS) for (i = 0; i < elem_count; i++) { JsonbValue strVal; + JsonbValue valVal; if (key_nulls[i]) continue; @@ -69,7 +71,7 @@ jsonb_exists_any(PG_FUNCTION_ARGS) if (findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &strVal) != NULL) + &strVal, &valVal) != NULL) PG_RETURN_BOOL(true); } @@ -92,6 +94,7 @@ jsonb_exists_all(PG_FUNCTION_ARGS) for (i = 0; i < elem_count; i++) { JsonbValue strVal; + JsonbValue valVal; if (key_nulls[i]) continue; @@ -102,7 +105,7 @@ jsonb_exists_all(PG_FUNCTION_ARGS) if (findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &strVal) == NULL) + &strVal, &valVal) == NULL) PG_RETURN_BOOL(false); } diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index ac04c4a57b..05e1c18472 100644 --- a/src/backend/utils/adt/jsonb_util.c +++ b/src/backend/utils/adt/jsonb_util.c @@ -57,6 +57,8 @@ static void appendValue(JsonbParseState *pstate, JsonbValue *scalarVal); static void
appendElement(JsonbParseState *pstate, JsonbValue *scalarVal); static int lengthCompareJsonbStringValue(const void *a, const void *b); static int lengthCompareJsonbPair(const void *a, const void *b, void *arg); +static int lengthCompareJsonbString(const char *val1, int len1, + const char *val2, int len2); static void uniqueifyJsonbObject(JsonbValue *object); static JsonbValue *pushJsonbValueScalar(JsonbParseState **pstate, JsonbIteratorToken seq, @@ -297,6 +299,102 @@ compareJsonbContainers(JsonbContainer *a, JsonbContainer *b) return res; } +/* Find scalar element in Jsonb array and return it. */ +static JsonbValue * +findJsonbElementInArray(JsonbContainer *container, JsonbValue *elem, + JsonbValue *res) +{ + JsonbValue *result; + JEntry *children = container->children; + int count = JsonContainerSize(container); + char *baseAddr = (char *) (children + count); + uint32 offset = 0; + int i; + + result = res ? res : palloc(sizeof(*result)); + + for (i = 0; i < count; i++) + { + fillJsonbValue(container, i, baseAddr, offset, result); + + if (elem->type == result->type && + equalsJsonbScalarValue(elem, result)) + return result; + + JBE_ADVANCE_OFFSET(offset, children[i]); + } + + if (!res) + pfree(result); + + return NULL; +} + +/* Find value by key in Jsonb object and fetch it into 'res'. */ +JsonbValue * +findJsonbKeyInObject(JsonbContainer *container, const char *keyVal, int keyLen, + JsonbValue *res) +{ + JEntry *children = container->children; + JsonbValue *result = res; + int count = JsonContainerSize(container); + /* Since this is an object, account for *Pairs* of
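To make the shape of findJsonbElementInArray in the patch above easier to follow, here is a minimal Python model of the same linear scan over a flattened array container. The layout is deliberately simplified (a list of child lengths plus a packed byte buffer); real jsonb JEntry headers encode more than a length, so treat the names and layout here as illustrative assumptions, not the actual on-disk format.

```python
# Toy model of the container layout the C code walks: child "JEntry" headers
# first (here just lengths), then the packed element bytes behind them.

def find_element_in_array(children, base, elem):
    """Linear scan: decode each child at its running offset, compare to elem."""
    offset = 0
    for length in children:              # each entry gives the child's length
        value = base[offset:offset + length]
        if value == elem:
            return offset                # found; caller can decode from here
        offset += length                 # JBE_ADVANCE_OFFSET equivalent
    return None                          # not found (C version returns NULL)

# Pack three string elements: lengths up front, raw bytes behind them.
items = [b"apple", b"pear", b"plum"]
children = [len(i) for i in items]
base = b"".join(items)

assert find_element_in_array(children, base, b"pear") == 5
assert find_element_in_array(children, base, b"kiwi") is None
```

The point of the patch is the same as here: a root scalar or a sought element can be fetched by direct offset arithmetic instead of spinning up a full iterator.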
Re: seems like a bug in pgbench -R
Fabien COELHO writes: >> |...] So I'll mark this ready for committer. > Ok, thanks for the review. LGTM, pushed. regards, tom lane
Re: Optimze usage of immutable functions as relation
I wrote: > * It would be useful for the commentary to point out that in principle we > could pull up any immutable (or, probably, even just stable) expression; > but we don't, (a) for fear of multiple evaluations of the result costing > us more than we can save, and (b) because a primary goal is to let the > constant participate in further const-folding, and of course that won't > happen for a non-Const. BTW, so far as I can see, none of the test cases demonstrate that such further const-folding can happen. Since this is all pretty processing- order-sensitive, I think an explicit test that that's possible would be a good idea. regards, tom lane
Re: Optimze usage of immutable functions as relation
Anastasia Lubennikova writes: > New version is in attachments. I took a quick look at this and I have a couple of gripes --- * The naming and documentation of transform_const_function_to_result seem pretty off-point to me. ISTM the real goal of that function is to pull up constant values from RTE_FUNCTION RTEs. Replacing the RTE with an RTE_RESULT is just a side-effect that's needed so that we don't generate a useless FunctionScan plan node. I'd probably call it pull_up_constant_function or something like that. * It would be useful for the commentary to point out that in principle we could pull up any immutable (or, probably, even just stable) expression; but we don't, (a) for fear of multiple evaluations of the result costing us more than we can save, and (b) because a primary goal is to let the constant participate in further const-folding, and of course that won't happen for a non-Const. * The test cases are really pretty bogus, because int4(1) or int4(0) are not function invocations at all. The parser thinks those are no-op type casts, and so what comes out of parse analysis is already just a plain Const. Thus, not one of these test cases is actually verifying that const-folding of an immutable function has happened before we try to pull up. While you could dodge the problem today by, say, writing int8(0) which isn't a no-op cast, I'd recommend staying away from typename() notation altogether. There's too much baggage there and too little certainty that you wrote a function call not something else. The existing test cases you changed, with int4(sin(1)) and so on, are better examples of something that has to actively be folded to a constant. regards, tom lane
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 7/1/19 5:20 PM, Alexey Kondratov wrote: Hi Thomas, On 01.07.2019 15:02, Thomas Munro wrote: Hi Alexey, This no longer applies. Since the Commitfest is starting now, could you please rebase it? Thank you for a reminder. Rebased version of the patch is attached. I've also modified my logging code in order to obey new unified logging system for command-line programs commited by Peter (cc8d415117). Regards Hi Alexey, I would like to suggest a couple of changes to docs and comments, please see the attachment. The "...or fetched on startup" part also seems wrong here, but it's not a part of your patch, so I'm going to ask about it on psql-docs separately. It might also be useful to reword the following error messages: - "using restored from archive version of file \"%s\"" - "could not open restored from archive file \"%s\" We could probably say something like "could not open file \"%s\" restored from WAL archive" instead. On a more general note, I wonder if everyone is happy with the --using-postgresql-conf option name, or we should continue searching for a narrower term. Unfortunately, I don't have any better suggestions right now, but I believe it should be clear that its purpose is to fetch missing WAL files for target. What do you think? -- Liudmila Mantrova Technical writer at Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 52a1caa..7e76fcc 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,9 +66,12 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. 
In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -r or + -R option to automatically retrieve them from the WAL + archive, or fetched on startup by configuring or . The use of pg_rewind is not limited to failover, e.g. a standby @@ -203,6 +206,39 @@ PostgreSQL documentation + -r + --use-postgresql-conf + + +Use the restore_command defined in +postgresql.conf to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory of the target cluster. + + +This option cannot be used together with --restore-command. + + + + + + -R restore_command + --restore-command=restore_command + + +Specifies the restore_command to use for retrieving +WAL files from the WAL archive if these files are no longer available +in the pg_wal directory of the target cluster. + + +If restore_command is already set in +postgresql.conf, you can provide the +--use-postgresql-conf option instead. + + + + + --debug @@ -288,7 +324,10 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b history forked off from the target cluster. For each WAL record, record each data block that was touched. This yields a list of all the data blocks that were changed in the target cluster, after the - source cluster forked off. + source cluster forked off. If some of the WAL files are no longer + available, try re-running pg_rewind with + the -r or -R option to search + for the missing files in the WAL archive. 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 287af60..5a7f759 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -12,6 +12,7 @@ #include "postgres_fe.h" #include +#include #include "pg_rewind.h" #include "filemap.h" @@ -44,6 +45,7 @@ static char xlogfpath[MAXPGPATH]; typedef struct XLogPageReadPrivate { const char *datadir; + const char *restoreCommand; int tliIndex; } XLogPageReadPrivate; @@ -52,6 +54,9 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader, int reqLen, XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *pageTLI); +static int RestoreArchivedWAL(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand); + /* * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline * index 'tliIndex' in target timeline history, until 'endpoint'. Make note of
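The RestoreArchivedWAL helper the patch introduces builds a shell command from the configured restore_command by substituting the standard placeholders, the same way the server's own archive recovery does. Below is a small Python sketch of that placeholder expansion; the function name is mine, but the %f (WAL file name), %p (destination path), and %% (literal percent) semantics follow the documented restore_command contract.

```python
# Expand a restore_command template the way archive recovery does:
#   %f -> name of the WAL file to fetch
#   %p -> path the file should be restored to
#   %% -> a literal '%'

def build_restore_command(template, xlogfname, xlogpath):
    out, i = [], 0
    while i < len(template):
        if template[i] == "%" and i + 1 < len(template):
            c = template[i + 1]
            if c == "f":                 # %f: WAL file name only
                out.append(xlogfname); i += 2; continue
            if c == "p":                 # %p: destination path
                out.append(xlogpath); i += 2; continue
            if c == "%":                 # %%: literal percent sign
                out.append("%"); i += 2; continue
        out.append(template[i]); i += 1
    return "".join(out)

cmd = build_restore_command("cp /archive/%f %p",
                            "000000010000000000000005",
                            "pg_wal/RECOVERYXLOG")
assert cmd == "cp /archive/000000010000000000000005 pg_wal/RECOVERYXLOG"
```

The resulting string is then handed to the shell; pg_rewind additionally checks that the restored file has the expected size before using it.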
Re: Built-in connection pooler
I attached a patch to apply after your latest patch [2] with my suggested changes to the docs. I tried to make things read smoother without altering your meaning. I don't think the connection pooler chapter fits in The SQL Language section, it seems more like Server Admin functionality so I moved it to follow the chapter on HA, load balancing and replication. That made more sense to me looking at the overall ToC of the docs. Thank you. I have committed your documentation changes in my Git repository and attached a new patch with your corrections. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 84341a3..2758506 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -719,6 +719,123 @@ include_dir 'conf.d' + + max_sessions (integer) + + max_sessions configuration parameter + + + + + The maximum number of client sessions that can be handled by + one connection proxy when session pooling is enabled. + This parameter does not add any memory or CPU overhead, so + specifying a large max_sessions value + does not affect performance. + If the max_sessions limit is reached, new connections are not accepted. + + + The default value is 1000. This parameter can only be set at server start. + + + + + + session_pool_size (integer) + + session_pool_size configuration parameter + + + + + Enables session pooling and defines the maximum number of + backends that can be used by client sessions for each database/user combination. + Launched non-tainted backends are never terminated even if there are no active sessions. + A backend is considered tainted if the client updates GUCs, creates temporary tables or prepared statements. + A tainted backend can serve only one client. 
+ + + The default value is 10, so up to 10 backends will serve each database, + + + + + + proxy_port (integer) + + proxy_port configuration parameter + + + + + Sets the TCP port for the connection pooler. + Clients connected to the main "port" will be assigned dedicated backends, + while clients connected to the proxy port will be connected to backends through a proxy which + performs transaction-level scheduling. + + The default value is 6543. + + + + + + connection_proxies (integer) + + connection_proxies configuration parameter + + + + + Sets the number of connection proxies. + The postmaster spawns a separate worker process for each proxy and scatters connections between proxies using one of the scheduling policies (round-robin, random, load-balancing). + Each proxy launches its own subset of backends. + So the maximal number of non-tainted backends is session_pool_size*connection_proxies*databases*roles. + + The default value is 0, so session pooling is disabled. + + + + + + session_schedule (enum) + + session_schedule configuration parameter + + + + + Specifies the scheduling policy for assigning sessions to proxies when + connection pooling is enabled. The default policy is round-robin. + + With the round-robin policy, the postmaster cyclically scatters sessions between proxies. + + With the random policy, the postmaster randomly chooses a proxy for each new session. + + With the load-balancing policy, the postmaster chooses the proxy with the lowest load average. + A proxy's load average is estimated from the number of client connections assigned to it, with extra weight for SSL connections. + + + + + + restart_pooler_on_reload (string) + + restart_pooler_on_reload configuration parameter + + + + + Restart session pool workers once pg_reload_conf() is called. + The default value is false. 
+ + + + unix_socket_directories (string) diff --git a/doc/src/sgml/connpool.sgml b/doc/src/sgml/connpool.sgml new file mode 100644 index 000..a4b2720 --- /dev/null +++ b/doc/src/sgml/connpool.sgml @@ -0,0 +1,173 @@ + + + + Connection pooling + + + built-in connection pool proxy + + + +PostgreSQL spawns a separate process (backend) for each client. +For large number of clients this model can consume a large number of system +resources and lead to significant performance degradation, especially on computers with large +number of CPU cores. The
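The session_schedule policies described in the docs patch above (round-robin, random, load-balancing with extra weight for SSL connections) can be sketched in a few lines. This is a hypothetical illustration, not code from the patch: the class name, the 1.5 SSL weight, and the load bookkeeping are assumptions made only to show how the three policies differ.

```python
import random

# Hedged sketch of the three proxy-scheduling policies from the docs above.
class ProxyScheduler:
    def __init__(self, n_proxies, policy="round-robin"):
        self.loads = [0.0] * n_proxies   # per-proxy load estimate
        self.policy = policy
        self.next = 0                    # round-robin cursor

    def assign(self, is_ssl=False):
        if self.policy == "round-robin":
            proxy = self.next
            self.next = (self.next + 1) % len(self.loads)
        elif self.policy == "random":
            proxy = random.randrange(len(self.loads))
        else:  # load-balancing: pick the currently least-loaded proxy
            proxy = min(range(len(self.loads)), key=self.loads.__getitem__)
        # SSL connections count with extra weight (1.5 here is an assumption)
        self.loads[proxy] += 1.5 if is_ssl else 1.0
        return proxy

s = ProxyScheduler(2, "load-balancing")
# One weighted SSL session makes proxy 0 "heavier", so the next two plain
# sessions both land on proxy 1.
assert [s.assign(is_ssl=True), s.assign(), s.assign()] == [0, 1, 1]
```

The load-balancing branch shows why the docs mention estimating load per proxy: without the per-proxy counters, the policy degenerates into round-robin.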
Re: POC: Cleaning up orphaned files using undo logs
On Fri, 26 Jul 2019 at 12:25, Amit Kapila wrote: > I agree with all your other comments. Thanks for addressing the comments. Below is the continuation of my comments from 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch : + * Perform rollback request. We need to connect to the database for first + * request and that is required because we access system tables while for first request and that is required => for the first request. This is required --- +UndoLauncherShmemSize(void) +{ +Size size; + +/* + * Need the fixed struct and the array of LogicalRepWorker. + */ +size = sizeof(UndoApplyCtxStruct); The fixed structure size should be offsetof(UndoApplyCtxStruct, workers) rather than sizeof(UndoApplyCtxStruct) --- In UndoWorkerCleanup(), we set individual fields of the UndoApplyWorker structure, whereas in UndoLauncherShmemInit(), for all the UndoApplyWorker array elements, we just memset all the UndoApplyWorker structure elements to 0. I think we should be consistent for the two cases. I guess we can just memset to 0 as you do in UndoLauncherShmemInit(), but this will cause the worker->undo_worker_queue to be 0 i.e. XID_QUEUE , whereas in UndoWorkerCleanup(), it is set to -1. Is the -1 value essential, or we can just set it to XID_QUEUE initially ? Also, if we just use memset in UndoWorkerCleanup(), we need to first save generation into a temp variable, and then after memset(), restore it back. That brought me to another point : We already have a macro ResetUndoRequestInfo(), so UndoWorkerCleanup() can just call ResetUndoRequestInfo(). +bool allow_peek; + +CHECK_FOR_INTERRUPTS(); + +allow_peek = !TimestampDifferenceExceeds(started_at, Some comments would be good about what is allow_peek used for. 
Something like : "Arrange to prevent the worker from restarting quickly to switch databases" - +++ b/src/backend/access/undo/README.UndoProcessing - +worker then start reading from one of the queues the requests for that start=>starts --- +work, it lingers for UNDO_WORKER_LINGER_MS (10s as default). This avoids As per the latest definition, it is 20s. IMHO, there's no need to mention the default value in the readme. --- +++ b/src/backend/access/undo/discardworker.c --- + * portion of transaction that is overflowed into a separate log can be processed 80-col crossed. +#include "access/undodiscard.h" +#include "access/discardworker.h" Not in alphabetical order +++ b/src/backend/access/undo/undodiscard.c --- +next_insert = UndoLogGetNextInsertPtr(logno); I checked UndoLogGetNextInsertPtr() definition. It calls find_undo_log_slot() to get back the slot from logno. Why not make it accept slot as against logno ? At all other places, the slot->logno is passed, so it is convenient to just pass the slot there. And in UndoDiscardOneLog(), first call find_undo_log_slot() just before the above line (or call it at the end of the do-while loop). This way, during each of the UndoLogGetNextInsertPtr() calls in undorequest.c, we will have one less find_undo_log_slot() call. My suggestion is of course valid only under the assumption that when you call UndoLogGetNextInsertPtr(fooslot->logno), then inside UndoLogGetNextInsertPtr(), find_undo_log_slot() will return back the same fooslot. - In UndoDiscardOneLog(), there are at least 2 variable declarations that can be moved inside the do-while loop : uur and next_insert. I am not sure about the other variables viz : undofxid and latest_discardxid. Values of these variables in one iteration continue across to the second iteration. For latest_discardxid, it looks like we do want its value to be carried forward, but is it also true for undofxid ? + /* If we reach here, this means there is something to discard. 
*/ + need_discard = true; + } while (true); Also, about need_discard; there is no place where need_discard is set to false. That means, from 2nd iteration onwards, it will never be false. So even if the code that explicitly sets need_discard to true does not get run, still the undolog will be discarded. Is this expected ? - +if (request_rollback && dbid_exists(uur->uur_txn->urec_dbid)) +{ +(void) RegisterRollbackReq(InvalidUndoRecPtr, + undo_recptr, + uur->uur_txn->urec_dbid, + uur->uur_fxid); + +pending_abort = true; +} We can get rid of request_rollback variable. Whatever the "if" block above is doing, do it in this upper condition : if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) Something like this : if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) { if (dbid_exists(uur->uur_txn->urec_dbid)) {
Re: Fetching timeline during recovery
On Fri, 26 Jul 2019 10:02:58 +0200 Jehan-Guillaume de Rorthais wrote: > On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time) > Kyotaro Horiguchi wrote: [...] > > We have an LSN reporting function each for several objectives. > > > > pg_current_wal_lsn > > pg_current_wal_insert_lsn > > pg_current_wal_flush_lsn > > pg_last_wal_receive_lsn > > pg_last_wal_replay_lsn > > Yes. In fact, my current implementation might be split as: > > pg_current_wal_tl: returns TL on a production cluster > pg_last_wal_received_tl: returns last received TL on a standby > > If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and > *flush_tl would be useful as a cluster in production is not supposed to > change its timeline during its lifetime. > > > But, I'm not sure just adding further pg_last_*_timeline() to > > this list is a good thing.. > > I think this is a much better idea than mixing different cases (production and > standby) in the same function as I did. Moreover, it's much more coherent with > other existing functions. Please find attached a new version of the patch. It now creates two new functions: pg_current_wal_tl() pg_last_wal_received_tl() Regards, From 1e21fb7203e66ed514129d41c9bbf947c5284d7b Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Thu, 25 Jul 2019 19:36:40 +0200 Subject: [PATCH] Add functions to get timeline pg_current_wal_tl() returns the current timeline of a cluster in production. pg_last_wal_received_tl() returns the timeline of the last xlog record flushed to disk. 
--- src/backend/access/transam/xlog.c | 17 + src/backend/access/transam/xlogfuncs.c | 20 src/backend/replication/walreceiver.c | 19 +++ src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 12 5 files changed, 69 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index da3d250986..fd30c88534 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12243,3 +12243,20 @@ XLogRequestWalReceiverReply(void) { doRequestWalReceiverReply = true; } + +/* + * Returns current active timeline. + */ +TimeLineID +GetCurrentTimeLine(void) +{ + TimeLineID localTimeLineID; + + SpinLockAcquire(&XLogCtl->info_lck); + + localTimeLineID = XLogCtl->ThisTimeLineID; + + SpinLockRelease(&XLogCtl->info_lck); + + return localTimeLineID; +} diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index b35043bf71..ae877be351 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -776,3 +776,23 @@ pg_promote(PG_FUNCTION_ARGS) (errmsg("server did not promote within %d seconds", wait_seconds))); PG_RETURN_BOOL(false); } + +/* + * Returns the current timeline on a production cluster + */ +Datum +pg_current_wal_tl(PG_FUNCTION_ARGS) +{ + TimeLineID currentTL; + + if (RecoveryInProgress()) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("%s cannot be executed during recovery.", + "pg_current_wal_tl()"))); + + currentTL = GetCurrentTimeLine(); + + PG_RETURN_INT32(currentTL); +} diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 6abc780778..97d1c900c7 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -1454,3 +1454,22 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS) /* Returns the record as Datum */ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls))); } + +/* + * Returns 
the timeline of the last xlog record flushed to WAL + */ +Datum +pg_last_wal_received_tl(PG_FUNCTION_ARGS) +{ + TimeLineID localTimeLineID; + WalRcvData *walrcv = WalRcv; + + SpinLockAcquire(&walrcv->mutex); + localTimeLineID = walrcv->receivedTLI; + SpinLockRelease(&walrcv->mutex); + + if (!localTimeLineID) + PG_RETURN_NULL(); + + return localTimeLineID; +} diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index d519252aad..f0502c0b41 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -313,6 +313,7 @@ extern XLogRecPtr GetInsertRecPtr(void); extern XLogRecPtr GetFlushRecPtr(void); extern XLogRecPtr GetLastImportantRecPtr(void); extern void RemovePromoteSignalFiles(void); +extern TimeLineID GetCurrentTimeLine(void); extern bool CheckPromoteSignal(void); extern void WakeupRecovery(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 0902dce5f1..d7ec6ea100 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -6006,6 +6006,18 @@ { oid => '2851', descr => 'wal filename, given a wal location', proname => 'pg_walfile_name', prorettype => 'text', proargtypes => 'pg_lsn', prosrc => 'pg_walfile_name' }, +{ oid => '3434', + descr => 'current timeline', + proname
Re: Built-in connection pooler
> I attached new version of the patch with fixed indentation problems and > Win32 specific fixes. Great, this latest patch applies cleanly to master. installcheck world still passes. > "connections_proxies" is used mostly to toggle connection pooling. > Using more than 1 proxy is needed only for huge workloads (hundreds of > connections). My testing showed that using only one proxy performed very poorly versus not using the pooler, even at 300 connections, with -3% TPS. At lower numbers of connections it was much worse than other configurations I tried. I just shared my full pgbench results [1]; the "No Pool" and "# Proxies 2" data is what I used to generate the charts I previously shared. I had referred to the 1-proxy and 10-proxy data but hadn't shared those results; sorry about that. > And "session_pool_size" is the core parameter which determines the efficiency of > pooling. > The main trouble with it now, is that it is per database/user > combination. Each such combination will have its own connection pool. > Choosing the optimal number of pooler backends is a non-trivial task. It > certainly depends on the number of available CPU cores. > But if backends are mostly disk-bound, then the optimal number of pooler > workers can be larger than the number of cores. I will do more testing around this variable next. It seems that increasing session_pool_size for connection_proxies = 1 might help and leaving it at its default was my problem. > The PgPRO EE version of the connection pooler has an "idle_pool_worker_timeout" > parameter which allows terminating idle workers. +1 > It is possible to implement it for the vanilla version of the pooler as well. But > the primary intention of this patch was to minimize changes in the Postgres core Understood. I attached a patch to apply after your latest patch [2] with my suggested changes to the docs. I tried to make things read smoother without altering your meaning. 
I don't think the connection pooler chapter fits in The SQL Language section, it seems more like Server Admin functionality so I moved it to follow the chapter on HA, load balancing and replication. That made more sense to me looking at the overall ToC of the docs. Thanks, [1] https://docs.google.com/spreadsheets/d/11XFoR26eiPQETUIlLGY5idG3fzJKEhuAjuKp6RVECOU [2] https://www.postgresql.org/message-id/attachment/102848/builtin_connection_proxy-11.patch *Ryan* > builtin_connection_proxy-docs-1.patch Description: Binary data
Re: psql FETCH_COUNT feature does not work with combined queries
FETCH_COUNT does not work with combined queries, and probably has never worked since 2006. For the record, this bug has already been reported & discussed by Daniel Vérité a few months back, see: https://www.postgresql.org/message-id/flat/a0a854b6-563c-4a11-bf1c-d6c6f924004d%40manitou-mail.org Given the points of view expressed on this thread, especially Tom's view against improving the lexer used by psql to know where a query ends, it looks like my documentation patch is the way to go in the short term, and that if the "always show query results" patch is committed, it might somewhat simplify solving this issue as the PQexec call is turned into PQsendQuery. -- Fabien.
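To see why "improving the lexer used by psql" is the hard part here, consider the naive alternative: splitting a combined query string on semicolons. The deliberately broken Python sketch below (my illustration, not psql code) shows that semicolons inside single-quoted or dollar-quoted literals defeat any splitter that does not track quoting state, which is exactly the lexing complexity the thread is weighing.

```python
# A deliberately naive splitter: treat every ';' as a statement boundary.
def naive_split(sql):
    return [q.strip() for q in sql.split(";") if q.strip()]

combined = "select 1; select 'a;b'; select $$x;y$$"
parts = naive_split(combined)

# The semicolons inside the quoted and dollar-quoted literals are wrongly
# treated as statement separators, mangling two of the three queries:
assert parts == ["select 1", "select 'a", "b'", "select $$x", "y$$"]
```

A correct splitter would need to understand single quotes, double quotes, dollar quoting, and comments - essentially a small copy of the psql lexer, which is the maintenance burden being objected to.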
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Jul 26, 2019 at 10:57 AM Jonathan S. Katz wrote: > > Hi, > > Before my reply, I wanted to say that I've been lurking on this thread > for a bit as I've tried to better inform myself on encryption at rest > and how it will apply to what we want to build. I actually built a > (poor) prototype in Python of the key management system that Joe & > Masahiko both laid out, in addition to performing some "buffer > encrpytion" with it. It's not worth sharing at this point. > > With the disclaimer that I'm not as familiar with a lot of concepts as I > would like to be: > > On 7/25/19 1:54 PM, Masahiko Sawada wrote: > > On Fri, Jul 26, 2019 at 2:18 AM Bruce Momjian wrote: > >> > >> On Thu, Jul 18, 2019 at 12:04:25PM +0900, Masahiko Sawada wrote: > >>> I've re-considered the design of TDE feature based on the discussion > >>> so far. The one of the main open question is the granular of > >>> encryption objects: cluster encryption or more-granular-than-cluster > >>> encryption. The followings describe about the new TDE design when we > >>> choose table-level encryption or something-new-group-level encryption. > >>> > >>> General > >>> > >>> We will use AES and support both AES-128 and AES-256. User can specify > >>> the new initdb option something like --aes-128 or --aes-256 to enable > >>> encryption and must specify --encryption-key-passphrase-command along > >>> with. (I guess we also require openssl library.) If these options are > >>> specified, we write the key length to the control file and derive the > >>> KEK and generate MDEK during initdb. wal_log_hints will be enabled > >>> automatically in encryption mode, like we do for checksum mode, > >> > >> Agreed. pg_control will store the none/AES128/AES256 indicator. > >> > >>> Key Management > >>> == > >>> We will use 3-tier key architecture as Joe proposed. > >>> > >>> 1. 
A master key encryption key (KEK): this is the key supplied by the > >>> database admin using something akin to ssl_passphrase_command > >>> > >>> 2. A master data encryption key (MDEK): this is a generated key using a > >>> cryptographically secure pseudo-random number generator. It is > >>> encrypted using the KEK, probably with Key Wrap (KW): > >>> or maybe better Key Wrap with Padding (KWP): > >>> > >>> 3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate > >>> table specific keys. > >> > >> What is the value of a per-table encryption key? How is HKDF derived? > > > > Per-table encryption key is derived from MDEK with salt and its OID as > > info. I think we can store salts for each encryption key into a > > separate file so that off-line tools also can read it. > > +1 with using the info/salt for the HKDF as described above. The other > decision will be the hashing algorithm to use. SHA-256? Yeah, SHA-256 would be better for safety. > > > >>> 3b. WAL data encryption keys (WDEK): Similarly use MDEK and a HKDF to > >>> generate new keys when needed for WAL. > >>> > >>> We store MDEK to the plain file (say global/pgkey) after encrypted > >>> with the KEK. I might want to store the hash of passphrase of the KEK > >>> in order to verify the correctness of the given passphrase. However we > >>> don't need to store TDEK and WDEK as we can derive them as needed. The > >>> key file can be read by both backend processes and front-end tools. > >> > >> Yes, we need to verify the pass phrase. > > Just to clarify, this would be a hash of the KEK? No, it's a hash of the passphrase. Or we might be able to use crypt(3) to verify the input passphrase. Apart from passing the passphrase, there are users who rather want to pass the key directly, for example when using external key management services. So it might be good if we provide both ways. > > From my experiments, the MDEK key unwrapping fails if you do not have > the correct KEK (as it should). 
If it's a matter of storing a hash of > the KEK, I'm not sure if there is much added benefit to have it, but I > would not necessarily oppose it either. > > >>> When the postmaster starts up, it reads the key file, decrypts the MDEK and > >>> derives the WDEK using the key id for WDEK. > > I don't know if this is getting too far ahead, but what happens if the > supplied KEK fails to decrypt the MDEK? Will postmaster refuse to start up? I think it should refuse to start up. It would not be able to operate properly without the correct keys, and refusing to start also protects against a possibly malicious user. > > >>> The WDEK is loaded into the key hash map > >>> (keyid -> key) in shared memory. Also we derive the TDEK as needed > >>> when reading tables or indexes and add it to the key hash map as well > >>> if it does not exist. > > +1 to this approach. > > >>> > >>> Buffer Encryption > >>> == > >>> We will use AES-CBC for buffer encryption. We will add a key id (4 bytes) > >> > >> I think we might want to use CTR for this, and will post after this. > > Not sure if I missed this post or not (as
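To make the key-derivation scheme discussed above concrete: the TDEK derivation (HKDF over the MDEK, with a per-key salt and the table OID as the "info" input, using SHA-256) can be sketched with a stdlib-only RFC 5869 HKDF. The 32-byte key length, salt handling, and big-endian OID encoding below are illustrative assumptions, not the patch's actual on-disk format:

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """RFC 5869 HKDF (extract-then-expand) using SHA-256."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()          # extract step
    okm, block = b"", b""
    for counter in range(1, -(-length // 32) + 1):              # expand step
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
    return okm[:length]


def derive_tdek(mdek: bytes, salt: bytes, table_oid: int) -> bytes:
    # Hypothetical encoding: the table OID, big-endian, as the HKDF "info",
    # so each table gets its own key derived from the single MDEK.
    return hkdf_sha256(mdek, salt, table_oid.to_bytes(4, "big"))
```

The derivation is deterministic, which is why TDEKs never need to be stored: as long as the MDEK and the salts survive, any table's key can be recomputed on demand.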
Re: Optimze usage of immutable functions as relation
23.07.2019 14:36, Anastasia Lubennikova : 08.07.2019 4:18, Thomas Munro: The July Commitfest is here. Could we please have a rebase of this patch? Updated patch is in attachments. I've only resolved one small cosmetic merge conflict. Later this week I'm going to send a more thoughtful review. Well, I looked through the patch and didn't find any issues, so I'll mark this ready for committer. Last implementation differs from the original one and is based on suggestion from this message: https://www.postgresql.org/message-id/2906.1542395026%40sss.pgh.pa.us It does eval_const_expressions() earlier and pulls up only Const functions. I also added a few more tests and comments. New version is in attachments. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 36fefd9..acc3776 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -1071,7 +1071,8 @@ preprocess_expression(PlannerInfo *root, Node *expr, int kind) expr = flatten_join_alias_vars(root->parse, expr); /* - * Simplify constant expressions. + * Simplify constant expressions. For function RTEs, this was already done + * by pullup_simple_subqueries. * * Note: an essential effect of this is to convert named-argument function * calls to positional notation and insert the current actual values of @@ -1085,7 +1086,8 @@ preprocess_expression(PlannerInfo *root, Node *expr, int kind) * careful to maintain AND/OR flatness --- that is, do not generate a tree * with AND directly under AND, nor OR directly under OR. */ - expr = eval_const_expressions(root, expr); + if (kind != EXPRKIND_RTFUNC && kind != EXPRKIND_RTFUNC_LATERAL) + expr = eval_const_expressions(root, expr); /* * If it's a qual or havingQual, canonicalize it. 
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c index 4fbc03f..f96b290 100644 --- a/src/backend/optimizer/prep/prepjointree.c +++ b/src/backend/optimizer/prep/prepjointree.c @@ -79,6 +79,9 @@ static Node *pull_up_simple_union_all(PlannerInfo *root, Node *jtnode, static void pull_up_union_leaf_queries(Node *setOp, PlannerInfo *root, int parentRTindex, Query *setOpQuery, int childRToffset); +static void transform_const_function_to_result(PlannerInfo *root, Node *jtnode, + RangeTblEntry *rte, + JoinExpr *lowest_nulling_outer_join); static void make_setop_translation_list(Query *query, Index newvarno, List **translated_vars); static bool is_simple_subquery(Query *subquery, RangeTblEntry *rte, @@ -754,6 +757,18 @@ pull_up_subqueries_recurse(PlannerInfo *root, Node *jtnode, is_simple_values(root, rte)) return pull_up_simple_values(root, jtnode, rte); + /* + * Or is it an immutable function that evaluated to a single Const? + */ + if (rte->rtekind == RTE_FUNCTION) + { + rte->functions = (List *) +eval_const_expressions(root, (Node *) rte->functions); + transform_const_function_to_result(root, jtnode, rte, + lowest_nulling_outer_join); + return jtnode; + } + /* Otherwise, do nothing at this node. */ } else if (IsA(jtnode, FromExpr)) @@ -1784,6 +1799,78 @@ is_simple_union_all_recurse(Node *setOp, Query *setOpQuery, List *colTypes) } /* + * If the function of this RTE_FUNCTION entry evaluated to a single Const + * after eval_const_expressions, transform it to RTE_RESULT. + */ +static void +transform_const_function_to_result(PlannerInfo *root, Node *jtnode, + RangeTblEntry *rte, + JoinExpr *lowest_nulling_outer_join) +{ + ListCell *lc; + pullup_replace_vars_context rvcontext; + RangeTblFunction *rtf = (RangeTblFunction *) linitial(rte->functions); + Query *parse = root->parse; + + if (!IsA(rtf->funcexpr, Const)) + return; + + /* Fail if the RTE has ORDINALITY - we don't implement that here. 
*/ + if (rte->funcordinality) + return; + + /* Fail if RTE isn't a single, simple Const expr */ + if (list_length(rte->functions) != 1) + return; + + rvcontext.targetlist = list_make1(makeTargetEntry((Expr *) rtf->funcexpr, + 1 /* resno */, NULL /* resname */, false /* resjunk */)); + rvcontext.root = root; + rvcontext.target_rte = rte; + + /* + * Since this function was reduced to Const, it can't really have lateral + * references, even if it's marked as LATERAL. This means we don't need + * to fill relids. + */ + rvcontext.relids = NULL; + + rvcontext.outer_hasSubLinks = &parse->hasSubLinks; + rvcontext.varno = ((RangeTblRef *) jtnode)->rtindex; + + /* + * If this RTE is on a nullable side of an outer join, we have to insert + * PHVs around our Consts so that they go to null when needed. + */ + rvcontext.need_phvs = lowest_nulling_outer_join != NULL; + + rvcontext.wrap_non_vars = false; + rvcontext.rv_cache = palloc0((list_length(rvcontext.targetlist) +
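The transformation in the patch above — run eval_const_expressions() on a function RTE and, when the whole expression collapses to a single Const, replace the function scan with that constant — can be illustrated on a toy expression tree. The node classes and the FUNCS catalog here are invented for the sketch; they are not the planner's real structures:

```python
from dataclasses import dataclass


@dataclass
class Const:
    value: object


@dataclass
class FuncCall:
    name: str
    args: list
    immutable: bool


# Toy "catalog" of callable functions, standing in for pg_proc lookups.
FUNCS = {"pi": lambda: 3.14159, "upper": lambda s: s.upper()}


def fold_constants(node):
    """Collapse an immutable call whose arguments all fold to Consts,
    the way eval_const_expressions() pre-evaluates immutable functions."""
    if isinstance(node, Const):
        return node
    args = [fold_constants(a) for a in node.args]
    if node.immutable and all(isinstance(a, Const) for a in args):
        return Const(FUNCS[node.name](*(a.value for a in args)))
    return FuncCall(node.name, args, node.immutable)
```

When fold_constants() returns a Const, the analogue of the patch's transform_const_function_to_result() can drop the function scan entirely and substitute the constant; a non-immutable call stays a FuncCall and must still be evaluated at run time.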
Re: Support for jsonpath .datetime() method
On 7/23/19 6:48 PM, Nikita Glukhov wrote: > Some concrete pieces of review: >> + >> +FF1 >> +decisecond (0-9) >> + >> >> Let's not use such weird terms as "deciseconds". We could say >> "fractional seconds, 1 digit" etc. or something like that. > And what about "tenths of seconds", "hundredths of seconds"? Yes, those are much better. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Support for jsonpath .datetime() method
On 7/24/19 4:25 PM, Peter Eisentraut wrote: > On 2019-07-24 00:48, Nikita Glukhov wrote: >> It seems that our YY works like RR should: >> >> SELECT to_date('69', 'YY'); >> to_date >> >> 2069-01-01 >> (1 row) >> >> SELECT to_date('70', 'YY'); >> to_date >> >> 1970-01-01 >> (1 row) >> >> But by the standard the first two digits of the current year should be used in YY. > Is this behavior even documented anywhere in our documentation? I > couldn't find it. What's the exact specification of what it does in > these cases? > >> So it's unclear what we should do: >> - implement YY and RR strictly following the standard only in .datetime() >> - fix YY implementation in to_date()/to_timestamp() and implement RR >> - use our non-standard templates in .datetime() > I think we definitely should try to use the same template system in both > the general functions and in .datetime(). Agreed. It's too hard to maintain otherwise. > This might involve some > compromises between existing behavior, Oracle behavior, SQL standard. > So far I'm not worried: If you're using two-digit years like above, > you're playing with fire anyway. Also some of the other cases like > dealing with trailing spaces are probably acceptable as slight > incompatibilities or extensions. My instinct would be to move as close as possible to the standard, especially if the current behaviour isn't documented. > > We should collect a list of test cases that illustrate the differences > and then work out how to deal with them. > Agreed. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
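The two-digit-year behaviors being compared can be pinned down as two small pivot rules. This is a sketch of the observed to_date() output versus a literal reading of the standard's YY, not of any actual PostgreSQL code:

```python
def postgres_yy(two_digits: int) -> int:
    """Observed to_date(.., 'YY') behavior: a fixed 1970 pivot,
    so 69 -> 2069 but 70 -> 1970 (which is how RR is supposed to act)."""
    return (2000 if two_digits < 70 else 1900) + two_digits


def standard_yy(two_digits: int, current_year: int) -> int:
    """SQL-standard YY: prefix the first two digits of the current year."""
    return (current_year // 100) * 100 + two_digits
```

With current_year=2019 the two rules disagree exactly on inputs 70..99 (standard YY yields 2070..2099, the observed behavior yields 1970..1999), which is the incompatibility the thread has to decide how to handle.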
Re: Contribution to Perldoc for TestLib module in Postgres
On 7/10/19 9:38 AM, Alvaro Herrera wrote: > On 2019-Apr-11, Iwata, Aya wrote: > >> In the above document, why not write a description after the function name? >> I think it is better to write the function name first and then the >> description. >> In your code; >> #Checks if all the tests passed or not >> all_tests_passing() >> >> In my suggestion; >> all_tests_passing() >> Checks if all the tests passed or not > Yeah, so there are two parts in the submitted patch: first the synopsis > lists the methods using this format you describe, and later the METHODS > section lists them again, using your suggested style. I think we should > do away with the long synopsis -- maybe keep it as just the "use > TestLib" line, and then let the METHODS section list and describe the > methods. > >> And some functions return a value. How about adding return information >> to the above doc? > That's already in the METHODS section for some of them. For example: > > all_tests_passing() > Returns 1 if all the tests pass. Otherwise returns 0 > > It's missing for others, such as "tempdir". > > In slurp_file you have this: > Opens the file provided as an argument to the function in read mode (as > indicated by <). > I think the parenthetical comment is useless; remove that. > > Please break long source lines (say to 78 chars -- make sure pgperltidy > agrees), and keep some spaces after sentence-ending periods and other > punctuation. > I've fixed the bitrot and some other infelicities in this patch. It's not committable yet but I think it's more reviewable. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm index 6195c21c59..ebb0eb82e7 100644 --- a/src/test/perl/TestLib.pm +++ b/src/test/perl/TestLib.pm @@ -1,9 +1,108 @@ -# TestLib, low-level routines and actions regression tests.
-# -# This module contains a set of routines dedicated to environment setup for -# a PostgreSQL regression test run and includes some low-level routines -# aimed at controlling command execution, logging and test functions. This -# module should never depend on any other PostgreSQL regression test modules. + +=pod + +=head1 NAME + +TestLib - module containing routines for environment setup, low-level routines +for command execution, logging and test functions + +=head1 SYNOPSIS + + use TestLib; + + # Checks if all the tests passed or not + all_tests_passing() + + # Creates a temporary directory + tempdir(prefix) + + # Creates a temporary directory outside the build tree for the + # Unix-domain socket + tempdir_short() + + # Returns the real directory for a virtual path directory under msys + real_dir(dir) + + # Runs the command which is passed as an argument + system_log() + + # Runs the command which is passed as an argument. On failure abandons + # all other tests + system_or_bail() + + # Runs the command which is passed as an argument. 
+ run_log() + + # Returns the output after running a command + run_command(dir) + + # Generate a string made of the given range of ASCII characters + generate_ascii_string(from_char, to_char) + + # Read the contents of the directory + slurp_dir(dir) + + # Read the contents of the file + slurp_file(filename) + + # Appends a string at the end of a given file + append_to_file(filename, str) + + # Check that all file/dir modes in a directory match the expected values + check_mode_recursive(dir, expected_dir_mode, expected_file_mode, + ignore_list) + + # Change mode recursively on a directory + chmod_recursive(dir, dir_mode, file_mode) + + # Check presence of a given regexp within pg_config.h + check_pg_config(regexp) + + # Test function to check that the command runs successfully + command_ok(cmd, test_name) + + # Test function to check that the command fails + command_fails(cmd, test_name) + + # Test function to check that the command exit code matches the + # expected exit code + command_exit_is(cmd, expected, test_name) + + # Test function to check that the command supports --help option + program_help_ok(cmd) + + # Test function to check that the command supports --version option + program_version_ok(cmd) + + # Test function to check that a command with invalid option returns + # non-zero exit code and error message + program_options_handling_ok(cmd) + + # Test function to check that the command runs successfully and the + # output matches the given regular expression + command_like(cmd, expected_stdout, test_name) + + #TODO + command_like_safe(cmd, expected_stdout, test_name) + + # Test function to check that the command fails and the error message + # matches the given regular expression + command_fails_like(cmd, expected_stderr, test_name) + + # Test function to run a command and check its status and outputs + command_checks_all(cmd, expected_ret, out,
Re: SQL:2011 PERIODS vs Postgres Ranges?
The patch requires a rebase on the master branch. The new status of this patch is: Waiting on Author
Re: psql - add SHOW_ALL_RESULTS option
Fabien COELHO wrote: > sh> /usr/bin/psql > psql (12beta2 ...) > fabien=# \set FETCH_COUNT 2 > fabien=# SELECT 1234 \; SELECT 5432 ; > fabien=# > > same thing with pg 11.4, and probably down to every version of postgres > since the feature was implemented... > > I think that fixing this should be a separate bug report and patch. I'll > try to look at it. That reminds me that it was already discussed in [1]. I should add the proposed fix to the next commitfest. [1] https://www.postgresql.org/message-id/flat/a0a854b6-563c-4a11-bf1c-d6c6f924004d%40manitou-mail.org Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Re: Hypothetical indexes using BRIN broken since pg10
On Fri, Jul 26, 2019 at 1:34 PM Heikki Linnakangas wrote: > > There seems to be consensus on going with the approach from the > original patch, so I had a closer look at it. > > The patch first does this: > > > > > /* > >* Obtain some data from the index itself, if possible. Otherwise > > invent > >* some plausible internal statistics based on the relation page > > count. > >*/ > > if (!index->hypothetical) > > { > > indexRel = index_open(index->indexoid, AccessShareLock); > > brinGetStats(indexRel, &statsData); > > index_close(indexRel, AccessShareLock); > > } > > else > > { > > /* > >* Assume default number of pages per range, and estimate the > > number > >* of ranges based on that. > >*/ > > indexRanges = Max(ceil((double) baserel->pages / > > > > BRIN_DEFAULT_PAGES_PER_RANGE), 1.0); > > > > statsData.pagesPerRange = BRIN_DEFAULT_PAGES_PER_RANGE; > > statsData.revmapNumPages = (indexRanges / > > REVMAP_PAGE_MAXITEMS) + 1; > > } > > ... > > And later in the function, there's this: > > > /* work out the actual number of ranges in the index */ > > indexRanges = Max(ceil((double) baserel->pages / > > statsData.pagesPerRange), > > 1.0); > > It seems a bit error-prone that essentially the same formula is used > twice in the function, to compute 'indexRanges', with some distance > between them. Perhaps some refactoring would help with that, although I'm not > sure what exactly would be better. Maybe move the second computation > earlier in the function, like in the attached patch? I had the same thought without a great idea on how to refactor it. I'm fine with the one in this patch. > The patch assumes the default pages_per_range setting, but looking at > the code at https://github.com/HypoPG/hypopg, the extension actually > takes pages_per_range into account when it estimates the index size. I > guess there's no easy way to pass the pages_per_range setting down to > brincostestimate().
I'm not sure what we should do about that, but it seems > that just using BRIN_DEFAULT_PAGES_PER_RANGE here is not very accurate. Yes, hypopg can use a custom pages_per_range as it intercepts it when the hypothetical index is created. I didn't find any way to get that information in brincostestimate(), especially for something that could be backpatched. I don't like it, but I don't see how to do better than just using BRIN_DEFAULT_PAGES_PER_RANGE :( > The attached patch is based on PG v11, because I tested this with > https://github.com/HypoPG/hypopg, and it didn't compile with later > versions. There's a small difference in the locking level used between > v11 and 12, which makes the patch not apply across versions, but that's > easy to fix by hand. FTR I created a REL_1_STABLE branch for hypopg which is compatible with pg12 (it's already used for debian packages), as master is still in beta and v12 compatibility is still being worked on.
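For reference, the fallback path for a hypothetical BRIN index boils down to two formulas, mirrored here in Python. The revmap_items_per_page default below is a rough stand-in, not the real block-size-derived REVMAP_PAGE_MAXITEMS:

```python
import math


def hypothetical_brin_stats(heap_pages: int,
                            pages_per_range: int = 128,
                            revmap_items_per_page: int = 1360):
    """Estimate range and revmap page counts for a hypothetical BRIN index
    from the heap's page count alone, assuming the default pages_per_range
    (BRIN_DEFAULT_PAGES_PER_RANGE is 128)."""
    index_ranges = max(math.ceil(heap_pages / pages_per_range), 1)
    revmap_pages = index_ranges // revmap_items_per_page + 1
    return index_ranges, revmap_pages
```

This also shows why a hypopg-style custom pages_per_range matters: a 640000-page heap has 5000 ranges at the default setting but 80000 ranges at pages_per_range=8, a 16x difference in the estimated index size.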
Re: Hypothetical indexes using BRIN broken since pg10
On 27/06/2019 23:09, Alvaro Herrera wrote: On 2019-Jun-27, Tom Lane wrote: Dunno, I just can't get excited about exposing REVMAP_PAGE_MAXITEMS. Especially not since we seem to agree on the long-term solution here, and what you're suggesting to Julien doesn't particularly fit into that long-term solution. Well, it was brin_page.h, which is supposed to be internal to BRIN itself. But since we admit that in its current state selfuncs.c is not salvageable as a module and we'll redo the whole thing in the short term, I withdraw my comment. There seems to be consensus on going with the approach from the original patch, so I had a closer look at it. The patch first does this: /* * Obtain some data from the index itself, if possible. Otherwise invent * some plausible internal statistics based on the relation page count. */ if (!index->hypothetical) { indexRel = index_open(index->indexoid, AccessShareLock); brinGetStats(indexRel, &statsData); index_close(indexRel, AccessShareLock); } else { /* * Assume default number of pages per range, and estimate the number * of ranges based on that. */ indexRanges = Max(ceil((double) baserel->pages / BRIN_DEFAULT_PAGES_PER_RANGE), 1.0); statsData.pagesPerRange = BRIN_DEFAULT_PAGES_PER_RANGE; statsData.revmapNumPages = (indexRanges / REVMAP_PAGE_MAXITEMS) + 1; } ... And later in the function, there's this: /* work out the actual number of ranges in the index */ indexRanges = Max(ceil((double) baserel->pages / statsData.pagesPerRange), 1.0); It seems a bit error-prone that essentially the same formula is used twice in the function, to compute 'indexRanges', with some distance between them. Perhaps some refactoring would help with that, although I'm not sure what exactly would be better. Maybe move the second computation earlier in the function, like in the attached patch?
The patch assumes the default pages_per_range setting, but looking at the code at https://github.com/HypoPG/hypopg, the extension actually takes pages_per_range into account when it estimates the index size. I guess there's no easy way to pass the pages_per_range setting down to brincostestimate(). I'm not sure what we should do about that, but it seems that just using BRIN_DEFAULT_PAGES_PER_RANGE here is not very accurate. The attached patch is based on PG v11, because I tested this with https://github.com/HypoPG/hypopg, and it didn't compile with later versions. There's a small difference in the locking level used between v11 and 12, which makes the patch not apply across versions, but that's easy to fix by hand. - Heikki diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 7ac6d2b339..0a2e7898cc 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -102,6 +102,7 @@ #include <math.h> #include "access/brin.h" +#include "access/brin_page.h" #include "access/gin.h" #include "access/htup_details.h" #include "access/sysattr.h" @@ -8079,11 +8080,31 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, &spc_seq_page_cost); /* - * Obtain some data from the index itself. + * Obtain some data from the index itself, if possible. Otherwise invent + * some plausible internal statistics based on the relation page count. */ - indexRel = index_open(index->indexoid, AccessShareLock); - brinGetStats(indexRel, &statsData); - index_close(indexRel, AccessShareLock); + if (!index->hypothetical) + { + indexRel = index_open(index->indexoid, AccessShareLock); + brinGetStats(indexRel, &statsData); + index_close(indexRel, AccessShareLock); + + /* work out the actual number of ranges in the index */ + indexRanges = Max(ceil((double) baserel->pages / + statsData.pagesPerRange), 1.0); + } + else + { + /* + * Assume default number of pages per range, and estimate the number + * of ranges based on that.
+ */ + indexRanges = Max(ceil((double) baserel->pages / + BRIN_DEFAULT_PAGES_PER_RANGE), 1.0); + + statsData.pagesPerRange = BRIN_DEFAULT_PAGES_PER_RANGE; + statsData.revmapNumPages = (indexRanges / REVMAP_PAGE_MAXITEMS) + 1; + } /* * Compute index correlation @@ -8184,10 +8205,6 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, baserel->relid, JOIN_INNER, NULL); - /* work out the actual number of ranges in the index */ - indexRanges = Max(ceil((double) baserel->pages / statsData.pagesPerRange), - 1.0); - /* * Now calculate the minimum possible ranges we could match with if all of * the rows were in the perfect order in the table's heap.
Re: make libpq documentation navigable between functions
Hello Peter, I have committed this with some additions. Thanks for the push. It was really a pain to write a small libpq app without navigation. Also, due to some mysterious problems with the PDF toolchain I had to remove some links. Your script would find those, so I won't list them here. If you put those back in, the PDF build breaks. If you want to work out why, feel free to submit more patches. Otherwise I'm happy to leave it as is now; it's very useful. Ok fine with me for now as well. I'm not keen to invest more time on the documentation tool chain. -- Fabien.
psql FETCH_COUNT feature does not work with combined queries
Hello devs, As pointed out by Kyotaro Horiguchi in https://www.postgresql.org/message-id/20190726.131704.86173346.horikyota@gmail.com FETCH_COUNT does not work with combined queries, and probably has never worked since 2006. What seems to happen is that ExecQueryUsingCursor is hardcoded to handle one simple query. It simply inserts the cursor declaration in front of the query believed to be a select: DECLARE _psql_cursor NO SCROLL CURSOR FOR <query> ; For combined queries, say two selects, it results in: DECLARE _psql_cursor NO SCROLL CURSOR FOR <query 1> \; <query 2> ; Then PQexec returns the result of the second one, and nothing is printed. However, if the second query is not a select, eg: "select ... \; update ... ;", the result of the *first* query is shown. How fun! This is because PQexec returns the second result. The cursor declaration expects a PGRES_COMMAND_OK before proceeding. With a select it gets PGRES_TUPLES_OK, so it decides it is an error and silently skips to the end. With the update it indeed obtains the expected PGRES_COMMAND_OK, not really for the command it sent but who cares, and proceeds to show the cursor results. Basically, the whole logic is broken. The minimum is to document that it does not work properly with combined queries. The attached patch does that, so that the bug becomes a documented bug, aka a feature:-) Otherwise, psql's lexer could probably detect, with some effort, that it is a combined query (detect embedded ; and check that they are not empty queries), so that it could skip the feature if that is the case. Another approach would be to try to detect reliably whether the returned result does not correspond to the cursor one. Maybe some result counting could be added somewhere so that the number of results under PQexec is accessible to the user, i.e. the result struct would contain its own number. Hmmm. A more complex approach would be to keep the position of embedded queries, and to insert cursor declarations where needed, currently the last one if it is a SELECT.
However, for the previous ones the allocation and such could be prohibitive as no cursor would be used. Not sure it is worth the effort as the bug has not been detected for 13 years. -- Fabien.diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 7789fc6177..d217d82f57 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -3684,6 +3684,12 @@ bar fail after having already displayed some rows. + + +This feature does not work properly with combined queries. + + + Although you can use any output format with this feature,
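The "detect embedded ;" idea above can be sketched as a toy scanner that counts statement separators outside quoted strings; a real implementation would live in psql's lexer and would also have to handle dollar quoting, doubled quotes, and comments. This is only a rough illustration of the check, not psql code:

```python
def count_statements(sql: str) -> int:
    """Count non-empty statements in one query buffer: a separator only
    counts if something other than whitespace follows it, so a plain
    trailing ';' does not make the query 'combined'."""
    count, in_quote = 1, None
    for i, ch in enumerate(sql):
        if in_quote:
            if ch == in_quote:
                in_quote = None          # closing quote
        elif ch in ("'", '"'):
            in_quote = ch                # opening quote
        elif ch == ";" and sql[i + 1:].strip():
            count += 1                   # separator with content after it
    return count
```

psql could then skip the cursor-based FETCH_COUNT machinery whenever such a check reports more than one statement, instead of silently misbehaving.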
Re: make libpq documentation navigable between functions
On 2019-07-22 22:56, Fabien COELHO wrote: > Attached script does, hopefully, the expected transformation. It adds ids > to occurrences when the id is not defined elsewhere. > > Attached v3 is the result of applying your kindly provided xslt patch plus > the script on "libpq.sgml". > > Three functions are ignored because no documentation is found: > PQerrorField (does not exist anywhere in the sources), > PQsetResultInstanceData (idem) and PQregisterThreadLock (it exists). > > Doc build works for me and looks ok. I have committed this with some additions. I have changed all the function-related id attributes in libpq.sgml to be mixed case, for easier readability. So if you run your script again, you can omit the lc() call. I also needed to make some changes to the markup in some places to remove extra whitespace that would have appeared in the generated link. (This was already happening in some places, but your patch would have repeated it in many places.) Also, due to some mysterious problems with the PDF toolchain I had to remove some links. Your script would find those, so I won't list them here. If you put those back in, the PDF build breaks. If you want to work out why, feel free to submit more patches. Otherwise I'm happy to leave it as is now; it's very useful. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jul 24, 2019 at 10:00 AM Dilip Kumar wrote: > > On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar wrote: > > > Please find my review comments for > 0013-Allow-foreground-transactions-to-perform-undo-action > > > + * We can't postpone applying undo actions for subtransactions as the > + * modifications made by aborted subtransaction must not be visible even if > + * the main transaction commits. > + */ > + if (IsSubTransaction()) > + return; > > I am not completely sure but is it possible that the outer function > CommitTransactionCommand/AbortCurrentTransaction can avoid > calling this function in the switch case based on the current state, > so that under subtransaction this will never be called? > We can do that and also can have an additional check similar to "if (!s->performUndoActions)", but such a check would have to be added at all places from which this function is called. I feel that will make the code less readable in many places. > > + bool undo_req_pushed[UndoLogCategories]; /* undo request pushed > + * to worker? */ > + bool performUndoActions; > + > struct TransactionStateData *parent; /* back link to parent */ > > We must have some comments to explain how performUndoActions is used, > where it's set. If it's explained somewhere else then we can > give reference to that code. > I am planning to remove this variable in the next version and have an explicit check as we have in UndoActionsRequired. I agree with your other comments and will address them in the next version of the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
COPY command on a table column marked as GENERATED ALWAYS
Hi All, I'm able to insert data into a table column marked as GENERATED ALWAYS using the COPY command; however, it fails with the INSERT command. Isn't that a bug with the COPY command? Here is the test case for more clarity.

postgres=# create table tab_always (i int generated always as identity, j int);
CREATE TABLE
postgres=# insert into tab_always values(1, 10);
ERROR: cannot insert into column "i"
DETAIL: Column "i" is an identity column defined as GENERATED ALWAYS.
HINT: Use OVERRIDING SYSTEM VALUE to override.

[ashu@localhost bin]$ cat /tmp/always.csv
13	10
14	20
15	30
16	40

postgres=# copy tab_always from '/tmp/always.csv';
COPY 4
postgres=# select * from tab_always;
 i  | j
----+----
 13 | 10
 14 | 20
 15 | 30
 16 | 40
(4 rows)

-- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: pg_walfile_name_offset can return inconsistent values
On Fri, 26 Jul 2019 17:21:20 +0900 (Tokyo Standard Time) Kyotaro Horiguchi wrote:
> Hello.
>
> While looking at [1], I noticed that pg_walfile_name_offset behaves
> somewhat oddly at segment boundaries.
>
> select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
>   t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
>     lsn     |        file_name         | file_offset
> ------------+--------------------------+-------------
>  0/16ffffff | 000000020000000000000016 |    16777215
>  0/17000000 | 000000020000000000000016 |           0
>  0/17000001 | 000000020000000000000017 |           1
>
> The file names are right as defined, but the return value on the
> second line is wrong, or at least misleading.

+1 I noticed it as well and put this report on hold while working on my patch. Thanks for reporting this!

> It should be (16, 1000000) or (16, FFFFFF). The former is out-of-domain so we
> would have no choice other than the latter. I'm not sure of the purpose of
> the second output parameter, thus the former might be the right
> decision.
>
> The function returns the following result after this patch is
> applied.
>
> select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
>   t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
>     lsn     |        file_name         | file_offset
> ------------+--------------------------+-------------
>  0/16ffffff | 000000020000000000000016 |    16777214
>  0/17000000 | 000000020000000000000016 |    16777215
>  0/17000001 | 000000020000000000000017 |           0

So you shift the file offset for all LSNs by one byte? This could lead to regressions in various tools relying on this function. Moreover, it looks weird as the LSN doesn't reflect the given offset anymore (FFFFFF <> 16777214, 01 <> 0, etc). Another solution might be to return the same result for both 0/16ffffff and 0/17000000, but it doesn't feel right either. So in fact, returning 0x1000000 seems to be the cleaner result to me. Regards,
pg_walfile_name_offset can return inconsistent values
Hello. While looking at [1], I noticed that pg_walfile_name_offset behaves somewhat oddly at segment boundaries.

select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
  t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
    lsn     |        file_name         | file_offset
------------+--------------------------+-------------
 0/16ffffff | 000000020000000000000016 |    16777215
 0/17000000 | 000000020000000000000016 |           0
 0/17000001 | 000000020000000000000017 |           1

The file names are right as defined, but the return value on the second line is wrong, or at least misleading. It should be (16, 1000000) or (16, FFFFFF). The former is out-of-domain so we would have no choice other than the latter. I'm not sure of the purpose of the second output parameter, thus the former might be the right decision. The function returns the following result after this patch is applied.

select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
  t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
    lsn     |        file_name         | file_offset
------------+--------------------------+-------------
 0/16ffffff | 000000020000000000000016 |    16777214
 0/17000000 | 000000020000000000000016 |    16777215
 0/17000001 | 000000020000000000000017 |           0

regards. [1]: https://www.postgresql.org/message-id/20190725193808.1648ddc8@firost -- Kyotaro Horiguchi NTT Open Source Software Center >From ca9e174f53ac4d513163cbe8201746c8d3d2eb62 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 26 Jul 2019 17:12:24 +0900 Subject: [PATCH] Fix offset of pg_walfile_name_offset. The function returns the name of the previous segment with the offset of the given location at segment boundaries. For example, ("...16", 0) is returned for '0/17000000'. That is inconsistent, or at least confusing. This patch changes the function to return the given LSN - 1 as the offset.
--- src/backend/access/transam/xlogfuncs.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index b35043bf71..79ea0495b4 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -447,6 +447,8 @@ pg_last_wal_replay_lsn(PG_FUNCTION_ARGS) * Note that a location exactly at a segment boundary is taken to be in * the previous segment. This is usually the right thing, since the * expected usage is to determine which xlog file(s) are ready to archive. + * To be consistent to filename, returns the offset one byte before the given + * location as offset. */ Datum pg_walfile_name_offset(PG_FUNCTION_ARGS) @@ -480,10 +482,16 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS) resultTupleDesc = BlessTupleDesc(resultTupleDesc); + /* + * We assume the given location point as one-byte after the location + * really wanted. + */ + locationpoint--; + /* * xlogfilename */ - XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size); + XLByteToSeg(locationpoint, xlogsegno, wal_segment_size); XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size); values[0] = CStringGetTextDatum(xlogfilename); -- 2.16.3
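The inconsistency being fixed is easiest to see by writing out what the pre-patch code computes: the file name uses XLByteToPrevSeg, i.e. segno = (lsn - 1) / segment_size, while the offset is an independent lsn % segment_size. A Python sketch of that logic, assuming the default 16 MB segment size and (for illustration) timeline 2:

```python
WAL_SEG_SIZE = 16 * 1024 * 1024                # default wal_segment_size
SEGS_PER_XLOGID = 0x100000000 // WAL_SEG_SIZE  # 256 segments per "xlogid"


def walfile_name_offset(tli: int, lsn: int):
    """Mimic pre-patch pg_walfile_name_offset: a boundary LSN is attributed
    to the *previous* segment (XLByteToPrevSeg), but the offset is computed
    independently as lsn % segment size."""
    segno = (lsn - 1) // WAL_SEG_SIZE
    name = "%08X%08X%08X" % (tli, segno // SEGS_PER_XLOGID,
                             segno % SEGS_PER_XLOGID)
    return name, lsn % WAL_SEG_SIZE
```

At a boundary LSN this pairs the segment-16 file name with offset 0, even though offset 0 of that file actually corresponds to the start of segment 16; that mismatched pair is exactly what the report and the patch are about.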
Re: psql - add SHOW_ALL_RESULTS option
Hello Kyotaro-san,

Attached a v2 for the always-show-all-results variant. Thanks for the debug!

> I have some comments on this patch. I'm +1 for always output all results
> without having knobs.

That makes 4 opinions expressed towards this change of behavior, and none
against.

> Documentation (psql-ref.sgml) has another place that needs the same
> amendment.

Indeed.

> Looking the output for -t, -0, -A or something like, we might need to
> introduce result-set separator.

Yep, possibly. I'm not sure this is material for this patch, though.

> # -eH looks broken for me but it would be another issue.

It seems to work for me. Could you be more precise about how it is broken?

> Valid setting of FETCH_COUNT disables this feature. I think it is
> unwanted behavior.

Yes and no: this behavior (bug, really) is pre-existing, FETCH_COUNT does
not work with combined queries:

  sh> /usr/bin/psql
  psql (12beta2 ...)
  fabien=# \set FETCH_COUNT 2
  fabien=# SELECT 1234 \; SELECT 5432 ;
  fabien=#

same thing with pg 11.4, and probably down to every version of postgres
since the feature was implemented... I think that fixing this should be a
separate bug report and patch. I'll try to look at it.

Thanks for the feedback. Attached v3 with further documentation updates.

-- 
Fabien.

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 7789fc6177..4e6ab5a0a5 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -127,18 +127,11 @@ echo '\x \\ SELECT * FROM foo;' | psql
     commands included in the string to divide it into multiple
     transactions. (See for more details about how the server handles
     multi-query strings.)
-    Also, psql only prints the
-    result of the last SQL command in the string.
-    This is different from the behavior when the same string is read from
-    a file or fed to psql's standard input,
-    because then psql sends
-    each SQL command separately.
-    Because of this behavior, putting more than one SQL command in a
-    single -c string often has unexpected results.
-    It's better to use repeated -c commands or feed
-    multiple commands to psql's standard input,
+    If having several commands executed in one transaction is not desired,
+    use repeated -c commands or feed multiple commands to
+    psql's standard input,
     either using echo as illustrated above, or
     via a shell here-document, for example:
@@ -3355,10 +3348,8 @@ select 1\; select 2\; select 3;
     commands included in the string to divide it into multiple
     transactions. (See for more details about how the server handles
     multi-query strings.)
-    psql prints only the last query result
-    it receives for each request; in this example, although all
-    three SELECTs are indeed executed, psql
-    only prints the 3.
+    psql prints all results it receives, one
+    after the other.
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 44a782478d..4534c45854 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -472,6 +482,16 @@ ResetCancelConn(void)
 #endif
 }
 
+static void
+ShowErrorMessage(const PGresult *result)
+{
+	const char *error = PQerrorMessage(pset.db);
+
+	if (strlen(error))
+		pg_log_info("%s", error);
+
+	CheckConnection();
+}
 
 /*
  * AcceptResult
@@ -482,7 +492,7 @@
  * Returns true for valid result, false for error state.
  */
 static bool
-AcceptResult(const PGresult *result)
+AcceptResult(const PGresult *result, bool show_error)
 {
 	bool		OK;
 
@@ -513,15 +523,8 @@ AcceptResult(const PGresult *result)
 			break;
 	}
 
-	if (!OK)
-	{
-		const char *error = PQerrorMessage(pset.db);
-
-		if (strlen(error))
-			pg_log_info("%s", error);
-
-		CheckConnection();
-	}
+	if (!OK && show_error)
+		ShowErrorMessage(result);
 
 	return OK;
 }
@@ -701,7 +704,7 @@ PSQLexec(const char *query)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		res = NULL;
@@ -743,7 +746,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		return 0;
@@ -999,199 +1002,113 @@ loop_exit:
 	return success;
 }
 
-
 /*
- * ProcessResult: utility function for use by SendQuery() only
- *
- * When our command string contained a COPY FROM STDIN or COPY TO STDOUT,
- * PQexec() has stopped at the PGresult associated with the first such
- * command. In that event, we'll marshal data for the COPY and then cycle
- * through any subsequent PGresult objects.
- *
- * When the command string contained no such COPY command, this function
- * degenerates to an AcceptResult() call.
- *
- * Changes its argument to point to the last PGresult of the command string,
- * or NULL if
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 26, 2019 at 10:53:03AM +0300, Sergei Kornilov wrote:
> Explicit is better than implicit, so I am +1 to commit both patches.

Hence my count is incorrect:
- Forbid --jobs and --index: Michael P, Sergei K.
- Enforce --jobs=1 with --index: Julien R.
- Have no restrictions: 0.
--
Michael
Re: Fetching timeline during recovery
On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time) Kyotaro Horiguchi
wrote:

> Hi.
>
> At Thu, 25 Jul 2019 19:38:08 +0200, Jehan-Guillaume de Rorthais wrote
> in <20190725193808.1648ddc8@firost>
> > On Wed, 24 Jul 2019 14:33:27 +0200
> > Jehan-Guillaume de Rorthais wrote:
> > > On Wed, 24 Jul 2019 09:49:05 +0900
> > > Michael Paquier wrote:
> > > > On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
> > > > wrote:
> [...]
> > > > I think that there are arguments for being more flexible with it, and
> > > > perhaps have a system-level view to be able to look at some of its
> > > > fields.
> > >
> > > Great idea. I'll give it a try to keep the discussion on.
> >
> > After some thinking, I did not find enough data to expose to justify the
> > creation of a system-level view. As I just need the current timeline, I
> > wrote "pg_current_timeline()". Please find the patch in attachment.
> >
> > The current behavior is quite simple:
> > * if the cluster is in production, return ThisTimeLineID
> > * else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)
> >
> > This is a really naive implementation. We should probably add some code
> > around the startup process to gather and share general recovery stats.
> > This would allow fetching e.g. the current recovery method, the latest
> > xlog file name restored from archives or streaming, its timeline, etc.
> >
> > Any thoughts?
>
> If replay is delayed behind a timeline switch point, the replay LSN and
> the receive/write/flush LSNs are on different timelines. When the replica
> has not yet reached the new timeline to which an already-received file
> belongs, the function returns a wrong file name, specifically a name
> consisting of the latest segment number and the older timeline, where the
> segment doesn't belong.

Indeed.

> We have an LSN reporting function each for several objectives.
> pg_current_wal_lsn
> pg_current_wal_insert_lsn
> pg_current_wal_flush_lsn
> pg_last_wal_receive_lsn
> pg_last_wal_replay_lsn

Yes. In fact, my current implementation might be split as:

  pg_current_wal_tl: returns the TL on a production cluster
  pg_last_wal_received_tl: returns the last received TL on a standby

If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl
and *flush_tl would be useful, as a cluster in production is not supposed
to change its timeline during its lifetime.

> But, I'm not sure just adding further pg_last_*_timeline() to
> this list is a good thing..

I think this is a much better idea than mixing different cases (production
and standby) in the same function, as I did. Moreover, it's much more
coherent with the other existing functions.

> The function returns NULL for NULL input (STRICT behavior) but
> returns (NULL, NULL) for undefined timeline. I don't think the
> difference is meaningful.

Unless I'm missing something, nothing returns "(NULL, NULL)" in
0001-v1-Add-function-pg_current_timeline.patch.

Thank you for your feedback!
Re: block-level incremental backup
On Fri, Jul 26, 2019 at 11:21 AM Jeevan Ladhe wrote:
> Hi Vignesh,
>
> Please find my comments inline below:
>
>> 1) If the relation file has changed due to truncate or vacuum,
>> during incremental backup the new files will be copied.
>> There are chances that both the old file and the new file
>> will be present. I'm not sure if cleaning up of the
>> old file is handled.
>
> When an incremental backup is taken, it either copies the file in its
> entirety if the file has changed more than 90%, or writes a .partial file
> with a changed-block bitmap and the actual data. For files that are
> unchanged, it writes 0 data bytes but still creates a .partial file, so
> there is a .partial file for every file that is to be looked up in the
> full backup.
> While composing a synthetic backup from an incremental backup, the
> pg_combinebackup tool will only look for those relation files in the
> full (parent) backup which have .partial files in the incremental backup.
> So, if a vacuum/truncate happened between the full and incremental
> backups, the incremental backup image will not have a 0-length .partial
> file for that relation, and so the synthetic backup restored using
> pg_combinebackup will not have that file either.

Thanks Jeevan for the update, I feel this logic is good. It will handle
the case of deleting the old relation files.

>> 2) Just a small thought on building the bitmap:
>> can the bitmap be built and maintained as
>> and when the changes are happening in the system?
>> If we are building the bitmap while doing the incremental backup,
>> scanning through each file might take more time.
>> This can be a configurable parameter; the system can run
>> without capturing this information by default, but those
>> who take incremental backups frequently can enable this
>> configuration, which would track the modified blocks.
>
> IIUC, this will need changes in the backend.
> Honestly, I think backup is a maintenance task, and hampering the backend
> for this does not look like a good idea. But, having said that, even if
> we have to provide this as a switch for some users, it will need a
> different infrastructure than what we are building here for constructing
> the bitmap, where we scan all the files one by one. Maybe for the initial
> version we can go with the current proposal that Robert has suggested,
> and add this switch at a later point as an enhancement.

That sounds fair to me.

Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com
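The per-file decision Jeevan describes can be sketched as follows. The 90% threshold, the .partial changed-block bitmap, and the 0-length .partial marker for unchanged files are taken from the thread; the function name and bitmap representation are a hypothetical illustration, not the actual patch code:

```python
def plan_file_copy(total_blocks, changed_blocks, threshold=0.9):
    """Decide how one relation file is written into an incremental backup."""
    changed = len(changed_blocks)
    if total_blocks and changed / total_blocks > threshold:
        return ("full", None)  # copy the file in its entirety
    # otherwise emit a .partial file: a changed-block bitmap plus the data
    # of only those blocks (0 data bytes if nothing changed at all)
    bitmap = [blkno in changed_blocks for blkno in range(total_blocks)]
    return ("partial", bitmap)

print(plan_file_copy(100, set(range(95)))[0])  # full
print(plan_file_copy(100, {1, 7})[0])          # partial
print(plan_file_copy(100, set())[0])           # partial
```

The key property for pg_combinebackup is the last case: an unchanged file still yields a (0-data-byte) .partial entry, so the absence of any .partial file signals that the relation no longer exists.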
Re: Add parallelism and glibc dependent only options to reindexdb
Hi

> So here we go:
> - 0001 is your original thing, with --jobs enforced to 1 for the index
>   part.
> - 0002 is my addition to forbid --index with --jobs.
>
> I am fine to be outvoted regarding 0002, and it is the case based on
> the state of this thread with 2:1. We could always revisit this
> decision in this development cycle anyway.

Explicit is better than implicit, so I am +1 to commit both patches.

regards, Sergei
Re: Fetching timeline during recovery
Hi.

At Thu, 25 Jul 2019 19:38:08 +0200, Jehan-Guillaume de Rorthais wrote in
<20190725193808.1648ddc8@firost>
> On Wed, 24 Jul 2019 14:33:27 +0200
> Jehan-Guillaume de Rorthais wrote:
> > On Wed, 24 Jul 2019 09:49:05 +0900
> > Michael Paquier wrote:
> > > On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
> > > wrote:
[...]
> > > I think that there are arguments for being more flexible with it, and
> > > perhaps have a system-level view to be able to look at some of its
> > > fields.
> >
> > Great idea. I'll give it a try to keep the discussion on.
>
> After some thinking, I did not find enough data to expose to justify the
> creation of a system-level view. As I just need the current timeline, I
> wrote "pg_current_timeline()". Please find the patch in attachment.
>
> The current behavior is quite simple:
> * if the cluster is in production, return ThisTimeLineID
> * else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)
>
> This is a really naive implementation. We should probably add some code
> around the startup process to gather and share general recovery stats.
> This would allow fetching e.g. the current recovery method, the latest
> xlog file name restored from archives or streaming, its timeline, etc.
>
> Any thoughts?

If replay is delayed behind a timeline switch point, the replay LSN and
the receive/write/flush LSNs are on different timelines. When the replica
has not yet reached the new timeline to which an already-received file
belongs, the function returns a wrong file name, specifically a name
consisting of the latest segment number and the older timeline, where the
segment doesn't belong.

We have an LSN reporting function each for several objectives:

  pg_current_wal_lsn
  pg_current_wal_insert_lsn
  pg_current_wal_flush_lsn
  pg_last_wal_receive_lsn
  pg_last_wal_replay_lsn

But I'm not sure just adding further pg_last_*_timeline() functions to
this list is a good thing..
The function returns NULL for NULL input (STRICT behavior) but returns
(NULL, NULL) for an undefined timeline. I don't think the difference is
meaningful.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
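Kyotaro's point is that a WAL file name binds a timeline to a segment number, so pairing the newest segment number with a stale timeline names a file that was never written. A rough Python rendering of the XLogFileName() formula (illustrative only) makes the composition explicit:

```python
def xlog_file_name(tli, segno, wal_segment_size=16 * 1024 * 1024):
    # 24 hex digits: 8 for the timeline, then the segment number split
    # into the "xlogid" part and the segment-within-xlogid part
    segs_per_xlogid = 0x100000000 // wal_segment_size
    return "%08X%08X%08X" % (tli, segno // segs_per_xlogid,
                             segno % segs_per_xlogid)

# same segment number, different timeline => an entirely different file
print(xlog_file_name(1, 0x16))  # 000000010000000000000016
print(xlog_file_name(2, 0x16))  # 000000020000000000000016
```

So a timeline-reporting function must return a timeline that actually matches the LSN it is paired with (replayed vs. received), not just the latest one known.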
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 26, 2019 at 09:36:32AM +0200, Julien Rouhaud wrote:
> I see that you iterate over the SimpleStringList after it's generated.
> Why not compute that while building it in get_parallel_object_list
> (and keep the provided table list count) instead?

Yeah. I was hesitating between doing that, or just breaking out of the
counting loop if there are more objects than concurrent jobs, but that's
less intuitive.
--
Michael
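Michael's "break out of the counting loop" idea can be sketched like this; Python is used for illustration and the names are hypothetical, not the actual reindexdb code. The slot count is min(concurrentCons, number of objects), and counting can stop as soon as the requested number of jobs is covered:

```python
def parallel_slots_needed(concurrent_cons, object_list):
    # Count list items only until we know the requested number of jobs is
    # already covered; allocating more slots than objects would be wasted.
    count = 0
    for _ in object_list:
        count += 1
        if count >= concurrent_cons:
            return concurrent_cons
    return max(count, 1)  # at least one slot, even for an empty list

print(parallel_slots_needed(8, ["t1", "t2", "t3"]))  # 3
print(parallel_slots_needed(2, ["t1", "t2", "t3"]))  # 2
```

Counting while building the list, as Julien suggests, achieves the same result without the second pass.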
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 26, 2019 at 5:27 AM Michael Paquier wrote:
>
> On Thu, Jul 25, 2019 at 01:00:34PM +0200, Julien Rouhaud wrote:
> > The problem is that a user doing something like:
> >
> > reindexdb -j48 -i some_index -S s1 -S s2 -S s3
> >
> > will probably be disappointed to learn that he has to run a specific
> > command for the index(es) that should be reindexed. Maybe we can
> > issue a warning that parallelism isn't used when an index list is
> > processed and the user asked for multiple jobs?
>
> Arguments go in both directions, as some other users may be surprised
> by the performance of index processing when serialization is enforced.

Sure, but there is no easy solution in that case, as you'd have to do all
the work of spawning multiple reindexdb processes according to the
underlying tables, so probably what will happen here is that there'll just
be two simple calls to reindexdb: one for the indexes, serialized anyway,
and one for everything else. My vote is still to allow it, possibly
emitting a notice or a warning.

> > I don't send a new patch since the --index wanted behavior is not
> > clear yet.
>
> So I am sending one patch (actually two) after a closer review that I
> have spent time shaping into a committable state. And for this part I
> have another suggestion, which is to use a switch/case without a default,
> so that any newly-added object type would make somebody think about
> those code paths, as this would generate compiler warnings.

Thanks for that! I'm fine with using a switch to avoid future bad
surprises.

> While reviewing I have found an extra bug in the logic: when using a
> list of tables, the number of parallel slots is the minimum between
> concurrentCons and tbl_count, but this does not get applied after
> building a list of tables for a schema or database reindex, so we had
> better recompute the number of items in reindex_one_database() before
> allocating the number of parallel slots.

I see that you iterate over the SimpleStringList after it's generated.
Why not compute that while building it in get_parallel_object_list
(and keep the provided table list count) instead?
RE: Multivariate MCV list vs. statistics target
On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:
> > +/* XXX What if the target is set to 0? Reset the statistic? */
> >
> > This also makes me wonder. I haven't looked deeply into the code, but
> > since 0 is a valid value, I believe it should reset the stats.
>
> I agree resetting the stats after setting the target to 0 seems quite
> reasonable. But that's not what we do for attribute stats, because in
> that case we simply skip the attribute during future ANALYZE runs - we
> don't reset the stats, we keep the existing stats. So I've done the same
> thing here, and I've removed the XXX comment.
>
> If we want to change that, I'd do it in a separate patch for both the
> regular and extended stats.

Hi, Tomas

Sorry for my late reply. You're right; I have no strong opinion on whether
we'd want to change that behavior. I've also confirmed the change in the
patch where setting the statistics target to 0 skips the statistics.

Maybe only some minor nitpicks in the source code comments below:

1. "it's" should be "its":

> + * Compute statistic target, based on what's set for the statistic
> + * object itself, and for it's attributes.

2. Consistency in whether you use "statistic " or "statisticS ".
Ex. statistic target vs. statisticS target, statistics object vs.
statistic object, etc.

> Attached is v4 of the patch, with a couple more improvements:
>
> 1) I've renamed the if_not_exists flag to missing_ok, because that's more
> consistent with the "IF EXISTS" clause in the grammar (the old flag was
> kinda the exact opposite), and I've added a NOTICE about the skip.

+	bool		missing_ok;	/* do nothing if statistics does not exist */

Confirmed. So we ignore the case where the statistics object does not
exist, and skip the error.
Maybe to make it consistent with the other data structures in parsenodes.h,
you can change the comment of missing_ok to:

  /* skip error if statistics object does not exist */

> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because
> that's what the function was doing anyway (computing sample size).
>
> 3) I've added a couple of regression tests to stats_ext.sql
>
> Aside from that, I've cleaned up a couple of places and improved a bunch
> of comments. Nothing huge.

I have a question though regarding ComputeExtStatisticsRows. I'm just
curious about the value 300 when computing the sample size. Where did this
value come from?

+	/* compute sample size based on the statistic target */
+	return (300 * result);

Overall, the patch is almost in good shape for commit. I'll wait for the
next update.

Regards,
Kirk Jamison
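For context on Kirk's question: the 300 multiplier mirrors the long-standing heuristic in ANALYZE's std_typanalyze(), which samples 300 × statistics_target rows (a bound motivated by Chaudhuri, Motwani and Narasayya's work on sample sizes for histogram construction). A toy version of the computation being discussed (names and the max-over-targets rule are an illustration of the patch's description, not the exact committed code):

```python
def compute_ext_statistics_rows(object_target, attribute_targets):
    # effective target: the object's own target if explicitly set (>= 0),
    # otherwise the maximum over its attributes' targets
    target = object_target if object_target >= 0 else max(attribute_targets)
    # same 300x heuristic as per-attribute ANALYZE sampling
    return 300 * target

print(compute_ext_statistics_rows(-1, [100, 10]))  # 30000
print(compute_ext_statistics_rows(10, [100, 10]))  # 3000
```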
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar wrote:
>
> On Tue, 23 Jul 2019 at 08:48, Amit Kapila wrote:
> > > --
> > > + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> > > I was thinking what happens if for some reason
> > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case,
> > > the entry will not be marked invalid, and so there will be no undo
> > > action carried out because I think the undo worker will exit. What
> > > happens next with this entry?
> >
> > The same entry is present in two queues, xid and size, so next time it
> > will be executed from the second queue based on its priority in that
> > queue. However, if it fails a second time in the same way, then we will
> > be in trouble, because now the hash table has the entry but neither of
> > the queues does, so none of the workers will attempt to execute it
> > again. Also, when the discard worker again tries to register it, we
> > won't allow adding the entry to a queue, thinking either some backend
> > is executing the same or it must already be part of some queue.
> >
> > One possibility to deal with this could be that we somehow allow the
> > discard worker to register it again in the queue, or we can do this in
> > a critical section so that it allows a system restart on error.
> > However, the main thing is: is it possible that
> > InsertRequestIntoErrorUndoQueue will fail unless there is some bug in
> > the code? If so, we might want to have an Assert for this rather than
> > handling this condition.
>
> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't-happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering: generally when the code errors out with such can't-happen
> elog() calls, the worst thing that happens is that the transaction gets
> aborted. Whereas, in this case, the worst thing that could happen is:
> the undo action would never get executed, which means selects for this
> tuple will keep on accessing the undo log?

Yeah, or in zheap, we have a page-wise rollback facility which rolls back
the transaction for a particular page (this gets triggered whenever we try
to update/delete a tuple which was last updated by an aborted xact, or
when we try to reuse a slot of an aborted xact), and we don't need to
traverse the undo chain.

> This does not sound like any data consistency issue, so we should be
> fine after all?

I will see if we can have an Assert in the code for this.

> --
> + if (UndoGetWork(false, false, , NULL) &&
> +     IsUndoWorkerAvailable())
> +     UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does not
> find any worker slot with in_use false, causing an assertion failure for
> (worker != NULL).
> --

Yeah, I think UndoWorkerLaunch should be able to return without launching
a worker in such a case.

> + if (!RegisterDynamicBackgroundWorker(, _handle))
> + {
> +     /* Failed to start worker, so clean up the worker slot. */
> +     LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> +     UndoWorkerCleanup(worker);
> +     LWLockRelease(UndoWorkerLock);
> +
> +     return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker?

I don't think it was intentional. I think it will be good to have a
warning here.

I agree with all your other comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
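The race Amit Khandekar points out, availability checked without a lock and the launch then asserting a free slot, is the classic check-then-act problem; holding the lock across both the check and the slot claim removes the window. A minimal Python sketch of that pattern (an illustration of the locking discipline, not the actual undo-worker code):

```python
import threading

class WorkerSlots:
    def __init__(self, nslots):
        self._lock = threading.Lock()
        self._in_use = [False] * nslots

    def try_acquire(self):
        # Check availability and claim the slot under one lock, so no
        # concurrent launcher can observe "available" and then find no
        # free slot a moment later.
        with self._lock:
            for i, used in enumerate(self._in_use):
                if not used:
                    self._in_use[i] = True
                    return i
            return None  # caller just skips launching instead of Assert-ing

slots = WorkerSlots(2)
print(slots.try_acquire())  # 0
print(slots.try_acquire())  # 1
print(slots.try_acquire())  # None
```

This matches Amit Kapila's conclusion: UndoWorkerLaunch should treat "no slot" as a normal outcome and return without launching, rather than asserting.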
RE: [Patch] PQconnectPoll() is blocked if target_session_attrs is read-write
Tsunakawa-san

Thank you for your comment. I understand your point; I don't require an
explicit rule.

Regards
Ryo Matsumura