Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Greg Nancarrow
On Thu, Feb 11, 2021 at 5:33 PM Greg Nancarrow  wrote:
>
> On Tue, Feb 9, 2021 at 1:04 AM Amit Langote  wrote:
> >
> > * I think that the concerns raised by Tsunakawa-san in:
> >
> > https://www.postgresql.org/message-id/TYAPR01MB2990CCB6E24B10D35D28B949FEA30%40TYAPR01MB2990.jpnprd01.prod.outlook.com
> >
> > regarding how this interacts with plancache.c deserve a look.
> > Specifically, a plan that uses parallel insert may fail to be
> > invalidated when partitions are altered directly (that is without
> > altering their root parent).  That would be because we are not adding
> > partition OIDs to PlannerGlobal.invalItems despite making a plan
> > that's based on checking their properties.  See this (tested with all
> > patches applied!):
> >
>
> Does any current Postgres code add partition OIDs to
> PlannerGlobal.invalItems for a similar reason?
> I would have thought that, for example,  partitions with a default
> column expression, using a function that is changed from SAFE to
> UNSAFE, would suffer the same plancache issue (for current parallel
> SELECT functionality) as we're talking about here - but so far I
> haven't seen any code handling this.
>
> (Currently invalItems seems to support PROCID and TYPEOID; relation
> OIDs seem to be handled through a different mechanism).
> Can you elaborate on what you believe is required here, so that the
> partition OID dependency is registered and the altered partition
> results in the plan being invalidated?
> Thanks in advance for any help you can provide here.
>

Actually, I tried adding the following in the loop that checks the
parallel-safety of each partition and it seemed to work:

glob->relationOids =
lappend_oid(glob->relationOids, pdesc->oids[i]);

Can you confirm, is that what you were referring to?
(note that I've already updated the code to use
CreatePartitionDirectory() and PartitionDirectoryLookup())
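
For context, roughly where that line sits (a sketch only -- the surrounding
parallel-safety checks are elided, and pdesc here comes from
PartitionDirectoryLookup()):

for (i = 0; i < pdesc->nparts; i++)
{
    Relation    part_rel = table_open(pdesc->oids[i], AccessShareLock);

    /* ... existing parallel-safety checks on this partition ... */

    /*
     * Record a plan dependency on the partition, so that ALTERing the
     * partition directly invalidates any cached plan relying on it.
     */
    glob->relationOids = lappend_oid(glob->relationOids, pdesc->oids[i]);

    table_close(part_rel, AccessShareLock);
}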

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Greg Nancarrow
On Tue, Feb 9, 2021 at 1:04 AM Amit Langote  wrote:
>
> * I think that the concerns raised by Tsunakawa-san in:
>
> https://www.postgresql.org/message-id/TYAPR01MB2990CCB6E24B10D35D28B949FEA30%40TYAPR01MB2990.jpnprd01.prod.outlook.com
>
> regarding how this interacts with plancache.c deserve a look.
> Specifically, a plan that uses parallel insert may fail to be
> invalidated when partitions are altered directly (that is without
> altering their root parent).  That would be because we are not adding
> partition OIDs to PlannerGlobal.invalItems despite making a plan
> that's based on checking their properties.  See this (tested with all
> patches applied!):
>

Does any current Postgres code add partition OIDs to
PlannerGlobal.invalItems for a similar reason?
I would have thought that, for example,  partitions with a default
column expression, using a function that is changed from SAFE to
UNSAFE, would suffer the same plancache issue (for current parallel
SELECT functionality) as we're talking about here - but so far I
haven't seen any code handling this.

(Currently invalItems seems to support PROCID and TYPEOID; relation
OIDs seem to be handled through a different mechanism).
Can you elaborate on what you believe is required here, so that the
partition OID dependency is registered and the altered partition
results in the plan being invalidated?
Thanks in advance for any help you can provide here.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Operands don't affect result (CONSTANT_EXPRESSION_RESULT) (src/backend/utils/adt/jsonfuncs.c)

2021-02-10 Thread Tom Lane
Ranier Vilela  writes:
> long: 4 bytes; long int / signed long int range: -2,147,483,648 to 2,147,483,647
> Therefore long can never be > INT_MAX on 64-bit Windows.
> Thus the lindex > INT_MAX and lindex < INT_MIN tests are always false in this expression:
> if (errno != 0 || badp == c || *badp != '\0' || lindex > INT_MAX ||  lindex
>  < INT_MIN)

Warnings about this are purest nanny-ism.

At the same time, I think this code could be improved; but the way
to do that is to use strtoint(), rather than kluging the choice of
datatype even further.
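
For illustration, a standalone sketch of the behavior strtoint() gives us
(this is only a demo of the idea, not the actual src/common/string.c
implementation):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Demo only: mimic the useful property of strtoint() -- the result is
 * range-checked against int regardless of how wide "long" happens to be
 * on the platform (32 bits on 64-bit Windows, 64 bits on LP64 Unix).
 */
static int
demo_strtoint(const char *str, char **endptr, int base)
{
    long        val = strtol(str, endptr, base);

    if (val != (int) val)
        errno = ERANGE;
    return (int) val;
}

int
main(void)
{
    char       *badp;
    int         lindex;

    errno = 0;
    lindex = demo_strtoint("2147483648", &badp, 10);
    if (errno != 0 || *badp != '\0')
        printf("rejected: out of range or not a valid integer\n");
    else
        printf("lindex = %d\n", lindex);
    return 0;
}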

regards, tom lane




Some regular-expression performance hacking

2021-02-10 Thread Tom Lane
As I mentioned in connection with adding the src/test/modules/test_regex
test code, I've been fooling with some performance improvements to our
regular expression engine.  Here's the first fruits of that labor.
This is mostly concerned with cutting the overhead for handling trivial
unconstrained patterns like ".*".

0001 creates the concept of a "rainbow" arc within regex NFAs.  You can
read background info about this in the "Colors and colormapping" part of
regex/README, but the basic point is that right now, representing a dot
(".", match anything) within an NFA requires a separate arc for each
"color" (character equivalence class) that the regex needs.  This uses
up a fair amount of storage and processing effort, especially in larger
regexes which tend to have a lot of colors.  We can replace such a
"rainbow" of arcs with a single arc labeled with a special color
RAINBOW.  This is worth doing on its own account, just because it saves
space and time.  For example, on the reg-33.15.1 test case in
test_regex.sql (a moderately large real-world RE), I find that HEAD
requires 1377614 bytes to represent the compiled RE, and the peak space
usage during pg_regcomp() is 3124376 bytes.  With this patch, that drops
to 1077166 bytes for the RE (21% savings) with peak compilation space
2800752 bytes (10% savings).  Moreover, the runtime for that test case
drops from ~57ms to ~44ms, a 22% savings.  (This is mostly measuring the
RE compilation time.  Execution time should drop a bit too since miss()
need consider fewer arcs; but that savings is in a cold code path so it
won't matter much.)  These aren't earth-shattering numbers of course,
but for the amount of code needed, it seems well worth while.

A possible point of contention is that I exposed the idea of a rainbow
arc in the regexport.h APIs, which will force consumers of that API
to adapt --- see the changes to contrib/pg_trgm for an example.  I'm
not too concerned about this because I kinda suspect that pg_trgm is
the only consumer of that API anywhere.  (codesearch.debian.net knows
of no others, anyway.)  We could in principle hide the change by
having the regexport functions expand a rainbow arc into one for
each color, but that seems like make-work.  pg_trgm would certainly
not see it as an improvement, and in general users of that API should
appreciate recognizing rainbows as such, since they might be able to
apply optimizations that depend on doing so.
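
For readers less familiar with the regex internals, here is a purely
conceptual sketch -- none of these names are the real engine's or
regexport.h's -- of what "one RAINBOW arc instead of one arc per color"
means for code walking the exported NFA:

/* Conceptual sketch only; not the real regex or regexport.h structures. */
typedef struct DemoArc
{
    int         co;         /* color label, or DEMO_RAINBOW */
    int         to;         /* target state number */
} DemoArc;

#define DEMO_RAINBOW (-2)   /* stands in for "an arc of every color" (a dot) */

static bool
demo_arc_matches(const DemoArc *arc, int color)
{
    /* previously a dot needed one arc per color; now a single arc suffices */
    return arc->co == DEMO_RAINBOW || arc->co == color;
}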

Which brings us to 0002, which is exactly such an optimization.
The idea here is to short-circuit character-by-character scanning
when matching a sub-NFA that is like "." or ".*" or variants of
that, ie it will match any sequence of some number of characters.
This requires the ability to recognize that a particular pair of
NFA states are linked by a rainbow, so it's a lot less painful
to do when rainbows are represented explicitly.  The example that
got me interested in this is adapted from a Tcl trouble report:

select array_dims(regexp_matches(repeat('x',40) || '=' || repeat('y',5),
 '^(.*)=(.*)$'));

On my machine, this takes about 6 seconds in HEAD, because there's an
O(N^2) effect: we try to match the sub-NFA for the first "(.*)" capture
group to each possible starting string, and only after expensively
verifying that tautological match do we check to see if the next
character is "=".  By not having to do any per-character work to decide
that .* matches a substring, the O(N^2) behavior is removed and the time
drops to about 7 msec.

(One could also imagine fixing this by rearranging things to check for
the "=" match before verifying the capture-group matches.  That's an
idea I hope to look into in future, because it could help for cases
where the variable parts are not merely ".*".  But I don't have clear
ideas about how to do that, and in any case ".*" is common enough that
the present change should still be helpful.)

There are two non-boilerplate parts of the 0002 patch.  One is the
checkmatchall() function that determines whether an NFA is match-all,
and if so what the min and max match lengths are.  This is actually not
very complicated once you understand what the regex engine does at the
"pre" and "post" states.  (See the "Detailed semantics" part of
regex/README for some info about that, which I tried to clarify as part
of the patch.)  Other than those endpoint conditions it's just a
recursive graph search.  The other hard part is the changes in
rege_dfa.c to provide the actual short-circuit behavior at runtime.
That's ticklish because it's trying to emulate some overly complicated
and underly documented code, particularly in longest() and shortest().
I think that stuff is right; I've studied it and tested it.  But it
could use more eyeballs.

Notably, I had to add some more test cases to test_regex.sql to exercise
the short-circuit part of matchuntil() properly.  That's only used for
lookbehind constraints, so we won't hit the short-circuit path except
with 

Re: 64-bit XIDs in deleted nbtree pages

2021-02-10 Thread Peter Geoghegan
On Wed, Feb 10, 2021 at 7:10 PM Peter Geoghegan  wrote:
> Attached is v3 of the patch. I'll describe the changes I made in more
> detail in my response to your points below.

I forgot to mention that v3 adds several assertions like this one:

Assert(!_bt_page_recyclable(BufferGetPage(buf)));

These appear at a few key points inside generic routines like
_bt_getbuf(). The overall effect is that every nbtree buffer access
(with the exception of buffer accesses by VACUUM) will make sure that
the page that they're about to access is not recyclable (a page that
an index scan lands on might be half-dead or deleted, but it had
better not be recyclable).
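
To give a feel for the placement, roughly (this is an illustration only,
not the v3 diff itself):

/* inside _bt_getbuf(), for reads of existing blocks */
if (blkno != P_NEW)
{
    buf = ReadBuffer(rel, blkno);
    /* ... lock and validate the page as the existing code already does ... */
    _bt_checkpage(rel, buf);

    /*
     * A scan may land on a half-dead or deleted page, but never on a
     * page that is already recyclable.
     */
    Assert(!_bt_page_recyclable(BufferGetPage(buf)));
}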

This can probably catch problems with recycling pages too early, such
as the problem fixed by commit d3abbbeb back in 2012. Any similar bugs
in this area that may appear in the future can be expected to be very
subtle, for a few reasons. For one, a page can be recyclable but not
yet entered into the FSM by VACUUM for a long time. (I could go on.)

The assertions dramatically improve our chances of catching problems
like that early.

-- 
Peter Geoghegan




Re: 64-bit XIDs in deleted nbtree pages

2021-02-10 Thread Peter Geoghegan
On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada  wrote:
> Thank you for working on this!

I'm glad that I finally found time for it! It seems like it'll make
things easier elsewhere.

Attached is v3 of the patch. I'll describe the changes I made in more
detail in my response to your points below.

> I agree that btm_oldest_btpo_xact will no longer be necessary in terms
> of recycling deleted pages.

Cool.

> Given that we can guarantee that deleted pages are never leaked by
> using 64-bit XIDs, I also think we don't need this value. We can do
> amvacuumcleanup only if the table receives enough insertions to update
> the statistics (i.e., the vacuum_cleanup_index_scale_factor check). I
> think this is a more desirable behavior. Not skipping amvacuumcleanup
> if there is even one deleted page that we can recycle is very
> conservative.
>
> Considering your idea of keeping newly deleted pages in the meta page,
> I can see a little value in keeping btm_oldest_btpo_xact and making
> it a 64-bit XID. I describe this below.

> Interesting.
>
> I like this idea of triggering btvacuumscan() if there are many newly
> deleted pages. I think this would be helpful especially for the case
> of bulk deletion on the table. But why do we use the number of *newly*
> deleted pages rather than the total number of deleted pages in the index?

I was unclear here -- I should not have said "newly deleted" pages at
all. What I actually do when calling _bt_vacuum_needs_cleanup() is
this (from v3, at the end of btvacuumscan()):

-   _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+   Assert(stats->pages_deleted >= stats->pages_free);
+   pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
+   _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
 info->num_heap_tuples);

We're actually passing something I have called
"pages_deleted_not_free" here, which is derived from the bulk delete
stats in the obvious way that you see here (subtraction). I'm not
using pages_newly_deleted at all now. Note also that the behavior
inside _bt_update_meta_cleanup_info() no longer varies based on
whether it is called during btvacuumcleanup() or during btbulkdelete()
-- the same rules apply either way. We want to store
pages_deleted_not_free in the metapage at the end of btvacuumscan(),
no matter what.

This same pages_deleted_not_free information is now used by
_bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too
high (over 2.5%), then that will trigger a call to btbulkdelete() (we
won't skip scanning the index). Though in practice it probably won't
come up that often -- there just aren't ever that many deleted pages
in most indexes.
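
To make that concrete, here is a rough sketch of the check inside
_bt_vacuum_needs_cleanup() (not quoting v3; the metapage field name is an
assumption):

BlockNumber prev_delpages = metad->btm_last_cleanup_num_delpages;   /* name assumed */

if (prev_delpages > 0 &&
    prev_delpages > RelationGetNumberOfBlocks(rel) * 0.025)
    return true;        /* don't skip: do a full btvacuumscan() to recycle pages */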

> IIUC if several btbulkdelete executions deleted index pages less than
> 2% of the index and those deleted pages could not be recycled yet,
> then the number of recyclable pages would exceed 2% of the index in
> total but amvacuumcleanup() would not trigger btvacuumscan() because
> the last newly deleted pages are less than the 2% threshold. I might
> be missing something though.

I think you're right -- my idea of varying the behavior of
_bt_update_meta_cleanup_info() based on whether it's being called
during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW
half the problem was that I explained the idea badly to begin with).
But, as I said, it's fixed in v3: we simply pass
"pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup()
now.

Does that make sense? Does it address this concern?

> Also, we need to note that having newly deleted pages doesn't
> necessarily mean they are always recyclable at that time. If the
> global xmin is still older than deleted page's btpo.xact values, we
> still could not recycle them. I think btm_oldest_btpo_xact probably
> will help this case. That is, we store the oldest btpo.xact among
> those newly deleted pages to btm_oldest_btpo_xact and we trigger
> btvacuumscan() if there are many newly deleted pages (more than 2% of
> index) and the btm_oldest_btpo_xact is older than the global xmin (I
> suppose each newly deleted page could have a different btpo.xact).

I agree that having no XID in the metapage creates a new small
problem. Specifically, there are certain narrow cases that can cause
confusion in _bt_vacuum_needs_cleanup(). These cases didn't really
exist before my patch (kind of).

The simplest example is easy to run into when debugging the patch on
your laptop. Because you're using your personal laptop, and not a real
production server, there will be no concurrent sessions that might
consume XIDs. You can run VACUUM VERBOSE manually several times, but
that alone will never be enough to enable VACUUM to recycle any of the
pages that the first VACUUM manages to delete (many to mark deleted,
reporting the pages as "newly deleted" via the new instrumentation
from the second patch). Note that the master branch is *also* unable
to recycle these deleted pages, simply because the "safe xid" never
gets

Re: Single transaction in the tablesync worker?

2021-02-10 Thread Peter Smith
I have again reviewed the latest patch (V31).

I found only a few minor nitpick issues not worth listing.

Then I ran the subscription TAP tests 50x in a loop as a kind of
stress test. That ran for 2.5hrs and the result was all 50x 'Result:
PASS'.

So V31 looks good to me.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2021-02-10 Thread Andy Fan
On Sun, Jan 24, 2021 at 6:26 PM Andy Fan  wrote:

> Hi Masahiko:
>
> On Fri, Jan 22, 2021 at 9:15 PM Masahiko Sawada 
> wrote:
>
>> Hi Andy,
>>
>> On Mon, Dec 7, 2020 at 9:15 PM Andy Fan  wrote:
>> >
>> >
>> >
>> > On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen <
>> jesper.peder...@redhat.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> On 12/5/20 10:38 PM, Andy Fan wrote:
>> >> > Currently the UniqueKey is defined as a List of Expr, rather than
>> >> > EquivalenceClasses.
>> >> > A complete discussion until now can be found at [1] (the messages I
>> >> > replied to also matter a lot, and the information there is complete).
>> >> > This patch has stopped at this place for a while; I'm planning to try
>> >> > EquivalenceClasses, but any suggestion would be welcome.
>> >> >
>> >> >
>> >>
>> >> Unfortunately I think we need a RfC style patch of both versions in
>> >> their minimum implementation.
>> >>
>> >> Hopefully this will make it easier for one or more committers to decide
>> >> on the right direction since they can do a side-by-side comparison of
>> >> the two solutions.
>> >>
>> >
>> > I do get the exact same idea.  Actually I made EquivalenceClasses
>> > work with baserels last weekend, and then I realized it is hard to
>> > compare the 2 situations without looking into the real/PoC code, even
>> > for very experienced people.  I will submit a new patch after I get the
>> > partitioned relation and subquery cases working.  Hope I can make it
>> > in one week.
>>
>> Status update for a commitfest entry.
>>
>> Are you planning to submit a new patch? Or is there any blocker for
>> this work? This patch entry on CF app has been in state Waiting on
>> Author for a while. If there is any update on that, please reflect on
>> CF app.
>>
>>
> I agree that the current status is "Waiting on author", and there is no
> blocking issue for others.
> I plan to work on this in 1 month.  I have to get my current urgent case
> completed first.
> Sorry for the delayed action, and thanks for asking.
>
>
>
I'll start continuing this work today. At the same time, I will split
the multi-patch series into some dedicated small chunks for easier
review.  The first one is just for adding a notnullattrs field to the
RelOptInfo struct, in thread [1].

https://www.postgresql.org/message-id/flat/CAKU4AWpQjAqJwQ2X-aR9g3%2BZHRzU1k8hNP7A%2B_mLuOv-n5aVKA%40mail.gmail.com


-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Re: 64-bit XIDs in deleted nbtree pages

2021-02-10 Thread Peter Geoghegan
On Tue, Feb 9, 2021 at 11:58 PM Heikki Linnakangas  wrote:
> Thanks for picking this up!

I actually had a patch for this in 2019, albeit one that remained in
rough shape until recently. Must have forgotten about it.

> Is it really worth the trouble to maintain 'level' on deleted pages? All
> you currently do with it is check that the BTP_LEAF flag is set iff
> "level == 0", which seems pointless. I guess there could be some
> forensic value in keeping 'level', but meh.

What trouble is that? The only way in which it's inconvenient is that
we have to include the level field in xl_btree_unlink_page WAL records
for the first time. The structure of the relevant REDO routine (which
is called btree_xlog_unlink_page()) ought to explicitly recreate the
original page from scratch, without any special cases. This makes it
possible to pretend that there never was such a thing as an nbtree
page whose level field could not be relied on. I personally think that
it's simpler when seen in the wider context of how the code works and
is verified.

Besides, there is also amcheck to consider. I am a big believer in
amcheck, and see it as something that has enabled my work on the
B-Tree code over the past few years. Preserving the level field in
deleted pages increases our coverage just a little, and practically
eliminates cases where we cannot rely on the level field.
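
(As one concrete example -- the spelling of the level field here is
assumed, not quoted from the patch -- this is the sort of invariant that
can now be asserted without special-casing deleted pages:)

/* valid for live, half-dead, and deleted pages alike under the patch */
Assert(P_ISLEAF(opaque) == (opaque->btpo_level == 0));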

Of course it's still true that this detail (the deleted pages level
field question) will probably never seem important to anybody else. To
me it's one small detail of a broader strategy. No one detail of that
broader strategy, taken in isolation, will ever be crucially
important.

Of course it's also true that we should not assume that a very high
cost in performance/code/whatever can justify a much smaller benefit
in amcheck. But you haven't really explained why the cost seems
unacceptable to you. (Perhaps I missed something.)

> How about:
>
> INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
> DETAIL:  25 index row versions and 686 pages were removed.
> 2056 index pages are now unused, 1370 are currently reusable.
>
> The idea is that the first DETAIL line now says what the VACUUM did this
> round, and the last line says what the state of the index is now. One
> concern with that phrasing is that it might not be clear what "686 pages
> were removed" means.

> It's still a bit weird that the "what VACUUM did this round" information
> is sandwiched between the two other lines that talk about the state of
> the index after the operation. But I think the language now makes it
> more clear which is which.

IMV our immediate goal for the new VACUUM VERBOSE output should be to
make the output as accurate and descriptive as possible (while still
using terminology that works for all index AMs, not just nbtree). I
don't think that we should give too much weight to making the
information easy to understand in isolation. Because that's basically
impossible -- it just doesn't work that way IME.

Confusion over the accounting of "deleted pages in indexes" vs "pages
deleted by this VACUUM" is not new. See my bugfix commit 73a076b0 to
see one vintage example. The relevant output of VACUUM VERBOSE
produced inconsistent results for perhaps as long as 15 years before I
noticed it and fixed it. I somehow didn't notice this despite using it
for various tests for my own B-Tree projects a year or two before the
fix. Tests that produced inconsistent results that I noticed pretty
early on, and yet assumed were all down to some subtlety that I didn't
yet understand.

My point is this: I am quite prepared to admit that these details
really are complicated. But that's not incidental to what's really
going on, or anything (though I agree with your later remarks on the
general tidiness of VACUUM VERBOSE -- it is a real dog's dinner).

I'm not saying that we should assume that no DBA will find the
relevant VACUUM VERBOSE output useful -- I don't think that at all. It
will be kind of rare for a user to really comb through it. But that's
mostly because big problems in this area are themselves kind of rare
(most individual indexes never have any deleted pages IME).

Any DBA consuming this output sensibly will consume it in a way that
makes sense in the *context of the problem that they're experiencing*,
whatever that might mean for them. They'll consider how it changes
over time for the same index. They'll try to correlate it with other
symptoms, or other problems, and make sense of it in a top-down
fashion. We should try to make it as descriptive as possible so that
DBAs will have the breadcrumbs they need to tie it back to whatever
the core issue happens to be -- maybe they'll have to read the source
code to get to the bottom of it. It's likely to be some rare issue in
those cases where the DBA really cares about the details -- it's
likely to be workload dependent.

Good DBAs spend much of their time on exceptional problems -- all the
easy problems will have been

Re: partial heap only tuples

2021-02-10 Thread Bossart, Nathan
On 2/10/21, 2:43 PM, "Bruce Momjian"  wrote:
> On Tue, Feb  9, 2021 at 06:48:21PM +, Bossart, Nathan wrote:
>> HOT works wonders when no indexed columns are updated.  However, as
>> soon as you touch one indexed column, you lose that optimization
>> entirely, as you must update every index on the table.  The resulting
>> performance impact is a pain point for many of our (AWS's) enterprise
>> customers, so we'd like to lend a hand for some improvements in this
>> area.  For workloads involving a lot of columns and a lot of indexes,
>> an optimization like PHOT can make a huge difference.  I'm aware that
>> there was a previous attempt a few years ago to add a similar
>> optimization called WARM [0] [1].  However, I only noticed this
>> previous effort after coming up with the design for PHOT, so I ended
>> up taking a slightly different approach.  I am also aware of a couple
>> of recent nbtree improvements that may mitigate some of the impact of
>> non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice
>> complement to those.  I've attached a very early proof-of-concept
>> patch with the design described below.
>
> How is your approach different from those of [0] and [1]?  It is
> interesting you still see performance benefits even after the btree
> duplication improvements.  Did you test with those improvements?

I believe one of the main differences is that index tuples will point
to the corresponding PHOT tuple instead of the root of the HOT/PHOT
chain.  I'm sure there are other differences.  I plan on giving those
two long threads another read-through in the near future.

I made sure that the btree duplication improvements were applied for
my benchmarking.  IIUC those don't alleviate the requirement that you
insert all index tuples for non-HOT updates, so PHOT can still provide
some added benefits there.

>> Next, I'll go into the design a bit.  I've commandeered the two
>> remaining bits in t_infomask2 to use as HEAP_PHOT_UPDATED and
>> HEAP_PHOT_TUPLE.  These are analogous to the HEAP_HOT_UPDATED and
>> HEAP_ONLY_TUPLE bits.  (If there are concerns about exhausting the
>> t_infomask2 bits, I think we could only use one of the remaining bits
>> as a "modifier" bit on the HOT ones.  I opted against that for the
>> proof-of-concept patch to keep things simple.)  When creating a PHOT
>> tuple, we only create new index tuples for updated columns.  These new
>> index tuples point to the PHOT tuple.  Following is a simple
>> demonstration with a table with two integer columns, each with its own
>> index:
>
> Whatever solution you have, you have to be able to handle
> adding/removing columns, and adding/removing indexes.

I admittedly have not thought too much about the implications of
adding/removing columns and indexes for PHOT yet, but that's
definitely an important part of this project that I need to look into.
I see that HOT has some special handling for commands like CREATE
INDEX that I can reference.
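
(As an aside on the two t_infomask2 bits mentioned in the quoted design:
here is a hedged sketch of what claiming them might look like. The
0x0800/0x1000 values are my assumption based on the "bits 0x1800 are
available" comment in htup_details.h; this is not taken from the
proof-of-concept patch.)

#define HEAP_PHOT_UPDATED       0x0800  /* tuple was PHOT-updated */
#define HEAP_PHOT_TUPLE         0x1000  /* this is a partial heap-only tuple */

#define HeapTupleIsPartialHeapOnly(tuple) \
    (((tuple)->t_data->t_infomask2 & HEAP_PHOT_TUPLE) != 0)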

>> When it is time to scan through a PHOT chain, there are a couple of
>> things to account for.  Sequential scans work out-of-the-box thanks to
>> the visibility rules, but other types of scans like index scans
>> require additional checks.  If you encounter a PHOT chain when
>> performing an index scan, you should only continue following the chain
>> as long as none of the columns the index indexes are modified.  If the
>> scan does encounter such a modification, we stop following the chain
>> and continue with the index scan.  Even if there is a tuple in that
>
> I think in patch [0] and [1], if an index column changes, all the
> indexes had to be inserted into, while you seem to require inserts only
> into the index that needs it.  Is that correct?

Right, PHOT only requires new index tuples for the modified columns.
However, I was under the impression that WARM aimed to do the same
thing.  I might be misunderstanding your question.

> I wonder if you should create a Postgres wiki page to document all of
> this.  I agree PG 15 makes sense.  I would like to help with this if I
> can.  I will need to study this email more later.

Thanks for taking a look.  I think a wiki is a good idea for keeping
track of the current state of the design.  I'll look into that.

Nathan



Re: TRUNCATE on foreign table

2021-02-10 Thread Kohei KaiGai
On Wed, Feb 10, 2021 at 1:55 PM Ashutosh Bapat  wrote:
>
> On Tue, Feb 9, 2021 at 7:45 PM Kazutaka Onishi  wrote:
> >
> > > IIUC, "truncatable" would be set to "false" for relations which do not
> > > have physical storage e.g. views but will be true for regular tables.
> >
> > The "truncatable" option is just for the foreign table, and it's not related
> > to whether it's on physical storage or not.
> > "postgres_fdw" already has an "updatable" option to make the table read-only.
> > However, "updatable" is for DML, and it's not suitable for TRUNCATE.
> > Therefore a new option, "truncatable", was added.
> >
> > Please refer to this message for details.
> > https://www.postgresql.org/message-id/20200128040346.GC1552%40paquier.xyz
> >
> > > DELETE is very different from TRUNCATE. Application may want to DELETE
> > > based on a join with a local table and hence it can not be executed on
> > > a foreign server. That's not true with TRUNCATE.
> >
> > Yeah, as you say, applications don't need TRUNCATE.
> > We're focusing on analytical use, namely operating on huge data.
> > TRUNCATE can erase rows faster than DELETE.
>
> The question is why can't that truncate be run on the foreign server
> itself rather than local server?
>
At least, PostgreSQL applies a different access permission to TRUNCATE.
If an unconditional DELETE were implicitly promoted to TRUNCATE, the DB
administrator would also have to grant TRUNCATE permission on the remote
table.

Also, TRUNCATE acquires a stronger lock than DELETE.
DELETE still allows concurrent access to the table, whereas TRUNCATE
takes an AccessExclusive lock; thus, the FDW driver would have to control
concurrent access by itself if we had no dedicated TRUNCATE interface.
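
(For concreteness, the sort of dedicated interface being discussed might
be an FDW callback along these lines -- the name and signature are my
assumption for illustration, not the proposed patch:)

/* hypothetical addition to FdwRoutine */
typedef void (*ExecForeignTruncate_function) (List *rels,
                                              DropBehavior behavior,
                                              bool restart_seqs);

That would let the FDW issue a single remote TRUNCATE under the remote
table's own permission and locking rules.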

Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: doc review for v14

2021-02-10 Thread Justin Pryzby
Another round of doc fixen.

wdiff to follow

commit 389c4ac2febe21fd48480a86819d94fd2eb9c1cc
Author: Justin Pryzby 
Date:   Wed Feb 10 17:19:51 2021 -0600

doc review for pg_stat_progress_create_index

ab0dfc961b6a821f23d9c40c723d11380ce195a6

should backpatch to v13

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..16eb1d9e9c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5725,7 +5725,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
  
  
   When creating an index on a partitioned table, this column is set to
   the number of partitions on which the index has been 
[-completed.-]{+created.+}
  
 


commit bff6f0b557ff79365fc21d0ae261bad0fcb96539
Author: Justin Pryzby 
Date:   Sat Feb 6 15:17:51 2021 -0600

*an old and "deleted [has] happened"

Heikki missed this in 6b387179baab8d0e5da6570678eefbe61f3acc79

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 3763b4b995..a51f2c9920 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -6928,8 +6928,8 @@ Delete



Identifies the following TupleData message as [-a-]{+an+} old 
tuple.
This field is present if the table in which the delete[-has-]
happened has REPLICA IDENTITY set to FULL.



commit 9bd601fa82ceeaf09573ce31eb3c081b4ae7a45d
Author: Justin Pryzby 
Date:   Sat Jan 23 21:03:37 2021 -0600

doc review for logical decoding of prepared xacts

0aa8a01d04c8fe200b7a106878eebc3d0af9105c

diff --git a/doc/src/sgml/logicaldecoding.sgml 
b/doc/src/sgml/logicaldecoding.sgml
index b854f2ccfc..71e9f36b8e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -791,9 +791,9 @@ typedef void (*LogicalDecodeMessageCB) (struct 
LogicalDecodingContext *ctx,
 
   The optional filter_prepare_cb callback
   is called to determine whether data that is part of the current
   two-phase commit transaction should be considered for 
[-decode-]{+decoding+}
   at this prepare stage or {+later+} as a regular one-phase transaction at
   COMMIT PREPARED [-time later.-]{+time.+} To signal 
that
   decoding should be skipped, return true;
   false otherwise. When the callback is not
   defined, false is assumed (i.e. nothing is
@@ -820,11 +820,11 @@ typedef bool (*LogicalDecodeFilterPrepareCB) (struct 
LogicalDecodingContext *ctx
  The required begin_prepare_cb callback is called
  whenever the start of a prepared transaction has been decoded. The
  gid field, which is part of the
  txn [-parameter-]{+parameter,+} can be used in 
this callback to
  check if the plugin has already received this [-prepare-]{+PREPARE+} in 
which case it
  can skip the remaining changes of the transaction. This can only happen
  if the user restarts the decoding after receiving the 
[-prepare-]{+PREPARE+} for a
  transaction but before receiving the [-commit prepared-]{+COMMIT 
PREPARED,+} say because of some
  error.
  
   typedef void (*LogicalDecodeBeginPrepareCB) (struct 
LogicalDecodingContext *ctx,
@@ -842,7 +842,7 @@ typedef bool (*LogicalDecodeFilterPrepareCB) (struct 
LogicalDecodingContext *ctx
  decoded. The change_cb callback for all modified
  rows will have been called before this, if there have been any modified
  rows. The gid field, which is part of the
  txn [-parameter-]{+parameter,+} can be used in 
this callback.
  
   typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext 
*ctx,
   ReorderBufferTXN *txn,
@@ -856,9 +856,9 @@ typedef bool (*LogicalDecodeFilterPrepareCB) (struct 
LogicalDecodingContext *ctx

 
  The required commit_prepared_cb callback is called
  whenever a transaction [-commit prepared-]{+COMMIT PREPARED+} has been 
decoded. The
  gid field, which is part of the
  txn [-parameter-]{+parameter,+} can be used in 
this callback.
  
   typedef void (*LogicalDecodeCommitPreparedCB) (struct 
LogicalDecodingContext *ctx,
  ReorderBufferTXN *txn,
@@ -872,15 +872,15 @@ typedef bool (*LogicalDecodeFilterPrepareCB) (struct 
LogicalDecodingContext *ctx

 
  The required rollback_prepared_cb callback is called
  whenever a transaction [-rollback prepared-]{+ROLLBACK PREPARED+} has 
been decoded. The
  gid field, which is part of the
  txn [-parameter-]{+parameter,+} can be used in 
this callback. The
  parameters prepare_end_lsn and
  prepare_time can be used to check if the plugin
  has received this [-prepare transaction-]{+PREPARE TRANSACTION+} in which 
case it can apply the
  rollback, otherwise, it can skip the rollback operation. The
  gid alone is not sufficient because the downstream
  node can have {+a+} prepar

Re: PATCH: Batch/pipelining support for libpq

2021-02-10 Thread Alvaro Herrera
On 2021-Jan-21, Alvaro Herrera wrote:

> As you can see in an XXX comment in the libpq test program, the current
> implementation has the behavior that PQgetResult() returns NULL after a
> batch is finished and has reported PGRES_BATCH_END.  I don't know if
> there's a hard reason to do that, but I'd like to supress it because it
> seems weird and out of place.

Hello Craig, a question for you since this API is your devising.  I've
been looking at changing the way this works to prevent those NULL
returns from PQgetResult.  That is, instead of having what seems like a
"NULL separator" of query results, you'd just get the PGRES_BATCH_END to
signify a batch end (not a NULL result after the BATCH_END); and the
normal PGRES_COMMAND_OK or PGRES_TUPLES_OK etc when the result of a
command has been sent.  It's not working yet so I'm not sending an
updated patch, but I wanted to know if you had a rationale for including
this NULL return "separator" or was it just a convenience because of how
the code grew together.

Such a decision has obvious consequences for the test program (which
right now expects that PQgetResult() returns NULL at each step); and
naturally for libpq's internal state machine that controls how it all
works.

Thanks,

-- 
Álvaro Herrera39°49'30"S 73°17'W




Operands don't affect result (CONSTANT_EXPRESSION_RESULT) (src/backend/utils/adt/jsonfuncs.c)

2021-02-10 Thread Ranier Vilela
Hi,

Per Coverity.

The use of type "long" is problematic on 64-bit Windows, where long is
only 32 bits.

See at:
https://docs.microsoft.com/pt-br/cpp/cpp/data-type-ranges?view=msvc-160


long: 4 bytes; long int / signed long int range: -2,147,483,648 to 2,147,483,647
Therefore long can never be > INT_MAX on 64-bit Windows.

Thus the lindex > INT_MAX and lindex < INT_MIN tests are always false in this expression:
if (errno != 0 || badp == c || *badp != '\0' || lindex > INT_MAX ||  lindex
 < INT_MIN)

Patch suggestion to fix this.

diff --git a/src/backend/utils/adt/jsonfuncs.c
b/src/backend/utils/adt/jsonfuncs.c
index 215a10f16e..54b0eded76 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -1675,7 +1675,7 @@ push_path(JsonbParseState **st, int level, Datum
*path_elems,
  * end, the access index must be normalized by level.
  */
  enum jbvType *tpath = palloc0((path_len - level) * sizeof(enum jbvType));
- long lindex;
+ int64 lindex;
  JsonbValue newkey;

  /*

regards,
Ranier Vilela


jsonfuncs.patch
Description: Binary data


Re: Thoughts on "killed tuples" index hint bits support on standby

2021-02-10 Thread Michail Nikolaev
Hello, Peter.

If you are interested, a possible patch (based on masking FPIs during
replay) was sent with some additional explanation and graphics to [1].
At the moment I am unable to find any "incorrectness" in it.

Thanks again for your comments.

Michail.


[1] 
https://www.postgresql.org/message-id/flat/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com#4c81a4d623d8152f5e8889e97e750eec




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-02-10 Thread Michail Nikolaev
Hello, everyone.

After some correspondence with Peter Geoghegan [1] and his ideas, I
have reworked the patch a lot, and now it is much simpler with even
better performance (no new WAL or conflict resolution; hot standby
feedback is unrelated).

The idea is pretty simple now: mark pages whose LP_DEAD hints are
"standby-safe" with a bit in btpo_flags (BTP_LP_SAFE_ON_STANDBY, and
similar for gist and hash).

If the standby wants to set LP_DEAD, it checks BTP_LP_SAFE_ON_STANDBY on
the page first; if it is not set, all "primary" hints are removed first,
and then the flag is set (with a memory barrier to avoid memory-ordering
issues in concurrent scans).
Also, the standby checks BTP_LP_SAFE_ON_STANDBY to decide whether it may
trust, and therefore ignore, tuples marked by LP_DEAD during a scan.
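
A minimal sketch of that rule (not the actual patch; the helper that
clears existing hints is made up):

static void
mark_lp_safe_on_standby(Relation rel, Page page)
{
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    if ((opaque->btpo_flags & BTP_LP_SAFE_ON_STANDBY) == 0)
    {
        /* LP_DEAD hints set on the primary cannot be trusted here */
        clear_index_lp_dead_hints(rel, page);   /* hypothetical helper */
        pg_memory_barrier();
        opaque->btpo_flags |= BTP_LP_SAFE_ON_STANDBY;
    }
}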

Of course, it is not so easy. If the standby was promoted (or the
primary was restored from a standby backup), it is still possible to
receive an FPI with such a flag set in the WAL. So, the main problem is
still there.

But we could just clear this flag while applying the FPI, because the
page remains dirty after that anyway! It should not cause any checksum,
consistency, or pg_rewind issues, as explained in [2].
Semantically it is the same as setting a hint bit one millisecond after
the FPI was applied (while the page still remains dirty after FPI
replay) - and the standby already does this with *heap* hint bits.

Also, the TAP test attached to [2] shows how easy it already is to flush
a hint bit that was set by the standby and end up with a checksum
different from the primary's.

If the standby was promoted (or restored from a standby backup), it is
safe to use LP_DEAD with or without BTP_LP_SAFE_ON_STANDBY on a page. But
for accuracy, BTP_LP_SAFE_ON_STANDBY is cleared by the primary if found.

Also, we should take minRecoveryPoint into account, as described in [3],
to avoid consistency issues during crash recovery (see
IsIndexLpDeadAllowed).

Also, as far as I know, there is no practical reason to keep
minRecoveryPoint at a low value. So, there is an optional patch that
moves minRecoveryPoint forward at each xl_running_xacts record (to allow
the standby to set hint bits and LP_DEADs more aggressively). That is
about every 15s.

There are some graphics in the attachment showing performance-testing
results on my PC (the test is taken from [4]). Each test ran for 10
minutes.
The additional primary performance is probably just measurement error,
but the standby performance gain is huge.

Feel free to ask if you need more proof of correctness.

Thanks,
Michail.

[1] - 
https://www.postgresql.org/message-id/flat/CAH2-Wz%3D-BoaKgkN-MnKj6hFwO1BOJSA%2ByLMMO%2BLRZK932fNUXA%40mail.gmail.com#6d7cdebd68069cc493c11b9732fd2040
[2] - 
https://www.postgresql.org/message-id/flat/CANtu0oiAtteJ%2BMpPonBg6WfEsJCKrxuLK15P6GsaGDcYGjefVQ%40mail.gmail.com#091fca433185504f2818d5364819f7a4
[3] - 
https://www.postgresql.org/message-id/flat/CANtu0oh28mX5gy5jburH%2Bn1mcczK5_dCQnhbBnCM%3DPfqh-A26Q%40mail.gmail.com#ecfe5a331a3058f895c0cba698fbc4d3
[4] - 
https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..7ebfa0d1a7 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1070,6 +1070,12 @@ standby_redo(XLogReaderState *record)
 		running.xids = xlrec->xids;
 
 		ProcArrayApplyRecoveryInfo(&running);
+		if (InHotStandby)
+		{
+			/* Move minRecoveryPoint forward to allow standby set
+			 * hint bits and index-LP_DEAD more aggressively. */
+			XLogFlush(record->currRecPtr);
+		}
 	}
 	else if (info == XLOG_INVALIDATIONS)
 	{
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 92205325fb..14e547ee6b 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -653,17 +653,23 @@ lax about how same-level locks are acquired during recovery (most kinds
 of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
-During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
-on the standby server can be older than the oldest xmin on the primary
-server, which means tuples can be marked LP_DEAD even when they are
-still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
-they can still appear in the standby because of full page writes. So
-we must always ignore them in standby, and that means it's not worth
-setting them either.  (When LP_DEAD-marked tuples are eventually deleted
-on the primary, the deletion is WAL-logged.  Queries that run on a
-standby therefore get much of the benefit of any LP_DEAD setting that
-takes place on the primary.)
+There is some complexity in using LP_DEAD bits during recovery. Generally,
+bits could be set and read by scan, but there is a possibility to meet
+the bit applied on the primary. We don't WAL log tuple L

Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Greg Nancarrow
On Wed, Feb 10, 2021 at 8:59 PM Hou, Zhijie  wrote:
>
> > > >
> > > > else if (IsA(node, Query))
> > > > {
> > > > Query  *query = (Query *) node;
> > > >
> > > > /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > > > if (query->rowMarks != NULL)
> > > > {
> > > > context->max_hazard = PROPARALLEL_UNSAFE;
> > > > return true;
> > > > }
> > >
> > > In my understanding, the max_parallel_hazard() query tree walk is to
> > > find constructs that are parallel unsafe in that they call code that
> > > can't run in parallel mode.  For example, FOR UPDATE/SHARE on
> > > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > > support being called in parallel mode.  Likewise ON CONFLICT ... DO
> > > UPDATE calls heap_update() which doesn't support parallelism.  I'm not
> > > aware of any such hazards downstream of ExecValuesScan().
> > >
> > > > >You're trying to
> > > > > add something that bails based on second-guessing that a parallel
> > > > >path  would not be chosen, which I find somewhat objectionable.
> > > > >
> > > > > If the main goal of bailing out is to avoid doing the potentially
> > > > > expensive modification safety check on the target relation, maybe
> > > > > we should try to somehow make the check less expensive.  I
> > > > > remember reading somewhere in the thread about caching the result
> > > > > of this check in relcache, but haven't closely studied the feasibility
> > of doing so.
> > > > >
> > > >
> > > > There's no "second-guessing" involved here.
> > > > There is no underlying way of dividing up the VALUES data of
> > > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > > was updated to produce a parallel-plan for the "INSERT...VALUES"
> > > > case (apart from the fact that spawning off parallel workers to
> > > > insert that data would almost always result in worse performance
> > > > than a non-parallel plan...) The division of work for parallel
> > > > workers is part of the table AM
> > > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> > >
> > > I don't disagree that the planner would not normally assign a parallel
> > > path simply to pull values out of a VALUES list mentioned in the
> > > INSERT command, but deciding something based on the certainty of it in
> > > an earlier planning phase seems odd to me.  Maybe that's just me
> > > though.
> > >
> >
> > I think it is more of a case where neither is a need for parallelism nor
> > we want to support parallelism of it. The other possibility for such a check
> > could be at some earlier phase say in standard_planner [1] where we are
> > doing checks for other constructs where we don't want parallelism (I think
> > the check for 'parse->hasModifyingCTE' is quite similar). If you see in
> > that check as well we just assume other operations to be in the category
> > of parallel-unsafe. I think we should rule out such cases earlier than 
> > later.
> > Do you have better ideas than what Greg has done to avoid parallelism for
> > such cases?
> >
> > [1] -
> > standard_planner()
> > {
> > ..
> > if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 && IsUnderPostmaster &&
> > parse->commandType == CMD_SELECT &&
> > !parse->hasModifyingCTE &&
> > max_parallel_workers_per_gather > 0 &&
> > !IsParallelWorker())
> > {
> > /* all the cheap tests pass, so scan the query tree */
> > glob->maxParallelHazard = max_parallel_hazard(parse); parallelModeOK =
> > glob->(glob->maxParallelHazard != PROPARALLEL_UNSAFE);
> > }
> > else
> > {
> > /* skip the query tree scan, just assume it's unsafe */
> > glob->maxParallelHazard = PROPARALLEL_UNSAFE; parallelModeOK = false;
> > }
>
> +1.
>
> In the current parallel_dml option patch, I put this check and some
> high-level checks in a separate function called
> is_parallel_possible_for_modify.
>
>
> - * PROPARALLEL_UNSAFE, PROPARALLEL_RESTRICTED, PROPARALLEL_SAFE
> + * Check at a high-level if parallel mode is able to be used for the 
> specified
> + * table-modification statement.
> + * It's not possible in the following cases:
> + *
> + *  1) enable_parallel_dml is off
> + *  2) UPDATE or DELETE command
> + *  3) INSERT...ON CONFLICT...DO UPDATE
> + *  4) INSERT without SELECT on a relation
> + *  5) the reloption parallel_dml_enabled is not set for the target table
> + *
> + * (Note: we don't do in-depth parallel-safety checks here, we do only the
> + * cheaper tests that can quickly exclude obvious cases for which
> + * parallelism isn't supported, to avoid having to do further parallel-safety
> + * checks for these)
>   */
> +bool
> +is_parallel_possible_for_modify(Query *parse)
>
>
>
>
>
> And I put the function at an earlier place, like the following:
>
>
> if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> IsUnderPostmaster &&
> (parse->commandType == CMD_SELECT ||
> -   (enable_parallel_dml &&
> -   

Possible dereference null return (src/backend/replication/logical/reorderbuffer.c)

2021-02-10 Thread Ranier Vilela
Hi,

Per Coverity.

If xid is a subtransaction, the lookup of its top-level transaction (to
set up the base snapshot there) must not be treated as optional;
otherwise ReorderBufferTXNByXid() can return NULL and a null return
value can be dereferenced (NULL_RETURNS).

Patch suggestion to fix this.

diff --git a/src/backend/replication/logical/reorderbuffer.c
b/src/backend/replication/logical/reorderbuffer.c
index 5a62ab8bbc..3c6a81f716 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2993,8 +2993,8 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb,
TransactionId xid,
  */
  txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
  if (rbtxn_is_known_subxact(txn))
- txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
- NULL, InvalidXLogRecPtr, false);
+ txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, true,
+ NULL, InvalidXLogRecPtr, true);
  Assert(txn->base_snapshot == NULL);

  txn->base_snapshot = snap;

regards,
Ranier Vilela


reorderbuffer.patch
Description: Binary data


Possible dereference after null check (src/backend/executor/ExecUtils.c)

2021-02-10 Thread Ranier Vilela
Hi,

Per Coverity.

The functions ExecGetInsertedCols and ExecGetUpdatedCols in execUtils.c
are only safe to call if the field "ri_RangeTableIndex" is != 0.

Otherwise, a possible Dereference after null check (FORWARD_NULL) can be
raised.

Example:

void
ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
                            TupleTableSlot *slot,
                            EState *estate)
{
    Oid         root_relid;
    TupleDesc   tupdesc;
    char       *val_desc;
    Bitmapset  *modifiedCols;

    /*
     * If the tuple has been routed, it's been converted to the partition's
     * rowtype, which might differ from the root table's.  We must convert it
     * back to the root table's rowtype so that val_desc in the error message
     * matches the input tuple.
     */

    /*
     * Coverity: 1. Condition resultRelInfo->ri_RootResultRelInfo, taking
     * false branch.
     * Coverity: 2. var_compare_op: Comparing resultRelInfo->ri_RootResultRelInfo
     * to null implies that resultRelInfo->ri_RootResultRelInfo might be null.
     */
    if (resultRelInfo->ri_RootResultRelInfo)
    {
        ResultRelInfo *rootrel = resultRelInfo->ri_RootResultRelInfo;
        TupleDesc   old_tupdesc;
        AttrMap    *map;

        root_relid = RelationGetRelid(rootrel->ri_RelationDesc);
        tupdesc = RelationGetDescr(rootrel->ri_RelationDesc);

        old_tupdesc = RelationGetDescr(resultRelInfo->ri_RelationDesc);
        /* a reverse map */
        map = build_attrmap_by_name_if_req(old_tupdesc, tupdesc);

        /*
         * Partition-specific slot's tupdesc can't be changed, so allocate a
         * new one.
         */
        if (map != NULL)
            slot = execute_attr_map_slot(map, slot,
                                         MakeTupleTableSlot(tupdesc, &TTSOpsVirtual));
        modifiedCols = bms_union(ExecGetInsertedCols(rootrel, estate),
                                 ExecGetUpdatedCols(rootrel, estate));
    }
    else
    {
        root_relid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
        tupdesc = RelationGetDescr(resultRelInfo->ri_RelationDesc);

        /*
         * Coverity: CID 1446241 (#1 of 1): Dereference after null check
         * (FORWARD_NULL). 3. var_deref_model: Passing resultRelInfo to
         * ExecGetInsertedCols, which dereferences null
         * resultRelInfo->ri_RootResultRelInfo.
         */
        modifiedCols = bms_union(ExecGetInsertedCols(resultRelInfo, estate),
                                 ExecGetUpdatedCols(resultRelInfo, estate));
    }

    val_desc = ExecBuildSlotValueDescription(root_relid,
                                             slot,
                                             tupdesc,
                                             modifiedCols,
                                             64);
    ereport(ERROR,
            (errcode(ERRCODE_CHECK_VIOLATION),
             errmsg("new row for relation \"%s\" violates partition constraint",
                    RelationGetRelationName(resultRelInfo->ri_RelationDesc)),
             val_desc ? errdetail("Failing row contains %s.", val_desc) : 0,
             errtable(resultRelInfo->ri_RelationDesc)));
}

regards,
Ranier Vilela


Re: partial heap only tuples

2021-02-10 Thread Bruce Momjian
On Tue, Feb  9, 2021 at 06:48:21PM +, Bossart, Nathan wrote:
> Hello,
> 
> I'm hoping to gather some early feedback on a heap optimization I've
> been working on.  In short, I'm hoping to add "partial heap only
> tuple" (PHOT) support, which would allow you to skip updating indexes
> for unchanged columns even when other indexes require updates.  Today,

I think it is great you are working on this.  I think it is a major way
to improve performance and I have been disappointed it has not moved
forward since 2016.

> HOT works wonders when no indexed columns are updated.  However, as
> soon as you touch one indexed column, you lose that optimization
> entirely, as you must update every index on the table.  The resulting
> performance impact is a pain point for many of our (AWS's) enterprise
> customers, so we'd like to lend a hand for some improvements in this
> area.  For workloads involving a lot of columns and a lot of indexes,
> an optimization like PHOT can make a huge difference.  I'm aware that
> there was a previous attempt a few years ago to add a similar
> optimization called WARM [0] [1].  However, I only noticed this
> previous effort after coming up with the design for PHOT, so I ended
> up taking a slightly different approach.  I am also aware of a couple
> of recent nbtree improvements that may mitigate some of the impact of
> non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice
> complement to those.  I've attached a very early proof-of-concept
> patch with the design described below.

How is your approach different from those of [0] and [1]?  It is
interesting you still see performance benefits even after the btree
duplication improvements.  Did you test with those improvements?

> As far as performance is concerned, it is simple enough to show major
> benefits from PHOT by tacking on a large number of indexes and columns
> to a table.  For a short pgbench run where each table had 5 additional
> text columns and indexes on every column, I noticed a ~34% bump in
> TPS with PHOT [4].  Theoretically, the TPS bump should be even higher

That's a big improvement.

> Next, I'll go into the design a bit.  I've commandeered the two
> remaining bits in t_infomask2 to use as HEAP_PHOT_UPDATED and
> HEAP_PHOT_TUPLE.  These are analogous to the HEAP_HOT_UPDATED and
> HEAP_ONLY_TUPLE bits.  (If there are concerns about exhausting the
> t_infomask2 bits, I think we could only use one of the remaining bits
> as a "modifier" bit on the HOT ones.  I opted against that for the
> proof-of-concept patch to keep things simple.)  When creating a PHOT
> tuple, we only create new index tuples for updated columns.  These new
> index tuples point to the PHOT tuple.  Following is a simple
> demonstration with a table with two integer columns, each with its own
> index:

Whatever solution you have, you have to be able to handle
adding/removing columns, and adding/removing indexes.

> When it is time to scan through a PHOT chain, there are a couple of
> things to account for.  Sequential scans work out-of-the-box thanks to
> the visibility rules, but other types of scans like index scans
> require additional checks.  If you encounter a PHOT chain when
> performing an index scan, you should only continue following the chain
> as long as none of the columns the index indexes are modified.  If the
> scan does encounter such a modification, we stop following the chain
> and continue with the index scan.  Even if there is a tuple in that

I think in patch [0] and [1], if an index column changes, all the
indexes had to be inserted into, while you seem to require inserts only
into the index that needs it.  Is that correct?

> PHOT chain that should be returned by our index scan, we will still
> find it, as there will be another matching index tuple that points us
> to later in the PHOT chain.  My initial idea for determining which
> columns were modified was to add a new bitmap after the "nulls" bitmap
> in the tuple header.  However, the attached patch simply uses
> HeapDetermineModifiedColumns().  I've yet to measure the overhead of
> this approach versus the bitmap approach, but I haven't noticed
> anything too detrimental in the testing I've done so far.

A bitmap is an interesting approach, but you are right it will need
benchmarking.

I wonder if you should create a Postgres wiki page to document all of
this.  I agree PG 15 makes sense.  I would like to help with this if I
can.  I will need to study this email more later.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: automatic analyze: readahead - add "IO read time" log message

2021-02-10 Thread Stephen Frost
Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 05/02/2021 23:22, Stephen Frost wrote:
> >Unless there's anything else on this, I'll commit these sometime next
> >week.
> 
> One more thing: instead of using the 'effective_io_concurrency' GUC directly,
> we should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.

Updated patch attached.

Thanks!

Stephen
From 5aa55b44230474e6e61873260e9595a5495a1094 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Fri, 5 Feb 2021 15:59:02 -0500
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze

When logging auto-vacuum and auto-analyze activity, include the I/O
timing if track_io_timing is enabled.  Also, for auto-analyze, add the
read rate and the dirty rate, similar to how that information has
historically been logged for auto-vacuum.

Stephen Frost and Jakub Wartak

Reviewed-By: Heikki Linnakangas
Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 doc/src/sgml/config.sgml |   8 ++-
 src/backend/access/heap/vacuumlazy.c |  18 +
 src/backend/commands/analyze.c   | 100 +--
 3 files changed, 116 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..2da2f2e32a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7411,9 +7411,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 I/O timing information is
 displayed in 
 pg_stat_database, in the output of
- when the BUFFERS option is
-used, and by .  Only superusers can
-change this setting.
+ when the BUFFERS option
+is used, by autovacuum for auto-vacuums and auto-analyzes, when
+ is set and by
+.  Only superusers can change this
+setting.

   
  
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..3a5e5a1ac2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -440,6 +440,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	TransactionId new_frozen_xid;
 	MultiXactId new_min_multi;
 	ErrorContextCallback errcallback;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
@@ -454,6 +456,11 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	{
 		pg_rusage_init(&ru0);
 		starttime = GetCurrentTimestamp();
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
 	}
 
 	if (params->options & VACOPT_VERBOSE)
@@ -674,6 +681,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			 (long long) VacuumPageDirty);
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
+			if (track_io_timing)
+			{
+appendStringInfoString(&buf, _("I/O Timings:"));
+if (pgStatBlockReadTime - startreadtime > 0)
+	appendStringInfo(&buf, _(" read=%.3f"),
+	 (double) (pgStatBlockReadTime - startreadtime) / 1000);
+if (pgStatBlockWriteTime - startwritetime > 0)
+	appendStringInfo(&buf, _(" write=%.3f"),
+	 (double) (pgStatBlockWriteTime - startwritetime) / 1000);
+appendStringInfoChar(&buf, '\n');
+			}
 			appendStringInfo(&buf, _("system usage: %s\n"), pg_rusage_show(&ru0));
 			appendStringInfo(&buf,
 			 _("WAL usage: %ld records, %ld full page images, %llu bytes"),
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..0864cda80e 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -312,6 +312,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	int64		AnalyzePageHit = VacuumPageHit;
+	int64		AnalyzePageMiss = VacuumPageMiss;
+	int64		AnalyzePageDirty = VacuumPageDirty;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
 
 	if (inh)
 		ereport(elevel,
@@ -346,8 +351,14 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
+
 		pg_rusage_init(&ru0);
-		if (params->log_min_duration > 0)
+		if (params->log_min_duration >= 0)
 			starttime = GetCurrentTimestamp();
 	}
 
@@ -682,15 +693,90 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	/* Log the action if appropriate */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
+		TimestampTz endtime = GetCurrentTimestamp();
+
 		if (params->log_min_duration == 0 ||
-			TimestampDifferen

Re: [HACKERS] Custom compression methods

2021-02-10 Thread Robert Haas
On Wed, Feb 10, 2021 at 3:06 PM Robert Haas  wrote:
> I think you have a fairly big problem with row types. Consider this example:
>
> create table t1 (a int, b text compression pglz);
> create table t2 (a int, b text compression lz4);
> create table t3 (x t1);
> insert into t1 values (1, repeat('foo', 1000));
> insert into t2 values (1, repeat('foo', 1000));
> insert into t3 select t1 from t1;
> insert into t3 select row(a, b)::t1 from t2;
>
> rhaas=# select pg_column_compression((t3.x).b) from t3;
>  pg_column_compression
> ---
>  pglz
>  lz4
> (2 rows)
>
> That's not good, because now

...because now I hit send too soon. Also, because now column b has
implicit dependencies on both compression AMs and the rest of the
system has no idea. I think we probably should have a rule that
nothing except pglz is allowed inside of a record, just to keep it
simple. The record overall can be toasted so it's not clear why we
should also be toasting the original columns at all, but I think
precedent probably argues for continuing to allow PGLZ, as it can
already be like that on disk and pg_upgrade is a thing. The same kind
of issue probably exists for arrays and range types.

I poked around a bit trying to find ways of getting data compressed
with one compression method into a column that was marked with another
compression method, but wasn't able to break it.

In CompareCompressionMethodAndDecompress, I think this is still
playing a bit fast and loose with the rules around slots. I think we
can do better. Suppose that at the point where we discover that we
need to decompress at least one attribute, we create the new slot
right then, and also memcpy tts_values and tts_isnull. Then, for that
attribute and any future attributes that need decompression, we reset
tts_values in the *new* slot, leaving the old one untouched. Then,
after finishing all the attributes, the if (decompressed_any) block,
you just have a lot less stuff to do. The advantage of this is that
you haven't tainted the old slot; it's still got whatever contents it
had before, and is in a clean state, which seems better to me.
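
A minimal sketch of that shape (the helper that decides whether an attribute
needs decompression is invented; the slot APIs are the ordinary core ones,
and the patch's actual function will of course look different):

#include "access/detoast.h"
#include "executor/tuptable.h"

/* Sketch: decompress into a fresh virtual slot, leaving the old slot untouched. */
static TupleTableSlot *
decompress_into_new_slot(TupleTableSlot *oldslot)
{
	TupleTableSlot *newslot = NULL;
	TupleDesc	tupdesc = oldslot->tts_tupleDescriptor;
	int			natts = tupdesc->natts;

	slot_getallattrs(oldslot);

	for (int i = 0; i < natts; i++)
	{
		if (oldslot->tts_isnull[i] ||
			!attr_needs_decompression(tupdesc, oldslot->tts_values[i]))	/* invented */
			continue;

		if (newslot == NULL)
		{
			/* First attribute needing work: set up the new slot once. */
			newslot = MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual);
			memcpy(newslot->tts_values, oldslot->tts_values, natts * sizeof(Datum));
			memcpy(newslot->tts_isnull, oldslot->tts_isnull, natts * sizeof(bool));
		}

		/* Overwrite only in the new slot; the old slot stays clean. */
		newslot->tts_values[i] = PointerGetDatum(
			detoast_attr((struct varlena *) DatumGetPointer(oldslot->tts_values[i])));
	}

	if (newslot == NULL)
		return oldslot;			/* nothing needed decompression */

	ExecStoreVirtualTuple(newslot);
	return newslot;
}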

It's unclear to me whether this function actually needs to
ExecMaterializeSlot(newslot). It definitely does need to
ExecStoreVirtualTuple(newslot) and I think it's a very good idea, if
not absolutely mandatory, for it not to modify anything about the old
slot. But what's the argument that the new slot needs to be
materialized at this point? It may be needed, if the old slot would've
had to be materialized at this point. But it's something to think
about.

The CREATE TABLE documentation says that COMPRESSION is a kind of
column constraint, but that's wrong. For example, you can't write
CREATE TABLE a (b int4 CONSTRAINT thunk COMPRESSION lz4),
contrary to what the syntax summary implies. When you fix this so that
the documentation matches the grammar change, you may also need to
move the longer description further up in create_table.sgml so the
order matches.

The use of VARHDRSZ_COMPRESS in toast_get_compression_oid() appears to
be incorrect. VARHDRSZ_COMPRESS is offsetof(varattrib_4b,
va_compressed.va_data). But what gets externalized in the case of a
compressed datum is just VARDATA(dval), which excludes the length
word, unlike VARHDRSZ_COMPRESS, which does not. This has no
consequences since we're only going to fetch 1 chunk either way, but I
think we should make it correct.

TOAST_COMPRESS_SET_SIZE_AND_METHOD() could Assert something about cm_method.

Small delta patch with a few other suggested changes attached.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


fixups.patch
Description: Binary data


Re: WIP: WAL prefetch (another approach)

2021-02-10 Thread Stephen Frost
Greetings,

* Thomas Munro (thomas.mu...@gmail.com) wrote:
> Rebase attached.

> Subject: [PATCH v15 4/6] Prefetch referenced blocks during recovery.
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index 4b60382778..ac27392053 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -3366,6 +3366,64 @@ include_dir 'conf.d'
[...]
> +  xreflabel="recovery_prefetch_fpw">
> +  recovery_prefetch_fpw (boolean)
> +  
> +   recovery_prefetch_fpw configuration 
> parameter
> +  
> +  
> +  
> +   
> +Whether to prefetch blocks that were logged with full page images,
> +during recovery.  Often this doesn't help, since such blocks will not
> +be read the first time they are needed and might remain in the buffer

The "might" above seems slightly confusing- such blocks will remain in
shared buffers until/unless they're forced out, right?

> +pool after that.  However, on file systems with a block size larger
> +than
> +PostgreSQL's, prefetching can avoid a
> +costly read-before-write when a blocks are later written.
> +The default is off.

"when a blocks" above doesn't sound quite right, maybe reword this as:

"prefetching can avoid a costly read-before-write when WAL replay
reaches the block that needs to be written."

> diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
> index d1c3893b14..c51c431398 100644
> --- a/doc/src/sgml/wal.sgml
> +++ b/doc/src/sgml/wal.sgml
> @@ -720,6 +720,23 @@
> WAL call being logged to the server log. This
> option might be replaced by a more general mechanism in the future.
>
> +
> +  
> +   The  parameter can
> +   be used to improve I/O performance during recovery by instructing
> +   PostgreSQL to initiate reads
> +   of disk blocks that will soon be needed but are not currently in
> +   PostgreSQL's buffer pool.
> +   The  and
> +settings limit prefetching
> +   concurrency and distance, respectively.  The
> +   prefetching mechanism is most likely to be effective on systems
> +   with full_page_writes set to
> +   off (where that is safe), and where the working
> +   set is larger than RAM.  By default, prefetching in recovery is enabled
> +   on operating systems that have posix_fadvise
> +   support.
> +  
>   



> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c

> @@ -3697,7 +3699,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
>   snprintf(activitymsg, sizeof(activitymsg), "waiting for 
> %s",
>xlogfname);
>   set_ps_display(activitymsg);
> -
>   restoredFromArchive = RestoreArchivedFile(path, 
> xlogfname,
>   
>   "RECOVERYXLOG",
>   
>   wal_segment_size,

> @@ -12566,6 +12585,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool 
> randAccess,
>   else
>   havedata = false;
>   }
> +
>   if (havedata)
>   {
>   /*

Random whitespace change hunks..?

> diff --git a/src/backend/access/transam/xlogprefetch.c 
> b/src/backend/access/transam/xlogprefetch.c

> +  * The size of the queue is based on the maintenance_io_concurrency
> +  * setting.  In theory we might have a separate queue for each 
> tablespace,
> +  * but it's not clear how that should work, so for now we'll just use 
> the
> +  * general GUC to rate-limit all prefetching.  The queue has space for 
> up
> +  * the highest possible value of the GUC + 1, because our circular 
> buffer
> +  * has a gap between head and tail when full.

Seems like "to" is missing- "The queue has space for up *to* the highest
possible value of the GUC + 1" ?  Maybe also "between the head and the
tail when full".

> +/*
> + * Scan the current record for block references, and consider prefetching.
> + *
> + * Return true if we processed the current record to completion and still 
> have
> + * queue space to process a new record, and false if we saturated the I/O
> + * queue and need to wait for recovery to advance before we continue.
> + */
> +static bool
> +XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
> +{
> + DecodedXLogRecord *record = prefetcher->record;
> +
> + Assert(!XLogPrefetcherSaturated(prefetcher));
> +
> + /*
> +  * We might already have been partway through processing this record 
> when
> +  * our queue became saturated, so we need to start where we left off.
> +  */
> + for (int block_id = prefetcher->next_block_

Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Jan Wieck
On Wed, Feb 10, 2021 at 11:43 AM Robert Haas  wrote:

> On Mon, Jan 25, 2021 at 10:07 AM Jan Wieck  wrote:
> > Our current plan is to create a new set of API calls and hooks that
> allow to register additional wire protocols. The existing backend libpq
> implementation will be modified to register itself using the new API. This
> will serve as a proof of concept as well as ensure that the API definition
> is not slanted towards a specific protocol. It is also similar to the way
> table access methods and compression methods are added.
>
> If we're going to end up with an open source implementation of
> something useful in contrib or whatever, then I think this is fine.
> But, if not, then we're just making it easier for Amazon to do
> proprietary stuff without getting any benefit for the open-source
> project. In fact, in that case PostgreSQL would have to somehow
> ensure that the hooks don't get broken without having any code that
> actually uses them, so not only would the project get no benefit, but
> it would actually incur a small tax. I wouldn't say that's an
> absolutely show-stopper, but it definitely isn't my first choice.
>

At this very moment there are several parts to this. One is the hooks to
make wire protocols into loadable modules, which is what this effort is
about. Another is the TDS protocol as it is being implemented for Babelfish
and third is the Babelfish extension itself. Both will require additional
hooks and APIs I am not going to address here. I consider them not material
to my effort.

As for making the wire protocol itself expandable I really see a lot of
potential outside of what Amazon wants here. And I would not be advertising
it if it would be for Babelfish alone. As I laid out, just the ability for
a third party to add additional messages for special connection pool
support would be enough to make it useful. There also have been discussions
in the JDBC subproject to combine certain messages into one single message.
Why not allow the JDBC project to develop their own, JDBC-optimized backend
side? Last but not least, what would be wrong with listening for MariaDB
clients?

I am planning on a follow up project to this, demoting libpq itself to just
another loadable protocol. Just the way procedural languages are all on the
same level because that is how I developed the loadable, procedural
language handler all those years ago.

Considering how spread out and quite frankly unorganized our wire protocol
handling is, this is not a small order.
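
To make the shape of that registration concrete, here is a purely
hypothetical sketch; none of these names exist in PostgreSQL today, and the
actual proposal may carve up the callbacks quite differently. The point is
only that a loadable module fills in a callback table and registers it, the
same general shape as a table access method:

/* All names below are invented for illustration. */
typedef struct WireProtocolRoutine
{
	/* perform the protocol's connection startup / authentication handshake */
	void		(*startup) (Port *port);

	/* read one frontend message into inBuf and return its message type */
	int			(*read_command) (StringInfo inBuf);

	/* emit errors and ready-for-query in the protocol's own format */
	void		(*send_error) (ErrorData *edata);
	void		(*send_ready_for_query) (CommandDest dest);

	/* connection shutdown */
	void		(*shutdown) (int code);
} WireProtocolRoutine;

/* A module's _PG_init() would register its implementation under a name. */
extern void RegisterWireProtocol(const char *name,
								 const WireProtocolRoutine *routine);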


Regards, Jan








>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


-- 
Jan Wieck


Re: [HACKERS] Custom compression methods

2021-02-10 Thread Robert Haas
On Wed, Feb 10, 2021 at 9:52 AM Dilip Kumar  wrote:
> [ new patches ]

I think that in both varattrib_4b and toast_internals.h it would be
better to pick a less generic field name. In toast_internals.h it's
just info; in postgres.h it's va_info. But:

[rhaas pgsql]$ git grep info | wc -l
   24552

There are no references in the current source tree to va_info, so at
least that one is greppable, but it's still not very descriptive. I
suggest info -> tcinfo and va_info -> va_tcinfo, where "tc" stands for
"TOAST compression". Looking through 24552 references to info to find
the ones that pertain to this feature might take longer than searching
the somewhat shorter list of references to tcinfo, which prepatch is
just:

[rhaas pgsql]$ git grep tcinfo | wc -l
   0

I don't see why we should allow for datum_decompress to be optional,
as toast_decompress_datum_slice does. Likely every serious compression
method will support that anyway. If not, the compression AM can deal
with the problem, rather than having the core code do it. That will
save some tiny amount of performance, too.

src/backend/access/compression/Makefile is missing a copyright header.

It's really sad that lz4_cmdecompress_slice allocates
VARRAWSIZE_4B_C(value) + VARHDRSZ rather than slicelength + VARHDRSZ
as pglz_cmdecompress_slice() does. Is that a mistake, or is that
necessary for some reason? If it's a mistake, let's fix it. If it's
necessary, let's add a comment about why, probably starting with
"Unfortunately, ".

I think you have a fairly big problem with row types. Consider this example:

create table t1 (a int, b text compression pglz);
create table t2 (a int, b text compression lz4);
create table t3 (x t1);
insert into t1 values (1, repeat('foo', 1000));
insert into t2 values (1, repeat('foo', 1000));
insert into t3 select t1 from t1;
insert into t3 select row(a, b)::t1 from t2;

rhaas=# select pg_column_compression((t3.x).b) from t3;
 pg_column_compression
---
 pglz
 lz4
(2 rows)

That's not good, because now

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: CLUSTER on partitioned index

2021-02-10 Thread Justin Pryzby
On Sat, Feb 06, 2021 at 08:45:49AM -0600, Justin Pryzby wrote:
> On Mon, Jan 18, 2021 at 12:34:59PM -0600, Justin Pryzby wrote:
> > On Sat, Nov 28, 2020 at 08:03:02PM -0600, Justin Pryzby wrote:
> > > On Sun, Nov 15, 2020 at 07:53:35PM -0600, Justin Pryzby wrote:
> > > > On Wed, Nov 04, 2020 at 08:23:56PM -0600, Justin Pryzby wrote:
> > > > > On Tue, Oct 27, 2020 at 07:33:12PM -0500, Justin Pryzby wrote:
> > > > > > I'm attaching a counter-proposal to your catalog change, which 
> > > > > > preserves
> > > > > > indisclustered on children of clustered, partitioned indexes, and 
> > > > > > invalidates
> > > > > > indisclustered when attaching unclustered indexes.
> > > > > 
> > > > > ..and now propagates CLUSTER ON to child indexes.
> > > > > 
> > > > > I left this as separate patches to show what I mean and what's new 
> > > > > while we
> > > > > discuss it.
> > > > 
> > > > This fixes some omissions in the previous patch and error in its test 
> > > > cases.
> > > > 
> > > > CLUSTER ON recurses to children, since I think a clustered parent index 
> > > > means
> > > > that all its child indexes are clustered.  "SET WITHOUT CLUSTER" 
> > > > doesn't have
> > > > to recurse to children, but I did it like that for consistency and it 
> > > > avoids
> > > > the need to special case InvalidOid.
> > > 
> > > The previous patch failed pg_upgrade when restoring a clustered, parent 
> > > index,
> > > since it's marked INVALID until indexes have been built on all child 
> > > tables, so
> > > CLUSTER ON was rejected on invalid index.
> > > 
> > > So I think CLUSTER ON needs to be a separate pg_dump object, to allow 
> > > attaching
> > > the child index (thereby making the parent "valid") to happen before SET
> > > CLUSTER on the parent index.
> > 
> > Rebased on b5913f612 and now a3dc92600.
> 
> This resolves ORDER BY test failure with COLLATE "C".

It occurred to me that progress reporting should expose this.

I did this in the style of pg_stat_progress_create_index, adding columns
partitions_total and partitions_done showing the overall progress. The progress
of individual partitions is also visible in {blocks,tuples}_{done,total}.
This seems odd, but that's how the index view behaves.

-- 
Justin
>From 1614cfa411dabd74faa64dc805f52405e2572239 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 26 Nov 2020 14:37:08 -0600
Subject: [PATCH v8 1/8] pg_dump: make CLUSTER ON a separate dump object..

..since it needs to be restored after any child indexes are restored *and
attached*.  The order needs to be:

1) restore child and parent index (order doesn't matter);
2) attach child index;
3) set cluster on child and parent index (order doesn't matter);
---
 src/bin/pg_dump/pg_dump.c  | 86 ++
 src/bin/pg_dump/pg_dump.h  |  8 
 src/bin/pg_dump/pg_dump_sort.c |  8 
 3 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7eb4..e93d2eb828 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -208,6 +208,7 @@ static void dumpSequence(Archive *fout, const TableInfo *tbinfo);
 static void dumpSequenceData(Archive *fout, const TableDataInfo *tdinfo);
 static void dumpIndex(Archive *fout, const IndxInfo *indxinfo);
 static void dumpIndexAttach(Archive *fout, const IndexAttachInfo *attachinfo);
+static void dumpIndexClusterOn(Archive *fout, const IndexClusterInfo *clusterinfo);
 static void dumpStatisticsExt(Archive *fout, const StatsExtInfo *statsextinfo);
 static void dumpConstraint(Archive *fout, const ConstraintInfo *coninfo);
 static void dumpTableConstraintComment(Archive *fout, const ConstraintInfo *coninfo);
@@ -7092,6 +7093,11 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 i_inddependcollversions;
 	int			ntups;
 
+	int	ncluster = 0;
+	IndexClusterInfo *clusterinfo;
+	clusterinfo = (IndexClusterInfo *)
+		pg_malloc0(numTables * sizeof(IndexClusterInfo));
+
 	for (i = 0; i < numTables; i++)
 	{
 		TableInfo  *tbinfo = &tblinfo[i];
@@ -7471,6 +7477,27 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 /* Plain secondary index */
 indxinfo[j].indexconstraint = 0;
 			}
+
+			/* Record each table's CLUSTERed index, if any */
+			if (indxinfo[j].indisclustered)
+			{
+IndxInfo   *index = &indxinfo[j];
+IndexClusterInfo *cluster = &clusterinfo[ncluster];
+
+cluster->dobj.objType = DO_INDEX_CLUSTER_ON;
+cluster->dobj.catId.tableoid = 0;
+cluster->dobj.catId.oid = 0;
+AssignDumpId(&cluster->dobj);
+cluster->dobj.name = pg_strdup(index->dobj.name);
+cluster->dobj.namespace = index->indextable->dobj.namespace;
+cluster->index = index;
+cluster->indextable = &tblinfo[i];
+
+/* The CLUSTER ON depends on its index.. */
+addObjectDependency(&cluster->dobj, index->dobj.dumpId);
+
+ncluster++;
+			}
 		}
 
 		PQclear(res);
@@ -10323,6 +10350,9 @@ dumpDumpableObject(Archive *fout, const Dumpab

Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Jonah H. Harris
On Wed, Feb 10, 2021 at 2:04 PM Tom Lane  wrote:

> That is a spot-on definition of where I do NOT want to end up.  Hooks
> everywhere and enormous extensions that break anytime we change anything
> in the core.  It's not really clear that anybody is going to find that
> more maintainable than a straight fork, except to the extent that it
> enables the erstwhile forkers to shove some of their work onto the PG
> community.
>

Given the work over the last few major releases to make several other
aspects of Postgres pluggable, how is implementing a pluggable protocol API
any different?

To me, this sounds more like a philosophical disagreement with how people
could potentially use Postgres than a technical one. My point is only that,
using current PG functionality, I could equally write a pluggable storage
interface for my Oracle and InnoDB data file readers/writers, which would
similarly allow for the creation of a Postgres franken-Oracle by extension
only.

I don't think anyone is asking for hooks for all the things I mentioned - a
pluggable transaction manager, for example, doesn't make much sense. But,
when it comes to having actually done this vs. posited about its
usefulness, I'd say it has some merit and doesn't really introduce that
much complexity or maintenance overhead to core - whether the extensions
still work properly is up to the extension authors... isn't that the whole
point of extensions?

-- 
Jonah H. Harris


Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Tom Lane
"Jonah H. Harris"  writes:
> On Wed, Feb 10, 2021 at 1:10 PM Tom Lane  wrote:
>> ...  If we start having
>> modes for MySQL identifier quoting, Oracle outer join syntax, yadda
>> yadda, it's going to be way more of a maintenance nightmare than some
>> hook functions.  So if we accept any patch along this line, I want to
>> drive a hard stake in the ground that the answer to that sort of thing
>> will be NO.

> Actually, a substantial amount can be done with hooks. For Oracle, which is
> substantially harder than MySQL, I have a completely separate parser that
> generates a PG-compatible parse tree packaged up as an extension. To handle
> autonomous transactions, database links, hierarchical query conversion,
> hints, and some execution-related items requires core changes.

That is a spot-on definition of where I do NOT want to end up.  Hooks
everywhere and enormous extensions that break anytime we change anything
in the core.  It's not really clear that anybody is going to find that
more maintainable than a straight fork, except to the extent that it
enables the erstwhile forkers to shove some of their work onto the PG
community.

My feeling about this is if you want to use Oracle, go use Oracle.
Don't ask PG to take on a ton of maintenance issues so you can have
a frankenOracle.

regards, tom lane




Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Jonah H. Harris
On Wed, Feb 10, 2021 at 1:10 PM Tom Lane  wrote:

> What I'm actually more concerned about, in this whole line of development,
> is the follow-on requests that will surely occur to kluge up Postgres
> to make its behavior more like $whatever.  As in "well, now that we
> can serve MySQL clients protocol-wise, can't we pretty please have a
> mode that makes the parser act more like MySQL".  If we start having
> modes for MySQL identifier quoting, Oracle outer join syntax, yadda
> yadda, it's going to be way more of a maintenance nightmare than some
> hook functions.  So if we accept any patch along this line, I want to
> drive a hard stake in the ground that the answer to that sort of thing
> will be NO.
>

Actually, a substantial amount can be done with hooks. For Oracle, which is
substantially harder than MySQL, I have a completely separate parser that
generates a PG-compatible parse tree packaged up as an extension. To handle
autonomous transactions, database links, hierarchical query conversion,
hints, and some execution-related items requires core changes. But, the
protocol and parsing can definitely be done with hooks. And, as was
mentioned previously, this isn't tied directly to emulating another
database - it would enable us to support an HTTP-ish interface directly in
the server as an extension as well. A lot of this can be done with
background worker extensions now, which is how my stuff was primarily
architected, but it's hacky when it comes to areas where the items Jan
discussed could clean things up and make them more pluggable.

Assuming we're going to keep to that, though, it seems like people
> doing this sort of thing will inevitably end up with a fork anyway.
> So maybe we should just not bother with the first step either.
>

Perhaps I'm misunderstanding you, but I wouldn't throw this entire idea out
(which enables a substantial addition of extensible functionality with a
limited set of touchpoints) on the premise of future objections.

-- 
Jonah H. Harris


Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-02-10 Thread Joel Jacobson
Hi Mark,

On Mon, Feb 8, 2021, at 09:40, Mark Rofail wrote:
>Attachments:
>fk_arrays_elem_v1.patch
>anyarray_anyelement_operators-v1.patch

Nice work!

I have successfully tested both patches against 
e7f42914854926c2afbb89b9cd0e381fd90766be
by cloning all pg_catalog tables, and adding foreign keys
on all columns, including array columns of course.

Here is what e.g. pg_constraint which has quite a few array oid columns looks 
like with foreign keys:

joel=# \d catalog_clone.pg_constraint
             Table "catalog_clone.pg_constraint"
  Column   | Type  | Collation | Nullable | Default
-----------+-------+-----------+----------+---------
 oid       | jsonb |           | not null |
 conname   | name  |           |          |
 .
 .
 .
Foreign-key constraints:
    "pg_constraint_conexclop_fkey" FOREIGN KEY (EACH ELEMENT OF conexclop) REFERENCES catalog_clone.pg_operator(oid)
    "pg_constraint_conffeqop_fkey" FOREIGN KEY (EACH ELEMENT OF conffeqop) REFERENCES catalog_clone.pg_operator(oid)
    "pg_constraint_conpfeqop_fkey" FOREIGN KEY (EACH ELEMENT OF conpfeqop) REFERENCES catalog_clone.pg_operator(oid)
    "pg_constraint_conppeqop_fkey" FOREIGN KEY (EACH ELEMENT OF conppeqop) REFERENCES catalog_clone.pg_operator(oid)
    "pg_constraint_conrelid_conkey_fkey" FOREIGN KEY (conrelid, EACH ELEMENT OF conkey) REFERENCES catalog_clone.pg_attribute(attrelid, attnum)

Here is my test function that adds foreign keys on catalog tables:

https://github.com/truthly/pg-pit/blob/master/FUNCTIONS/test_referential_integrity.sql

If you want to try it yourself, it is run as part of pit's test suite:

$ git clone https://github.com/truthly/pg-pit.git
$ cd pg-pit
$ make
$ make install
$ make installcheck

== running regression test queries ==
test referential_integrity... ok 1925 ms

/Joel

Re: Online checksums patch - once again

2021-02-10 Thread Heikki Linnakangas

On 10/02/2021 16:25, Magnus Hagander wrote:

On Tue, Feb 9, 2021 at 9:54 AM Heikki Linnakangas  wrote:


(I may have said this before, but) My overall high-level impression of
this patch is that it's really complex for a feature that you use maybe
once in the lifetime of a cluster. I'm happy to review but I'm not
planning to commit this myself. I don't object if some other committer
picks this up (Magnus?).


A fairly large amount of this complexity comes out of the fact that it
now supports restarting and tracks checksums on a per-table basis. We
skipped this in the original patch for exactly this reason (that's not
to say there isn't a fair amount of complexity even without it, but it
did substantially increase both the size and the complexity of the
patch), but in the review of that I was specifically asked for having
that added. I personally don't think it's worth that complexity but at
the time that seemed to be a pretty strong argument. So I'm not
entirely sure how to move forward with that...

Is your impression that it would still be too complicated, even without that?


I'm not sure. It would certainly be a lot better.

Wrt. restartability, I'm also not very happy with the way that works - 
or rather doesn't :-) - in this patch. After shutting down and 
restarting the cluster, you have to manually call 
pg_enable_data_checksums() again to restart the checksumming process.


- Heikki




Re: Online checksums patch - once again

2021-02-10 Thread Bruce Momjian
On Wed, Feb 10, 2021 at 03:25:58PM +0100, Magnus Hagander wrote:
> On Tue, Feb 9, 2021 at 9:54 AM Heikki Linnakangas  wrote:
> >
> > (I may have said this before, but) My overall high-level impression of
> > this patch is that it's really complex for a feature that you use maybe
> > once in the lifetime of a cluster. I'm happy to review but I'm not
> > planning to commit this myself. I don't object if some other committer
> > picks this up (Magnus?).
> 
> A fairly large amount of this complexity comes out of the fact that it
> now supports restarting and tracks checksums on a per-table basis. We
> skipped this in the original patch for exactly this reason (that's not
> to say there isn't a fair amount of complexity even without it, but it
> did substantially increase both the size and the complexity of the
> patch), but in the review of that I was specifically asked for having
> that added. I personally don't think it's worth that complexity but at
> the time that seemed to be a pretty strong argument. So I'm not
> entirely sure how to move forward with that...
> 
> Is your impression that it would still be too complicated, even without that?

I was wondering why this feature has stalled for so long --- now I know.
This does highlight the risk of implementing too many additions to a
feature. I am working against this dynamic in the cluster file
encryption feature I am working on.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Tom Lane
Robert Haas  writes:
> If we're going to end up with an open source implementation of
> something useful in contrib or whatever, then I think this is fine.
> But, if not, then we're just making it easier for Amazon to do
> proprietary stuff without getting any benefit for the open-source
> project. In fact, in that case PostgreSQL would have to somehow
> ensure that the hooks don't get broken without having any code that
> actually uses them, so not only would the project get no benefit, but
> it would actually incur a small tax. I wouldn't say that's an
> absolutely show-stopper, but it definitely isn't my first choice.

As others noted, a test module could be built to add some coverage here.

What I'm actually more concerned about, in this whole line of development,
is the follow-on requests that will surely occur to kluge up Postgres
to make its behavior more like $whatever.  As in "well, now that we
can serve MySQL clients protocol-wise, can't we pretty please have a
mode that makes the parser act more like MySQL".  If we start having
modes for MySQL identifier quoting, Oracle outer join syntax, yadda
yadda, it's going to be way more of a maintenance nightmare than some
hook functions.  So if we accept any patch along this line, I want to
drive a hard stake in the ground that the answer to that sort of thing
will be NO.

Assuming we're going to keep to that, though, it seems like people
doing this sort of thing will inevitably end up with a fork anyway.
So maybe we should just not bother with the first step either.

regards, tom lane




Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Jonah H. Harris
On Wed, Feb 10, 2021 at 11:43 AM Robert Haas  wrote:

> On Mon, Jan 25, 2021 at 10:07 AM Jan Wieck  wrote:
> > Our current plan is to create a new set of API calls and hooks that
> allow to register additional wire protocols. The existing backend libpq
> implementation will be modified to register itself using the new API. This
> will serve as a proof of concept as well as ensure that the API definition
> is not slanted towards a specific protocol. It is also similar to the way
> table access methods and compression methods are added.
>
> If we're going to end up with an open source implementation of
> something useful in contrib or whatever, then I think this is fine.
> But, if not, then we're just making it easier for Amazon to do
> proprietary stuff without getting any benefit for the open-source
> project. In fact, in that case PostgreSQL would have to somehow
> ensure that the hooks don't get broken without having any code that
> actually uses them, so not only would the project get no benefit, but
> it would actually incur a small tax. I wouldn't say that's an
> absolutely show-stopper, but it definitely isn't my first choice.
>

Agreed on adding substantial hooks if they're not likely to be used. While
I haven't yet seen AWS' implementation or concrete proposal, given the
people involved, I assume it's fairly similar to how I implemented it.
Assuming that's correct and it doesn't require substantial redevelopment,
I'd certainly open-source my MySQL-compatible protocol and parser
implementation. From my perspective, it would be awesome if these could be
done as extensions.

While I'm not planning to open source it as of yet, for my
Oracle-compatible stuff, I don't think I'd be able to do anything other
than the protocol as an extension given the core-related changes similar to
what EDB has to do. I don't think there's any easy way to get around that.
But, for the protocol and any type of simple translation to Postgres'
dialect, I think that could easily be hook-based.

-- 
Jonah H. Harris


Re: TRUNCATE on foreign table

2021-02-10 Thread Kazutaka Onishi
That's because using the foreign server is difficult for the user.

For example, the user doesn't always have the permission to login to
the forein server.
In some cases, the foreign table has been created by an administrator who
has permission to access the two servers, and the user only uses the local
server.
Then the user has to ask the administrator to run TRUNCATE every time.

Furthermore, there are some fdw extensions which don't support SQL.
mongo_fdw, redis_fdw, etc...
These extensions have been used to provide SQL interfaces to the users.
It's hard for the user to run TRUNCATE after learning each database.

Anyway, it's more useful if the user can run queries in one place, right?
Do you have any concerns?


Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Fabrízio de Royes Mello
On Wed, Feb 10, 2021 at 1:43 PM Robert Haas  wrote:
>
> On Mon, Jan 25, 2021 at 10:07 AM Jan Wieck  wrote:
> > Our current plan is to create a new set of API calls and hooks that
allow to register additional wire protocols. The existing backend libpq
implementation will be modified to register itself using the new API. This
will serve as a proof of concept as well as ensure that the API definition
is not slanted towards a specific protocol. It is also similar to the way
table access methods and compression methods are added.
>
> If we're going to end up with an open source implementation of
> something useful in contrib or whatever, then I think this is fine.
> But, if not, then we're just making it easier for Amazon to do
> proprietary stuff without getting any benefit for the open-source
> project. In fact, in that case PostgreSQL would have to somehow
> ensure that the hooks don't get broken without having any code that
> actually uses them, so not only would the project get no benefit, but
> it would actually incur a small tax. I wouldn't say that's an
> absolutely show-stopper, but it definitely isn't my first choice.

As far as I understood, Jan's proposal is to add enough hooks to PostgreSQL to
enable us to extend the wire protocol, and to add a contrib module as an
example (maybe TDS, HTTP, or just adding new capabilities to the current
implementation).

Regards,

-- 
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com


Re: Extensibility of the PostgreSQL wire protocol

2021-02-10 Thread Robert Haas
On Mon, Jan 25, 2021 at 10:07 AM Jan Wieck  wrote:
> Our current plan is to create a new set of API calls and hooks that allow to 
> register additional wire protocols. The existing backend libpq implementation 
> will be modified to register itself using the new API. This will serve as a 
> proof of concept as well as ensure that the API definition is not slanted 
> towards a specific protocol. It is also similar to the way table access 
> methods and compression methods are added.

If we're going to end up with an open source implementation of
something useful in contrib or whatever, then I think this is fine.
But, if not, then we're just making it easier for Amazon to do
proprietary stuff without getting any benefit for the open-source
project. In fact, in that case PostgreSQL would have to somehow
ensure that the hooks don't get broken without having any code that
actually uses them, so not only would the project get no benefit, but
it would actually incur a small tax. I wouldn't say that's an
absolutely show-stopper, but it definitely isn't my first choice.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: 2021-02-11 release announcement draft

2021-02-10 Thread Magnus Hagander
On Wed, Feb 10, 2021 at 4:36 PM Michael Banck  wrote:
>
> Hi,
>
> On Wed, Feb 10, 2021 at 10:15:12AM -0500, Jonathan S. Katz wrote:
> > Please see updated draft.
>
> What about the CVEs, it's my understanding that two security issues have
> been fixed; shouldn't they be mentioned as well? Or are those scheduled
> to be merged into the announcement at the last minute?

Any potential security announcements are handled independently "out of band".

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: 2021-02-11 release announcement draft

2021-02-10 Thread Jonathan S. Katz
On 2/10/21 10:19 AM, Chapman Flack wrote:
> On 02/10/21 10:15, Jonathan S. Katz wrote:
>> On 2/8/21 6:11 PM, Noah Misch wrote:
>>> On Mon, Feb 08, 2021 at 05:40:41PM -0500, Jonathan S. Katz wrote:
 Some of these issues only affect version 13, but may also apply to other
 supported versions.
>>>
>>> Did you want s/may/many/?
>>
>> Nope. The bugs may also apply to other versions. Maybe to be clearer
>> I'll /may/could/?
> 
> If that's what was intended, shouldn't it be "but others may also apply
> to other supported versions"? ^^
> 
> Surely the ones that "only affect version 13" do not affect other versions,
> not even on a 'may' or 'could' basis.

The main goals of the release announcement are to:

* Let people know there are update releases for supported versions that
fix bugs.
* Provide a glimpse at what is fixed so the reader can determine their
level of urgency around updating.
* Direct people on where to download and find out more specifics about
the releases.

I appreciate the suggestions on this sentence, but I don't think the
desired goals hinges on this one word.

Thanks,

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


Re: 2021-02-11 release announcement draft

2021-02-10 Thread Michael Banck
Hi,

On Wed, Feb 10, 2021 at 10:15:12AM -0500, Jonathan S. Katz wrote:
> Please see updated draft.

What about the CVEs, it's my understanding that two security issues have
been fixed; shouldn't they be mentioned as well? Or are those scheduled
to be merged into the announcement at the last minute?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following
provisions: https://www.credativ.de/datenschutz




Re: 2021-02-11 release announcement draft

2021-02-10 Thread Chapman Flack
On 02/10/21 10:15, Jonathan S. Katz wrote:
> On 2/8/21 6:11 PM, Noah Misch wrote:
>> On Mon, Feb 08, 2021 at 05:40:41PM -0500, Jonathan S. Katz wrote:
>>> Some of these issues only affect version 13, but may also apply to other
>>> supported versions.
>>
>> Did you want s/may/many/?
> 
> Nope. The bugs may also apply to other versions. Maybe to be clearer
> I'll /may/could/?

If that's what was intended, shouldn't it be "but others may also apply
to other supported versions"? ^^

Surely the ones that "only affect version 13" do not affect other versions,
not even on a 'may' or 'could' basis.

Regards,
-Chap




Re: 2021-02-11 release announcement draft

2021-02-10 Thread Jonathan S. Katz
On 2/8/21 6:11 PM, Noah Misch wrote:
> On Mon, Feb 08, 2021 at 05:40:41PM -0500, Jonathan S. Katz wrote:
>> This update also fixes over 80 bugs that were reported in the last several
>> months. Some of these issues only affect version 13, but may also apply to 
>> other
>> supported versions.
> 
> Did you want s/may/many/?

Nope. The bugs may also apply to other versions. Maybe to be clearer
I'll /may/could/?

I made that change.

> 
>> * Fix an issue with GiST indexes where concurrent insertions could lead to a
>> corrupt index with entries placed in the wrong pages. You should `REINDEX` 
>> any
>> affected GiST indexes.
> 
> For what it's worth, there's little way for a user to confirm whether an index
> is affected.  (If you've never had more than one connection changing the table
> at a time, the table's indexes would be unaffected.)

So Peter Geoghegan and I had a roughly 30 minute back and forth just on
this point. Based on our discussion, we felt it best to go with this
statement.

I think this one is tough to give guidance around, but I don't think a
blanket "anyone who has had concurrent writes to a GiST index should
reindex" is the right answer.

>> * Fix `CREATE INDEX CONCURRENTLY` to ensure rows from concurrent prepared
>> transactions are included in the index.
> 
> Consider adding a sentence like "Installations that have enabled prepared
> transactions should `REINDEX` any concurrently-built indexes."  The release
> notes say:
> 
> +  In installations that have enabled prepared transactions
> +  (max_prepared_transactions > 0),
> +  it's recommended to reindex any concurrently-built indexes in
> +  case this problem occurred when they were built.

Oops, I must have missed that in my first build of the release notes (or
I just plain missed it). I agree with that.

>> * Fix a failure when a PL/pgSQL procedure used `CALL` on another procedure 
>> that
>> has `OUT` parameters did not call execute a `COMMIT` or `ROLLBACK`.
> 
> The release notes say the failure happened when the callee _did_ execute a
> COMMIT or ROLLBACK:
> 
> + 
> +  A CALL in a PL/pgSQL procedure, to another
> +  procedure that has OUT parameters, would fail if the called
> +  procedure did a COMMIT
> +  or ROLLBACK.
> + 

Oops, good catch. Fixed.

>> For more details, please see the
>> [release notes](https://www.postgresql.org/docs/current/release.html).
> 
> I recommend pointing this to https://www.postgresql.org/docs/release/, since
> the above link now contains only v13 notes.

WFM.

Please see updated draft.

Thanks,

Jonathan

The PostgreSQL Global Development Group has released an update to all supported
versions of our database system, including 13.2, 12.6, 11.11, 10.16, 9.6.21, and
9.5.25. This release fixes over 80 bugs reported over the last three
months.

Additionally, this is the **final release** of PostgreSQL 9.5. If you are
running PostgreSQL 9.5 in a production environment, we suggest that you make
plans to upgrade.

For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

Bug Fixes and Improvements
--

This update fixes over 80 bugs that were reported in the last several
months. Some of these issues only affect version 13, but could also apply to
other supported versions.

Some of these fixes include:

* Fix an issue with GiST indexes where concurrent insertions could lead to a
corrupt index with entries placed in the wrong pages. You should `REINDEX` any
affected GiST indexes.
* Fix `CREATE INDEX CONCURRENTLY` to ensure rows from concurrent prepared
transactions are included in the index. Installations that have enabled prepared
transactions should `REINDEX` any concurrently-built indexes.
* Fix for possible incorrect query results when a hash aggregation is spilled to
disk.
* Fix edge case in incremental sort that could lead to sorting results
incorrectly or a "retrieved too many tuples in a bounded sort" error.
* Avoid crash when a `CALL` or `DO` statement that performs a transaction
rollback is executed via extended query protocol, such as from prepared
statements.
* Fix a failure when a PL/pgSQL procedure used `CALL` on another procedure that
has `OUT` parameters and the called procedure executed a `COMMIT` or `ROLLBACK`.
* Remove errors from `BEFORE UPDATE` triggers on partitioned tables for
restrictions that no longer apply.
* Several fixes for queries with joins that could lead to error messages such as
"no relation entry for relid N" or "failed to build any N-way joins".
* Do not consider parallel-restricted or set-returning functions in
`ORDER BY` expressions when trying to parallelize sorts.
* Fix `ALTER DEFAULT PRIVILEGES` to handle duplicate arguments safely.
* Several fixes in behavior when `wal_level` is set to `minimal`, including when
tables are rewritten within a transaction.
* Several fixes for `CREATE TABLE LIKE`.
* Ensure that allocated disk space for a dropped relation (e.g. a table) is
rele

Re: 2021-02-11 release announcement draft

2021-02-10 Thread Jonathan S. Katz
On 2/8/21 11:30 PM, e...@xs4all.nl wrote:
>> On 02/08/2021 11:40 PM Jonathan S. Katz  wrote:
>>
>>  
>> Hi,
>>
>> Attached is a draft of the release announcement for the upcoming
>> 2021-02-11 cumulative update release. Please review for technical
> 
> 'closes fixes'  maybe better is:
> 'includes fixes' or 'closes bugs'
> 
> 
> 'also fixes over 80 bugs'
> Maybe drop the 'also'; those same 80 bugs have just been mentioned.

Thanks for the suggestions. I have included them in the updated draft
which I am posting to Noah's reply.

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


Re: Is Recovery actually paused?

2021-02-10 Thread Dilip Kumar
On Wed, Feb 10, 2021 at 10:02 AM Dilip Kumar  wrote:
>
> I don't find any problem with this approach as well, but I personally
> feel that the other approach where we don't wait in any API and just
> return the recovery pause state is much simpler and more flexible.  So
> I will make the pending changes in that patch and let's see what are
> the other opinion and based on that we can conclude.  Thanks for the
> patch.

Here is an updated version of the patch which fixes the last two open problems:
1. In RecoveryRequiresIntParameter, set the recovery pause state in the
loop so that if recovery is resumed and pause is requested again, we can
set it to pause again.
2. If the recovery state is already 'paused' then don't set it back to
'pause requested'.

One more point: in 'pg_wal_replay_pause' we call WakeupRecovery even if we
don't change the state because it was already set to 'paused'.  I don't
think there is any problem with that, but if we think this should be
changed, we can make SetRecoveryPause return a bool such that it returns
false when it doesn't change the state, and in that case we can avoid
calling WakeupRecovery.  I felt that is unnecessary.  Any other thoughts on
this?
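
For what it's worth, that alternative would look roughly like this (sketch
only; it follows the field names visible in the patch hunks, and the enum
member names are assumed):

/*
 * Sketch: report whether the shared pause state actually changed, so
 * pg_wal_replay_pause() could skip WakeupRecovery() when it didn't.
 */
static bool
SetRecoveryPause(bool recoveryPause)
{
	bool		changed = false;

	SpinLockAcquire(&XLogCtl->info_lck);

	if (!recoveryPause)
	{
		changed = (XLogCtl->recoveryPauseState != RECOVERY_NOT_PAUSED);
		XLogCtl->recoveryPauseState = RECOVERY_NOT_PAUSED;
	}
	else if (XLogCtl->recoveryPauseState == RECOVERY_NOT_PAUSED)
	{
		/* don't downgrade an already-'paused' state to 'pause requested' */
		XLogCtl->recoveryPauseState = RECOVERY_PAUSE_REQUESTED;
		changed = true;
	}

	SpinLockRelease(&XLogCtl->info_lck);

	return changed;
}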

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 4f7f3e11c3c6354225ddb80bb06b503e940cd8ff Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 27 Jan 2021 16:46:04 +0530
Subject: [PATCH v13] Provide a new interface to get the recovery pause status

Currently, pg_is_wal_replay_paused just checks whether a recovery
pause is requested, but it doesn't actually tell whether
recovery is actually paused.  So basically there is no way for
the user to know the actual status of the pause request.  This patch
provides a new interface, pg_get_wal_replay_pause_state, that will
return the actual status of the recovery pause, i.e. 'not paused' if
pause is not requested, 'pause requested' if pause is requested but
recovery is not yet paused, and 'paused' if recovery is actually paused.
---
 doc/src/sgml/func.sgml | 32 +++--
 src/backend/access/transam/xlog.c  | 86 +++---
 src/backend/access/transam/xlogfuncs.c | 50 ++--
 src/include/access/xlog.h  | 12 -
 src/include/catalog/pg_proc.dat|  4 ++
 5 files changed, 156 insertions(+), 28 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1ab31a9..9b2c429 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25285,7 +25285,24 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 boolean


-Returns true if recovery is paused.
+Returns true if recovery pause is requested.
+   
+  
+
+  
+   
+
+ pg_get_wal_replay_pause_state
+
+pg_get_wal_replay_pause_state ()
+text
+   
+   
+Returns recovery pause state.  The return values are 
+not paused if pause is not requested, 
+pause requested if pause is requested but recovery is
+not yet paused and, paused if the recovery is
+actually paused.

   
 
@@ -25324,10 +25341,15 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 void


-Pauses recovery.  While recovery is paused, no further database
-changes are applied.  If hot standby is active, all new queries will
-see the same consistent snapshot of the database, and no further query
-conflicts will be generated until recovery is resumed.
+Request to pause recovery.  A request doesn't mean that recovery stops
+right away.  If you want a guarantee that recovery is actually paused,
+you need to check for the recovery pause state returned by
+pg_get_wal_replay_pause_state().  Note that
+pg_is_wal_replay_paused() returns whether a request
+is made.  While recovery is paused, no further database changes are applied.
+If hot standby is active, all new queries will see the same consistent
+snapshot of the database, and no further query conflicts will be generated
+until recovery is resumed.


 This function is restricted to superusers by default, but other users
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e3b5df..459a19b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -721,8 +721,8 @@ typedef struct XLogCtlData
 	 * only relevant for replication or archive recovery
 	 */
 	TimestampTz currentChunkStartTime;
-	/* Are we requested to pause recovery? */
-	bool		recoveryPause;
+	/* Recovery pause state */
+	RecoveryPauseState	recoveryPauseState;
 
 	/*
 	 * lastFpwDisableRecPtr points to the start of the last replayed
@@ -894,6 +894,7 @@ static void validateRecoveryParameters(void);
 st

Re: repeated decoding of prepared transactions

2021-02-10 Thread Robert Haas
On Tue, Feb 9, 2021 at 9:32 PM Amit Kapila  wrote:
> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.
>
> OTOH, If by successfully confirmed, you mean that once the subscriber
> has applied the complete transaction (including commit), then you are
> right that it won't be sent again.

I meant - once the subscriber has advanced the flush location.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Online checksums patch - once again

2021-02-10 Thread Magnus Hagander
On Tue, Feb 9, 2021 at 9:54 AM Heikki Linnakangas  wrote:
>
> (I may have said this before, but) My overall high-level impression of
> this patch is that it's really complex for a feature that you use maybe
> once in the lifetime of a cluster. I'm happy to review but I'm not
> planning to commit this myself. I don't object if some other committer
> picks this up (Magnus?).

A fairly large amount of this complexity comes out of the fact that it
now supports restarting and tracks checksums on a per-table basis. We
skipped this in the original patch for exactly this reason (that's not
to say there isn't a fair amount of complexity even without it, but it
did substantially increase both the size and the complexity of the
patch), but in the review of that I was specifically asked for having
that added. I personally don't think it's worth that complexity but at
the time that seemed to be a pretty strong argument. So I'm not
entirely sure how to move forward with that...

Is your impression that it would still be too complicated, even without that?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-02-10 Thread Bharath Rupireddy
On Fri, Feb 5, 2021 at 6:51 PM japin  wrote:
> On Fri, 05 Feb 2021 at 17:50, Bharath Rupireddy 
>  wrote:
> We will get cell == NULL when we iterate all items in publist.  I use it
> to check whether the dropped publication is in publist or not.
>
> > If you
> > have a strong reasong retain this error errmsg("publication name
> > \"%s\" do not in subscription", then there's a typo
> > errmsg("publication name \"%s\" does not exists in subscription".
>
> Fixed.

I think we still have a typo in 0002, it's
+ errmsg("publication name \"%s\" does not exist
in subscription",
instead of
+ errmsg("publication name \"%s\" does not exists
in subscription",

IIUC, with the current patch, the new ALTER SUBSCRIPTION ... ADD/DROP
errors out on the first publication that already exists/doesn't exist,
right? What if there are multiple publications given in the ADD/DROP
list, and only a few of them exist/don't exist? Wouldn't it be better to
loop over the subscription's publication list and show all of the
already existing/not existing publications in the error message, instead
of just erroring out on the first one?
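
Something along these lines, say (sketch only; the membership check is an
invented helper, and the real patch may structure the loop differently):

/*
 * Sketch: gather every publication from the DROP list that is not in the
 * subscription's current publication list, then report them all at once.
 */
List	   *missing = NIL;
ListCell   *lc;

foreach(lc, droplist)
{
	char	   *pubname = strVal(lfirst(lc));

	if (!publication_name_in_list(subpublications, pubname))	/* invented */
		missing = lappend(missing, pubname);
}

if (missing != NIL)
{
	StringInfoData names;
	ListCell   *lc2;

	initStringInfo(&names);
	foreach(lc2, missing)
		appendStringInfo(&names, "%s\"%s\"",
						 foreach_current_index(lc2) == 0 ? "" : ", ",
						 (char *) lfirst(lc2));

	ereport(ERROR,
			(errcode(ERRCODE_UNDEFINED_OBJECT),
			 errmsg("publications %s do not exist in subscription",
					names.data)));
}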

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] logical decoding of two-phase transactions

2021-02-10 Thread Amit Kapila
On Wed, Feb 10, 2021 at 3:59 PM Peter Smith  wrote:
>
> PSA the new patch set v38*.
>
> This patch set has been rebased to use the most recent tablesync patch
> from other thread [1]
> (i.e. notice that v38-0001 is an exact copy of that thread's tablesync
> patch v31)
>

I see one problem which might lead to skipping prepared xacts for
some of the subscriptions. The problem is that we skip the prepared
xacts based on GID, and the same prepared transaction arrives on the
subscriber for different subscriptions. And even if we hadn't
skipped the prepared xact, it would have led to an error "transaction
identifier "p1" is already in use". See the scenario below:

On Publisher:
===
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT 0 1
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
INSERT 0 1
postgres=*# COMMIT;
COMMIT
postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
INSERT 0 1
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 2);
INSERT 0 1
postgres=*# Commit;
COMMIT
postgres=# CREATE PUBLICATION mypub1 FOR TABLE mytbl1;
CREATE PUBLICATION
postgres=# CREATE PUBLICATION mypub2 FOR TABLE mytbl2;
CREATE PUBLICATION

On Subscriber:

CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
postgres=# CREATE SUBSCRIPTION mysub1
postgres-#  CONNECTION 'host=localhost port=5432 dbname=postgres'
postgres-# PUBLICATION mypub1;
NOTICE:  created replication slot "mysub1" on publisher
CREATE SUBSCRIPTION
postgres=# CREATE SUBSCRIPTION mysub2
postgres-#  CONNECTION 'host=localhost port=5432 dbname=postgres'
postgres-# PUBLICATION mypub2;
NOTICE:  created replication slot "mysub2" on publisher
CREATE SUBSCRIPTION

On Publisher:

postgres=# Begin;
BEGIN
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT 0 1
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 3);
INSERT 0 1
postgres=*# Prepare Transaction 'myprep1';

After this step, wait for a few seconds and then perform Commit Prepared
'myprep1'; on the publisher, and you will notice the following error in the
subscriber log:  "ERROR:  prepared transaction with identifier
"myprep1" does not exist"

One idea to avoid this is to use the subscription OID along with the
GID on the subscriber side for prepared xacts. Let me know if you have
any better ideas to handle this.
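
For example, a minimal sketch of what I mean (the helper name and the exact
format are just assumptions, not from the patch): the apply worker would
derive a locally-unique identifier before preparing the transaction.

/*
 * Sketch only: build a subscription-local two-phase identifier so that the
 * same remote GID arriving via different subscriptions does not collide on
 * the subscriber.  "subid" is the subscription's OID and "gid" is the
 * identifier sent by the publisher.
 */
static void
TwoPhaseTransactionGid(Oid subid, const char *gid, char *gid_res, int szgid)
{
    snprintf(gid_res, szgid, "pg_gid_%u_%s", subid, gid);
}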

A few other minor comments on
v38-0004-Add-support-for-apply-at-prepare-time-to-built-i:
==
1.
- * Mark the prepared transaction as valid.  As soon as xact.c marks
- * MyProc as not running our XID (which it will do immediately after
- * this function returns), others can commit/rollback the xact.
+ * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+ * as not running our XID (which it will do immediately after this
+ * function returns), others can commit/rollback the xact.

Why this change in this patch? Is it due to pgindent? If so, you should
exclude this change.

2.
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,

  pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);

- /* send the flags field (unused for now) */
+ /* send the flags field */
  pq_sendbyte(out, flags);

Is there a reason to change the above comment?

-- 
With Regards,
Amit Kapila.




parse_slash_copy doesn't support psql variables substitution

2021-02-10 Thread Pavel Stehule
Hi

Is there some reason why the \copy statement (the parse_slash_copy parser)
doesn't support psql variable substitution?

Regards

Pavel


Re: Asynchronous Append on postgres_fdw nodes.

2021-02-10 Thread Etsuro Fujita
On Wed, Feb 10, 2021 at 7:31 PM Etsuro Fujita  wrote:
> Attached is an updated version of the patch.  Sorry for the delay.

I noticed that I forgot to add new files.  :-(.  Please find attached
an updated patch.

Best regards,
Etsuro Fujita


async-wip-2021-02-10-v2.patch
Description: Binary data


Re: Support for NSS as a libpq TLS backend

2021-02-10 Thread Daniel Gustafsson
> On 10 Feb 2021, at 08:23, Michael Paquier  wrote:
> 
> On Tue, Feb 09, 2021 at 10:30:52AM +0100, Daniel Gustafsson wrote:
>> It can be, it's not the most pressing patch scope reduction but everything
>> helps of course.
> 
> Okay.  I have spent some time on this one and finished it.

Thanks, I'll post a rebased version on top of this soon.

>> Thanks.  That patch is slightly more interesting in terms of reducing scope
>> here, but I also think it makes the test code a bit easier to digest when
>> certificate management is abstracted into the API rather than the job of the
>> testfile to perform.
> 
> That's my impression.  Still, I am wondering if there could be a
> different approach.  I need to think more about that first..

Another option could be to roll SSL config into PostgresNode and expose SSL
connections to every subsystem tested with TAP. Something like:

$node = get_new_node(..);
$node->setup_ssl(..);
$node->set_certificate(..);

That is a fair bit more work though, but perhaps we could then more easily
find (and/or prevent) bugs like the one fixed in a45bc8a4f6495072bc48ad40a5aa03.

--
Daniel Gustafsson   https://vmware.com/





Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-10 Thread Ranier Vilela
Em qua., 10 de fev. de 2021 às 01:44, Kyotaro Horiguchi <
horikyota@gmail.com> escreveu:

> At Tue, 9 Feb 2021 22:01:45 -0300, Ranier Vilela 
> wrote in
> > Hi Hackers,
> >
> > Per Coverity.
> >
> > Coverity complaints about pg_cryptohash_final function.
> > And I agree with Coverity, it's a bad design.
> > Its allows this:
> >
> > #define MY_RESULT_LENGTH 32
> >
> > function pgtest(char * buffer, char * text) {
> > pg_cryptohash_ctx *ctx;
> > uint8 digest[MY_RESULT_LENGTH];
> >
> > ctx = pg_cryptohash_create(PG_SHA512);
> > pg_cryptohash_init(ctx);
> > pg_cryptohash_update(ctx, (uint8 *) buffer, text);
> > pg_cryptohash_final(ctx, digest); // <--  CID 1446240 (#1 of 1):
> > Out-of-bounds access (OVERRUN)
> > pg_cryptohash_free(ctx);
> > return
> > }
>
> It seems to me that the above just means the caller must provide a
> digest buffer that fits the use. In the above example digest just must
> be 64 byte.  If Coverity complains so, what should do for the
> complaint is to fix the caller to provide a digest buffer of the
> correct size.
>
Exactly.


> Could you show the detailed context where Coverity complained?
>
Coverity complains about a memcpy call with a fixed size in a context
where the destination buffer, whose size can vary, is supplied by the caller.


>
> > Attached has a patch with suggestions to make things better.
>
> So it doesn't seem to me the right direction. Even if we are going to
> make pg_cryptohash_final to take the buffer length, it should
> error-out or assert-out if the length is too small rather than copy a
> part of the digest bytes. (In short, it would only be assertion-use.)
>
It is necessary to correct the interface so that the caller informs the
function of the size of the buffer it created.
I think it should error out, because the buffer can be malloc'd.
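
As a rough illustration of the kind of interface change I mean (a sketch
only, assuming a destination-length parameter is added; the helper below is
hypothetical):

/*
 * Sketch: pg_cryptohash_final() taking the destination buffer length and
 * failing if the buffer is too small for the digest of the context's
 * algorithm, instead of silently overrunning the caller's buffer.
 */
int
pg_cryptohash_final(pg_cryptohash_ctx *ctx, uint8 *dest, size_t destlen)
{
    size_t      digestlen = pg_cryptohash_digest_length(ctx);  /* hypothetical */

    if (destlen < digestlen)
        return -1;              /* caller's buffer is too small: error out */

    /* ... existing per-algorithm finalization, writing digestlen bytes ... */
    return 0;
}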

regards,
Ranier Vilela


Re: pg_get_first_normal_oid()?

2021-02-10 Thread Masahiko Sawada
On Wed, Feb 10, 2021 at 4:43 PM Joel Jacobson  wrote:
>
> Hi,
>
> I need to filter out any system catalog objects from SQL,
> and I've learned it's not possible to simply filter based on namespace name,
> since there are objects such as pg_am that don't have any namespace belonging,
> except indirectly via their handler, but since you can define a new access 
> method
> using an existing handler from the system catalog, there is no way to 
> distinguish
> your user created access handler from the system catalog access handlers
> only based on namespace name based filtering.
>
> After some digging I found this
>
>#define FirstNormalObjectId 16384
>
> in src/include/access/transam.h, which pg_dump.c and 14 other files are using 
> at 27 different places in the sources.
>
> Seems to be a popular and important fellow.
>
> I see this value doesn't change often, it was added back in 2005-04-13 in 
> commit 2193a856a229026673cbc56310cd0bddf7b5ea25.
>
> Is it safe to just hard-code in application code needing to know this cut-off 
> value?
>
> Or will we have a Bill Gates "640K ought to be enough for anybody" moment in 
> the foreseeable future,
> where this limit needs to be increased?
>
> If there is a risk we will, then maybe we should add a function such as 
> $SUBJECT to expose this value to SQL users who needs it?
>
> I see there has been a related discussion in the thread "Identifying 
> user-created objects"
>
> 
> https://www.postgresql.org/message-id/flat/CA%2Bfd4k7Zr%2ByQLYWF3O_KjAJyYYUZFBZ_dFchfBvq5bMj9GgKQw%40mail.gmail.com

As mentioned in that thread, it's still hard to distinguish between
user objects and system objects using only the OID, since we can create
objects with OIDs lower than FirstNormalObjectId by creating them in
single-user mode, so it was not enough for security purposes. I think
providing concrete use cases for the function would support this
proposal.

>
> However, this thread focused on security and wants to know if a specific oid 
> is user defined or not.
>
> I think pg_get_first_normal_oid() would be more useful than 
> pg_is_user_object(oid),
> since with pg_get_first_normal_oid() you could do filtering based on oid 
> indexes.
>
> Compare e.g.:
>
> SELECT * FROM pg_class WHERE oid >= pg_get_first_normal_oid()
>
> with..
>
> SELECT * FROM pg_class WHERE pg_is_user_object(oid) IS TRUE
>
> The first query could use the index on pg_class.oid,
> whereas I'm not mistaken, the second query would need a seq_scan to evaluate 
> pg_is_user_object() for each oid.

Yes. I've also considered the former approach, but I prioritized
readability and extensibility; it requires users to know that OIDs
greater than the first normal OID are assigned during normal multi-user
operation. Also, if in the future we add similar functions for other OID
bounds such as FirstGenbkiObjectId and FirstBootstrapObjectId, we will
end up writing things like 'WHERE oid >= pg_get_first_bootstrap_oid()
AND oid < pg_get_first_normal_oid()', which is not intuitive.
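
That said, if we did decide to expose the constant, the function itself would
be trivial; here is a minimal sketch (just an illustration, not an agreed
design):

/*
 * Sketch: expose FirstNormalObjectId to SQL so that queries can filter out
 * objects created at bootstrap/initdb time, e.g.
 *     SELECT * FROM pg_class WHERE oid >= pg_get_first_normal_oid();
 */
#include "postgres.h"

#include "access/transam.h"
#include "fmgr.h"

Datum
pg_get_first_normal_oid(PG_FUNCTION_ARGS)
{
    PG_RETURN_OID(FirstNormalObjectId);
}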

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Hou, Zhijie
> > > It did have performance gain, but I think it's not huge enough to
> > > ignore the extra's index cost.
> > > What do you think ?
> >
> > Yes... as you suspect, I'm afraid the benefit from parallel bitmap
> > scan may not compensate for the loss of the parallel insert operation.
> >
> > The loss is probably due to 1) more index page splits, 2) more buffer
> > writes (table and index), and 3) internal locks for things such as
> > relation extension and page content protection.  To investigate 3), we
> > should want something like [1], which tells us the wait event
> > statistics (wait count and time for each wait event) per session or
> > across the instance like Oracke, MySQL and EDB provides.  I want to
> continue this in the near future.
> 
> What would the result look like if you turn off
> parallel_leader_participation?  If the leader is freed from
> reading/writing the table and index, the index page splits and internal
> lock contention may decrease enough to recover part of the loss.
> 
> https://www.postgresql.org/docs/devel/parallel-plans.html
> 
> "In a parallel bitmap heap scan, one process is chosen as the leader. That
> process performs a scan of one or more indexes and builds a bitmap indicating
> which table blocks need to be visited. These blocks are then divided among
> the cooperating processes as in a parallel sequential scan. In other words,
> the heap scan is performed in parallel, but the underlying index scan is
> not."

If I disable parallel_leader_participation:

For max_parallel_workers_per_gather = 4, there is still a performance
degradation.

For max_parallel_workers_per_gather = 2, the performance degradation does not
happen in most cases. There is occasionally some noise (a performance
degradation), but most results (about 80%) are good.

Best regards,
houzj


postgres=# explain (ANALYZE, BUFFERS, VERBOSE, WAL)  insert into testscan 
select a from x where a<8 or (a%2=0 and a>19990);
   QUERY PLAN
-
 Gather  (cost=4272.20..1269111.15 rows=79918 width=0) (actual 
time=381.764..382.715 rows=0 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   Buffers: shared hit=407094 read=4 dirtied=1085 written=1158
   WAL: records=260498 bytes=17019359
   ->  Insert on public.testscan  (cost=3272.20..1260119.35 rows=0 width=0) 
(actual time=378.227..378.229 rows=0 loops=5)
 Buffers: shared hit=407094 read=4 dirtied=1085 written=1158
 WAL: records=260498 bytes=17019359
 Worker 0:  actual time=376.638..376.640 rows=0 loops=1
   Buffers: shared hit=88281 dirtied=236 written=337
   WAL: records=56167 bytes=3674994
 Worker 1:  actual time=377.889..377.892 rows=0 loops=1
   Buffers: shared hit=81231 dirtied=213 written=99
   WAL: records=52175 bytes=3386885
 Worker 2:  actual time=377.388..377.389 rows=0 loops=1
   Buffers: shared hit=82544 dirtied=232 written=279
   WAL: records=52469 bytes=3443843
 Worker 3:  actual time=377.733..377.734 rows=0 loops=1
   Buffers: shared hit=87728 dirtied=225 written=182
   WAL: records=56307 bytes=3660758
 ->  Parallel Bitmap Heap Scan on public.x  (cost=3272.20..1260119.35 
rows=19980 width=8) (actual time=5.832..14.787 rows=26000 loops=5)
   Output: x.a, NULL::integer
   Recheck Cond: ((x.a < 8) OR (x.a > 19990))
   Filter: ((x.a < 8) OR (((x.a % 2) = 0) AND (x.a > 
19990)))
   Rows Removed by Filter: 1
   Heap Blocks: exact=167
   Buffers: shared hit=1475
   Worker 0:  actual time=5.203..14.921 rows=28028 loops=1
 Buffers: shared hit=209
   Worker 1:  actual time=5.165..14.403 rows=26039 loops=1
 Buffers: shared hit=196
   Worker 2:  actual time=5.188..14.284 rows=26177 loops=1
 Buffers: shared hit=195
   Worker 3:  actual time=5.151..14.760 rows=28104 loops=1
 Buffers: shared hit=208
   ->  BitmapOr  (cost=3272.20..3272.20 rows=174813 width=0) 
(actual time=8.288..8.290 rows=0 loops=1)
 Buffers: shared hit=500
 ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1468.38 
rows=79441 width=0) (actual time=3.681..3.681 rows=7 loops=1)
   Index Cond: (x.a < 8)
   Buffers: shared hit=222
 ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1763.86 
rows=95372 width=0) (actual time=4.605..4.605 rows=10 loops=1)
   Index Cond: (x.a > 19990)
   Buffers: shared hit=278
 Planning:
   Buffers: shared hit=19
 Planning Time: 0.173 ms
 Execution Time: 382.776 ms
(47 rows)

Re: Asynchronous Append on postgres_fdw nodes.

2021-02-10 Thread Etsuro Fujita
On Thu, Feb 4, 2021 at 7:21 PM Etsuro Fujita  wrote:
> On Mon, Feb 1, 2021 at 12:06 PM Etsuro Fujita  wrote:
> > Rather than doing so, I'd like to propose to allow
> > FDWs to disable async execution of them in problematic cases by
> > themselves during executor startup in the first cut.  What I have in
> > mind for that is:
> >
> > 1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
> > to record, for each of the async-capable and non-async-capable
> > ForeignScan(s), the information on a connection to be used for the
> > ForeignScan into EState during BeginForeignScan().
> >
> > 2) After doing ExecProcNode() to each SubPlan and the main query tree
> > in InitPlan(), we give the FDW a chance to a) reconsider, for each of
> > the async-capable ForeignScan(s), whether the ForeignScan can be
> > executed asynchronously as planned, based on the information stored
> > into EState in #1, and then b) disable async execution of the
> > ForeignScan if not.
>
> s/ExecProcNode()/ExecInitNode()/.  Sorry for that.  I’ll post an
> updated patch for this in a few days.

I created a WIP patch for this.  For #2, I added a new callback
routine, ReconsiderAsyncForeignScan().  The routine for postgres_fdw,
postgresReconsiderAsyncForeignScan(), is pretty simple: async execution
of an async-capable ForeignScan is disabled if the connection used for
it is also used in other parts of the query plan tree, except in async
subplans just below the parent Append.  Here is a running example:

postgres=# create table t1 (a int, b int, c text);
postgres=# create table t2 (a int, b int, c text);
postgres=# create foreign table p1 (a int, b int, c text) server
server1 options (table_name 't1');
postgres=# create foreign table p2 (a int, b int, c text) server
server2 options (table_name 't2');
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# alter table pt attach partition p1 for values from (10) to (20);
postgres=# alter table pt attach partition p2 for values from (20) to (30);
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM')
from generate_series(0, 99) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM')
from generate_series(0, 99) i;
postgres=# analyze pt;
postgres=# create table loct (a int, b int);
postgres=# create foreign table ft (a int, b int) server server1
options (table_name 'loct');
postgres=# insert into ft select i, i from generate_series(0, 99) i;
postgres=# analyze ft;
postgres=# create view v as select * from ft;

postgres=# explain verbose select * from pt, v where pt.b = v.b and v.b = 99;
   QUERY PLAN
-
 Nested Loop  (cost=200.00..306.84 rows=2 width=21)
   Output: pt.a, pt.b, pt.c, ft.a, ft.b
   ->  Foreign Scan on public.ft  (cost=100.00..102.27 rows=1 width=8)
 Output: ft.a, ft.b
 Remote SQL: SELECT a, b FROM public.loct WHERE ((b = 99))
   ->  Append  (cost=100.00..204.55 rows=2 width=13)
 ->  Foreign Scan on public.p1 pt_1  (cost=100.00..102.27
rows=1 width=13)
   Output: pt_1.a, pt_1.b, pt_1.c
   Remote SQL: SELECT a, b, c FROM public.t1 WHERE ((b = 99))
 ->  Async Foreign Scan on public.p2 pt_2
(cost=100.00..102.27 rows=1 width=13)
   Output: pt_2.a, pt_2.b, pt_2.c
   Remote SQL: SELECT a, b, c FROM public.t2 WHERE ((b = 99))
(12 rows)

For this query, while p2 is executed asynchronously, p1 isn’t, as it
uses the same connection as ft.  BUT:

postgres=# create role view_owner SUPERUSER;
postgres=# create user mapping for view_owner server server1;
postgres=# alter view v owner to view_owner;

postgres=# explain verbose select * from pt, v where pt.b = v.b and v.b = 99;
   QUERY PLAN
-
 Nested Loop  (cost=200.00..306.84 rows=2 width=21)
   Output: pt.a, pt.b, pt.c, ft.a, ft.b
   ->  Foreign Scan on public.ft  (cost=100.00..102.27 rows=1 width=8)
 Output: ft.a, ft.b
 Remote SQL: SELECT a, b FROM public.loct WHERE ((b = 99))
   ->  Append  (cost=100.00..204.55 rows=2 width=13)
 ->  Async Foreign Scan on public.p1 pt_1
(cost=100.00..102.27 rows=1 width=13)
   Output: pt_1.a, pt_1.b, pt_1.c
   Remote SQL: SELECT a, b, c FROM public.t1 WHERE ((b = 99))
 ->  Async Foreign Scan on public.p2 pt_2
(cost=100.00..102.27 rows=1 width=13)
   Output: pt_2.a, pt_2.b, pt_2.c
   Remote SQL: SELECT a, b, c FROM public.t2 WHERE ((b = 99))
(12 rows)

In this setup, p1 is executed asynchronously, as ft doesn’t use the
same connection as p1.
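
For reference, the rough shape of the hook described above (a simplified
sketch; the signature and the helpers shown here are illustrative only, not
the actual WIP code):

/* New FdwRoutine callback (illustrative signature) */
typedef void (*ReconsiderAsyncForeignScan_function) (ForeignScanState *node,
                                                     EState *estate);

void
postgresReconsiderAsyncForeignScan(ForeignScanState *node, EState *estate)
{
    /*
     * If the connection used by this ForeignScan is also used elsewhere in
     * the plan tree (outside the async subplans directly under the parent
     * Append), fall back to synchronous execution for this node.
     */
    if (ConnectionUsedElsewhere(node, estate))  /* hypothetical helper */
        DisableAsyncExecution(node);            /* hypothetical helper */
}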

I added this to postgresReconsiderAsyncForeignScan() as well: even if
the connection isn’t used in the other parts, async execution of an
async-capable ForeignScan is disabled if the subplans of the Append
are a

RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Hou, Zhijie
> What are the results if disable the bitmap heap scan(Set enable_bitmapscan
> = off)? If that happens to be true, then we might also want to consider
> if in some way we can teach parallel insert to cost more in such cases.
> Another thing we can try is to integrate a parallel-insert patch with the
> patch on another thread [1] and see if that makes any difference but not
> sure if we want to go there at this stage unless it is simple to try that
> out?

If we disable bitmapscan, the performance degradation does not seem to happen.

[Parallel]
postgres=# explain (ANALYZE, BUFFERS, VERBOSE, WAL)  insert into testscan 
select a from x where a<8 or (a%2=0 and a>19990);
QUERY PLAN
---
 Gather  (cost=1000.00..2090216.68 rows=81338 width=0) (actual 
time=0.226..5488.455 rows=0 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   Buffers: shared hit=393364 read=1079535 dirtied=984 written=1027
   WAL: records=260400 bytes=16549513
   ->  Insert on public.testscan  (cost=0.00..2081082.88 rows=0 width=0) 
(actual time=5483.113..5483.114 rows=0 loops=4)
 Buffers: shared hit=393364 read=1079535 dirtied=984 written=1027
 WAL: records=260400 bytes=16549513
 Worker 0:  actual time=5483.116..5483.117 rows=0 loops=1
   Buffers: shared hit=36306 read=264288 dirtied=100 written=49
   WAL: records=23895 bytes=1575860
 Worker 1:  actual time=5483.220..5483.222 rows=0 loops=1
   Buffers: shared hit=39750 read=280476 dirtied=101 written=106
   WAL: records=26141 bytes=1685083
 Worker 2:  actual time=5482.844..5482.845 rows=0 loops=1
   Buffers: shared hit=38660 read=263713 dirtied=105 written=250
   WAL: records=25318 bytes=1657396
 Worker 3:  actual time=5483.272..5483.274 rows=0 loops=1
   Buffers: shared hit=278648 read=271058 dirtied=678 written=622
   WAL: records=185046 bytes=11631174
 ->  Parallel Seq Scan on public.x  (cost=0.00..2081082.88 rows=20334 
width=8) (actual time=4001.641..5287.248 rows=32500 loops=4)
   Output: x.a, NULL::integer
   Filter: ((x.a < 8) OR (((x.a % 2) = 0) AND (x.a > 
19990)))
   Rows Removed by Filter: 49967500
   Buffers: shared hit=1551 read=1079531
   Worker 0:  actual time=5335.456..5340.757 rows=11924 loops=1
 Buffers: shared hit=281 read=264288
   Worker 1:  actual time=5335.559..5341.766 rows=13049 loops=1
 Buffers: shared hit=281 read=280476
   Worker 2:  actual time=5335.534..5340.964 rows=12636 loops=1
 Buffers: shared hit=278 read=263712
   Worker 3:  actual time=0.015..5125.503 rows=92390 loops=1
 Buffers: shared hit=711 read=271055
 Planning:
   Buffers: shared hit=19
 Planning Time: 0.175 ms
 Execution Time: 5488.493 ms

[Serial]
postgres=# explain (ANALYZE, BUFFERS, VERBOSE, WAL)  insert into testscan 
select a from x where a<8 or (a%2=0 and a>19990);
QUERY PLAN
---
 Insert on public.testscan  (cost=0.00..5081085.52 rows=0 width=0) (actual 
time=19311.642..19311.643 rows=0 loops=1)
   Buffers: shared hit=392485 read=1079694 dirtied=934 written=933
   WAL: records=260354 bytes=16259841
   ->  Seq Scan on public.x  (cost=0.00..5081085.52 rows=81338 width=8) (actual 
time=0.010..18997.317 rows=12 loops=1)
 Output: x.a, NULL::integer
 Filter: ((x.a < 8) OR (((x.a % 2) = 0) AND (x.a > 19990)))
 Rows Removed by Filter: 199870001
 Buffers: shared hit=1391 read=1079691
 Planning:
   Buffers: shared hit=10
 Planning Time: 0.125 ms
 Execution Time: 19311.700 ms

Best regards,
houzj




Re: 64-bit XIDs in deleted nbtree pages

2021-02-10 Thread Masahiko Sawada
On Wed, Feb 10, 2021 at 10:53 AM Peter Geoghegan  wrote:
>
> On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan  wrote:
> > The first patch teaches nbtree to use 64-bit transaction IDs here, and
> > so makes it impossible to leak deleted nbtree pages. This patch is the
> > nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> > due to exactly the same set of problems.

Thank you for working on this!

>
> There is an unresolved question for my deleted page XID patch: what
> should it do about the vacuum_cleanup_index_scale_factor feature,
> which added an XID to the metapage (its btm_oldest_btpo_xact field). I
> refer to the work done by commit 857f9c36cda for Postgres 11 by
> Masahiko. It would be good to get your opinion on this as the original
> author of that feature, Masahiko.
>
> To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among
> all deleted pages in the index, so clearly it needs to be carefully
> considered in my patch to make the XIDs 64-bit. Even still, v1 of my
> patch from today more or less ignores the issue -- it just gets a
> 32-bit version of the oldest value as determined by the oldestBtpoXact
> XID tracking stuff (which is largely unchanged, except that it works
> with 64-bit Full Transaction Ids now).
>
> Obviously it is still possible for the 32-bit btm_oldest_btpo_xact
> field to wrap around in v1 of my patch. The obvious thing to do here
> is to add a new epoch metapage field, effectively making
> btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good
> idea. The only reason that we have the btm_oldest_btpo_xact field in
> the first place is to ameliorate the problem that the patch
> comprehensively solves! We should stop storing *any* XIDs in the
> metapage. (Besides, adding a new "epoch" field to the metapage would
> be relatively messy.)

I agree that btm_oldest_btpo_xact will no longer be necessary in terms
of recycling deleted pages.

The purpose of btm_oldest_btpo_xact is to prevent deleted pages from
being leaked. As you mentioned, it holds the oldest btpo.xact in
BTPageOpaqueData among all deleted pages in the index. Looking back to
when we developed the INDEX_CLEANUP option: if we skip index cleanup
(meaning both ambulkdelete and amvacuumcleanup), there was a problem
in btree indexes that deleted pages could never be recycled if the XID
wrapped around. So the idea behind btm_oldest_btpo_xact is that we
remember the oldest btpo.xact among all deleted pages and do
btvacuumscan() if this value is older than the global xmin (meaning
there is at least one recyclable page). That way, we can recycle the
deleted pages without leaking them (of course, unless INDEX_CLEANUP is
disabled).

Given that we can guarantee that deleted pages are never leaked by
using 64-bit XIDs, I also think we don't need this value. We can do
amvacuumcleanup only if the table receives enough insertions to update
the statistics (i.e., the vacuum_cleanup_index_scale_factor check). I
think this is more desirable behavior. Not skipping amvacuumcleanup
when there is even one deleted page that we can recycle is very
conservative.

Considering your idea of keeping track of newly deleted pages in the
metapage, I can see a little value in keeping btm_oldest_btpo_xact and
making it a 64-bit XID.  I describe this below.

>
> Here is a plan that allows us to stop storing any kind of XID in the
> metapage in all cases:
>
> 1. Stop maintaining the oldest XID among all deleted pages in the
> entire nbtree index during VACUUM. So we can remove all of the
> BTVacState.oldestBtpoXact XID tracking stuff, which is currently
> something that even _bt_pagedel() needs special handling for.
>
> 2. Stop considering the btm_oldest_btpo_xact metapage field in
> _bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only
> cares about maintaining reasonably accurate statistics for the index.
> Which is really how the vacuum_cleanup_index_scale_factor feature was
> intended to work all along, anyway -- ISTM that the oldestBtpoXact
> stuff was always just an afterthought to paper-over this annoying
> 32-bit XID issue.
>
> 3. We cannot actually remove the btm_oldest_btpo_xact XID field from
> the metapage, because of course that would change the BTMetaPageData
> struct layout, which breaks on-disk compatibility. But why not use it
> for something useful instead? _bt_update_meta_cleanup_info() can use
> the same field to store the number of "newly deleted" pages from the
> last btbulkdelete() instead. (See my email from earlier for the
> definition of "newly deleted".)
>
> 4. Now _bt_vacuum_needs_cleanup() can once again consider the
> btm_oldest_btpo_xact metapage field -- except in a totally different
> way, because now it means something totally different: "newly deleted
> pages during last btbulkdelete() call" (per item 3). If this # pages
> is very high then we probably should do a full call to btvacuumscan()
> -- _bt_vacuum_needs_cleanup() will return true to make that happen.
>
> It's unlikely but st

Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Amit Langote
On Wed, Feb 10, 2021 at 5:50 PM Amit Langote  wrote:
> On Wed, Feb 10, 2021 at 5:24 PM Amit Kapila  wrote:
> > On Wed, Feb 10, 2021 at 1:00 PM Amit Langote  
> > wrote:
> > > On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow  
> > > wrote:
> > > > It's parallel UNSAFE because it's not safe or even possible to have a
> > > > parallel plan for this.
> > > > (as UNSAFE is the max hazard level, no point in referencing
> > > > context->max_interesting).
> > > > And there are existing cases of bailing out and not doing further
> > > > safety checks (even your v15_delta.diff patch placed code - for
> > > > bailing out on "ON CONFLICT ... DO UPDATE" - underneath one such
> > > > existing case in max_parallel_hazard_walker()):
> > > >
> > > > else if (IsA(node, Query))
> > > > {
> > > > Query  *query = (Query *) node;
> > > >
> > > > /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > > > if (query->rowMarks != NULL)
> > > > {
> > > > context->max_hazard = PROPARALLEL_UNSAFE;
> > > > return true;
> > > > }
> > >
> > > In my understanding, the max_parallel_hazard() query tree walk is to
> > > find constructs that are parallel unsafe in that they call code that
> > > can't run in parallel mode.  For example, FOR UPDATE/SHARE on
> > > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > > support being called in parallel mode.  Likewise ON CONFLICT ... DO
> > > UPDATE calls heap_update() which doesn't support parallelism.  I'm not
> > > aware of any such hazards downstream of ExecValuesScan().
> > >
> > > > >You're trying to
> > > > > add something that bails based on second-guessing that a parallel path
> > > > > would not be chosen, which I find somewhat objectionable.
> > > > >
> > > > > If the main goal of bailing out is to avoid doing the potentially
> > > > > expensive modification safety check on the target relation, maybe we
> > > > > should try to somehow make the check less expensive.  I remember
> > > > > reading somewhere in the thread about caching the result of this check
> > > > > in relcache, but haven't closely studied the feasibility of doing so.
> > > > >
> > > >
> > > > There's no "second-guessing" involved here.
> > > > There is no underlying way of dividing up the VALUES data of
> > > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > > was updated to produce a parallel-plan for the "INSERT...VALUES" case
> > > > (apart from the fact that spawning off parallel workers to insert that
> > > > data would almost always result in worse performance than a
> > > > non-parallel plan...)
> > > > The division of work for parallel workers is part of the table AM
> > > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> > >
> > > I don't disagree that the planner would not normally assign a parallel
> > > path simply to pull values out of a VALUES list mentioned in the
> > > INSERT command, but deciding something based on the certainty of it in
> > > an earlier planning phase seems odd to me.  Maybe that's just me
> > > though.
> > >
> >
> > I think it is more of a case where neither is a need for parallelism
> > nor we want to support parallelism of it. The other possibility for
> > such a check could be at some earlier phase say in standard_planner
> > [1] where we are doing checks for other constructs where we don't want
> > parallelism (I think the check for 'parse->hasModifyingCTE' is quite
> > similar). If you see in that check as well we just assume other
> > operations to be in the category of parallel-unsafe. I think we should
> > rule out such cases earlier than later. Do you have better ideas than
> > what Greg has done to avoid parallelism for such cases?
> >
> > [1] -
> > standard_planner()
> > {
> > ..
> > if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> > IsUnderPostmaster &&
> > parse->commandType == CMD_SELECT &&
> > !parse->hasModifyingCTE &&
> > max_parallel_workers_per_gather > 0 &&
> > !IsParallelWorker())
> > {
> > /* all the cheap tests pass, so scan the query tree */
> > glob->maxParallelHazard = max_parallel_hazard(parse);
> > glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
> > }
> > else
> > {
> > /* skip the query tree scan, just assume it's unsafe */
> > glob->maxParallelHazard = PROPARALLEL_UNSAFE;
> > glob->parallelModeOK = false;
> > }
>
> Yeah, maybe having the block I was commenting on, viz.:
>
> +   /*
> +* If there is no underlying SELECT, a parallel table-modification
> +* operation is not possible (nor desirable).
> +*/
> +   hasSubQuery = false;
> +   foreach(lc, parse->rtable)
> +   {
> +   rte = lfirst_node(RangeTblEntry, lc);
> +   if (rte->rtekind == RTE_SUBQUERY)
> +   {
> +   hasSubQuery = true;
> +   break;
> +   }
> +   }
> +   if (!hasSubQuery)
> +   return PROPARALLEL_UNSAFE;
>
> before the standard_planner() block you quoted might be a good 

Re: repeated decoding of prepared transactions

2021-02-10 Thread Amit Kapila
On Wed, Feb 10, 2021 at 1:40 PM Markus Wanner
 wrote:
>
> On 10.02.21 07:32, Amit Kapila wrote:
> > On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian  wrote:
> >> But the other side of the problem is that ,without this, if the
> >> prepared transaction is prior to a consistent snapshot when decoding
> >> starts/restarts, then only the "commit prepared" is sent to downstream
> >> (as seen in the test scenario I shared above), and downstream has to
> >> error away the commit prepared because it does not have the
> >> corresponding prepared transaction.
> >
> > I think it is not only simple error handling, it is required for
> > data-consistency. We need to send the transactions whose commits are
> > encountered after a consistent snapshot is reached.
>
> I'm with Ashutosh here.  If a replica is properly in sync, it knows
> about prepared transactions and all the gids of those.  Sending the
> transactional changes and the prepare again is inconsistent.
>
> The point of a two-phase transaction is to have two phases.  An output
> plugin must have the chance of treating them as independent events.
>

I am not sure I understand what problem you are facing in dealing with
this in the output plugin; it is explained in the docs and Ajin also
pointed out the same. Ajin and I have explained the design constraints
on the publisher side due to which we have done it this way.
Do you have any better ideas to deal with this?

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Amit Langote
On Wed, Feb 10, 2021 at 5:52 PM tsunakawa.ta...@fujitsu.com
 wrote:
> From: Amit Langote 
> > Just to be clear, I'm not suggesting that we should put effort into
> > making INSERT ... VALUES run in parallel.  I'm just raising my concern
> > about embedding the assumption in max_parallel_hazard() that it will
> > never make sense to do so.
>
> I'm sorry I misunderstood your suggestion.  So, you're suggesting that it may 
> be better to place the VALUES existence check outside max_parallel_hazard().  
> (I may be a bit worried if I may misunderstanding in a different way.)

To add context to my comments, here's the block of code in the patch I
was commenting on:

+   /*
+* If there is no underlying SELECT, a parallel table-modification
+* operation is not possible (nor desirable).
+*/
+   hasSubQuery = false;
+   foreach(lc, parse->rtable)
+   {
+   rte = lfirst_node(RangeTblEntry, lc);
+   if (rte->rtekind == RTE_SUBQUERY)
+   {
+   hasSubQuery = true;
+   break;
+   }
+   }
+   if (!hasSubQuery)
+   return PROPARALLEL_UNSAFE;

For a modification query, this makes max_parallel_hazard() report that
it is unsafe to parallelize the query if it doesn't contain a subquery
RTE, or contains only a VALUES RTE.

I was trying to say that max_parallel_hazard() seems to be the wrong
place to reject parallelism for a modification merely because there are
no subquery RTEs in the query.  Although now I'm thinking that maybe
it's okay as long as the check is appropriately placed.  I shared one
suggestion in my reply to Amit K's email.

> The description of max_parallel_hazard() gave me an impression that this is 
> the right place to check VALUES, because its role can be paraphrased in 
> simpler words like "Tell you if the given Query is safe for parallel 
> execution."
>
> In that regard, the standard_planner()'s if conditions that check 
> Query->commandType and Query->hasModifyingCTE can be moved into 
> max_parallel_hazard() too.
>
> But looking closer to the description, it says "Returns the worst function 
> hazard."  Function hazard?  Should this function only check functions?  Or do 
> we want to modify this description and get max_parallel_hazard() to provide 
> more service?

Yeah, updating the description to be more general may make sense.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Hou, Zhijie
> > >
> > > else if (IsA(node, Query))
> > > {
> > > Query  *query = (Query *) node;
> > >
> > > /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > > if (query->rowMarks != NULL)
> > > {
> > > context->max_hazard = PROPARALLEL_UNSAFE;
> > > return true;
> > > }
> >
> > In my understanding, the max_parallel_hazard() query tree walk is to
> > find constructs that are parallel unsafe in that they call code that
> > can't run in parallel mode.  For example, FOR UPDATE/SHARE on
> > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > support being called in parallel mode.  Likewise ON CONFLICT ... DO
> > UPDATE calls heap_update() which doesn't support parallelism.  I'm not
> > aware of any such hazards downstream of ExecValuesScan().
> >
> > > >You're trying to
> > > > add something that bails based on second-guessing that a parallel
> > > >path  would not be chosen, which I find somewhat objectionable.
> > > >
> > > > If the main goal of bailing out is to avoid doing the potentially
> > > > expensive modification safety check on the target relation, maybe
> > > > we should try to somehow make the check less expensive.  I
> > > > remember reading somewhere in the thread about caching the result
> > > > of this check in relcache, but haven't closely studied the feasibility
> of doing so.
> > > >
> > >
> > > There's no "second-guessing" involved here.
> > > There is no underlying way of dividing up the VALUES data of
> > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > was updated to produce a parallel-plan for the "INSERT...VALUES"
> > > case (apart from the fact that spawning off parallel workers to
> > > insert that data would almost always result in worse performance
> > > than a non-parallel plan...) The division of work for parallel
> > > workers is part of the table AM
> > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> >
> > I don't disagree that the planner would not normally assign a parallel
> > path simply to pull values out of a VALUES list mentioned in the
> > INSERT command, but deciding something based on the certainty of it in
> > an earlier planning phase seems odd to me.  Maybe that's just me
> > though.
> >
> 
> I think it is more of a case where neither is a need for parallelism nor
> we want to support parallelism of it. The other possibility for such a check
> could be at some earlier phase say in standard_planner [1] where we are
> doing checks for other constructs where we don't want parallelism (I think
> the check for 'parse->hasModifyingCTE' is quite similar). If you see in
> that check as well we just assume other operations to be in the category
> of parallel-unsafe. I think we should rule out such cases earlier than later.
> Do you have better ideas than what Greg has done to avoid parallelism for
> such cases?
> 
> [1] -
> standard_planner()
> {
> ..
> if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 && IsUnderPostmaster &&
> parse->commandType == CMD_SELECT &&
> !parse->hasModifyingCTE &&
> max_parallel_workers_per_gather > 0 &&
> !IsParallelWorker())
> {
> /* all the cheap tests pass, so scan the query tree */
> glob->maxParallelHazard = max_parallel_hazard(parse); parallelModeOK =
> glob->(glob->maxParallelHazard != PROPARALLEL_UNSAFE);
> }
> else
> {
> /* skip the query tree scan, just assume it's unsafe */
> glob->maxParallelHazard = PROPARALLEL_UNSAFE; parallelModeOK = false;
> }

+1.

In the current parallel_dml option patch, I put this check and some other
high-level checks in a separate function called is_parallel_possible_for_modify.


- * PROPARALLEL_UNSAFE, PROPARALLEL_RESTRICTED, PROPARALLEL_SAFE
+ * Check at a high-level if parallel mode is able to be used for the specified
+ * table-modification statement.
+ * It's not possible in the following cases:
+ *
+ *  1) enable_parallel_dml is off
+ *  2) UPDATE or DELETE command
+ *  3) INSERT...ON CONFLICT...DO UPDATE
+ *  4) INSERT without SELECT on a relation
+ *  5) the reloption parallel_dml_enabled is not set for the target table
+ *
+ * (Note: we don't do in-depth parallel-safety checks here, we do only the
+ * cheaper tests that can quickly exclude obvious cases for which
+ * parallelism isn't supported, to avoid having to do further parallel-safety
+ * checks for these)
  */
+bool
+is_parallel_possible_for_modify(Query *parse)
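
To make the intent concrete, the body is roughly along these lines (a
simplified outline only; the reloption lookup in 5) and some details are
omitted here):

bool
is_parallel_possible_for_modify(Query *parse)
{
    ListCell   *lc;
    bool        hasSubQuery;

    /* 1) the parallel DML GUC must be enabled */
    if (!enable_parallel_dml)
        return false;

    /* 2) only supported modify commands (currently INSERT) qualify */
    if (!IsModifySupportedInParallelMode(parse->commandType))
        return false;

    /* 3) INSERT ... ON CONFLICT ... DO UPDATE is not supported */
    if (parse->onConflict != NULL &&
        parse->onConflict->action == ONCONFLICT_UPDATE)
        return false;

    /* 4) there must be an underlying SELECT (a subquery RTE) */
    hasSubQuery = false;
    foreach(lc, parse->rtable)
    {
        RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);

        if (rte->rtekind == RTE_SUBQUERY)
        {
            hasSubQuery = true;
            break;
        }
    }
    if (!hasSubQuery)
        return false;

    /* 5) parallel_dml_enabled reloption check omitted in this outline */

    return true;
}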





And I call the function at an earlier place, like the following:


if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
IsUnderPostmaster &&
(parse->commandType == CMD_SELECT ||
-   (enable_parallel_dml &&
-   IsModifySupportedInParallelMode(parse->commandType))) &&
+   is_parallel_possible_for_modify(parse)) &&
!parse->hasModifyingCTE &&
max_parallel_workers_per_gather > 0 &&
!IsParallelWorker())


If this looks good, may

Re: [WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR

2021-02-10 Thread Massimo Fidanza
Hi John,

I have never built PostgreSQL from source, so I need some information on how
to apply your patch and do some testing. I can't review your code because I
know nothing about PostgreSQL internals and only basic C. I am mainly a
PL/SQL programmer, with experience in PHP, Python and JavaScript. If I can
make some contribution I will be happy, but I need some help.

Massimo

Il giorno dom 7 feb 2021 alle ore 22:35 Dent John  ha
scritto:

> Hi Massimo,
>
> Thanks for the interest, and my apologies for the late reply.
>
> I’m not particularly abandoning it, but I don’t have particular reason to
> make further changes at the moment. Far as I’m concerned it works, and the
> main question is whether it is acceptable and useful.
>
> I’d be happy if you have feedback that evolves it or might push it up the
> queue for commitfest review.
>
> d.
>
> On 18 Jan 2021, at 23:09, Massimo Fidanza  wrote:
>
> This is an interesting feature, but it seems that the author has abandoned
> development, what happens now? Will this be postponed from commitfest to
> commitfest and never be taken over by anyone?
>
> Massimo.
>
> Il giorno ven 6 mar 2020 alle ore 22:36 Dent John  ha
> scritto:
>
>> > On 22 Feb 2020, at 10:38, Dent John  wrote:
>> >
>> >> On 18 Feb 2020, at 03:03, Thomas Munro  wrote:
>> >>
>> >> From the trivialities department, I see a bunch of warnings about
>> >> local declaration placement (we're still using C90 rules for those by
>> >> project policy):
>> >>
>> >> […]
>> >
>> > […]
>>
>> My bad. I missed on declaration.
>>
>> Another patch attached.
>>
>> d.
>>
>
>


PostgreSQL and Flashback Database

2021-02-10 Thread ROS Didier
Hi
  My company is looking for a team of developers to implement the 
"flashback database" functionality in PostgreSQL.
  Do you think it's feasible to implement? How many days of 
development would it take?

  Thanks in advance

Best Regards
Didier ROS
E.D.F







RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> Just to be clear, I'm not suggesting that we should put effort into
> making INSERT ... VALUES run in parallel.  I'm just raising my concern
> about embedding the assumption in max_parallel_hazard() that it will
> never make sense to do so.

I'm sorry I misunderstood your suggestion.  So, you're suggesting that it may 
be better to place the VALUES existence check outside max_parallel_hazard().  
(I'm a bit worried that I may be misunderstanding in yet another way.)

The description of max_parallel_hazard() gave me the impression that this is 
the right place to check VALUES, because its role can be paraphrased in simpler 
words as "tell you whether the given Query is safe for parallel execution."  In 
that regard, the standard_planner() if-conditions that check 
Query->commandType and Query->hasModifyingCTE could be moved into 
max_parallel_hazard() too.

But looking closer at the description, it says "Returns the worst function 
hazard."  Function hazard?  Should this function only check functions?  Or do 
we want to modify this description and have max_parallel_hazard() provide 
more service?


/*
 * max_parallel_hazard
 *  Find the worst parallel-hazard level in the given query
 *
 * Returns the worst function hazard property (the earliest in this list:
 * PROPARALLEL_UNSAFE, PROPARALLEL_RESTRICTED, PROPARALLEL_SAFE) that can
 * be found in the given parsetree.  We use this to find out whether the query
 * can be parallelized at all.  The caller will also save the result in
 * PlannerGlobal so as to short-circuit checks of portions of the querytree
 * later, in the common case where everything is SAFE.
 */
char
max_parallel_hazard(Query *parse)


Regards
Takayuki Tsunakawa




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Amit Langote
On Wed, Feb 10, 2021 at 5:24 PM Amit Kapila  wrote:
> On Wed, Feb 10, 2021 at 1:00 PM Amit Langote  wrote:
> > On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow  wrote:
> > > It's parallel UNSAFE because it's not safe or even possible to have a
> > > parallel plan for this.
> > > (as UNSAFE is the max hazard level, no point in referencing
> > > context->max_interesting).
> > > And there are existing cases of bailing out and not doing further
> > > safety checks (even your v15_delta.diff patch placed code - for
> > > bailing out on "ON CONFLICT ... DO UPDATE" - underneath one such
> > > existing case in max_parallel_hazard_walker()):
> > >
> > > else if (IsA(node, Query))
> > > {
> > > Query  *query = (Query *) node;
> > >
> > > /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > > if (query->rowMarks != NULL)
> > > {
> > > context->max_hazard = PROPARALLEL_UNSAFE;
> > > return true;
> > > }
> >
> > In my understanding, the max_parallel_hazard() query tree walk is to
> > find constructs that are parallel unsafe in that they call code that
> > can't run in parallel mode.  For example, FOR UPDATE/SHARE on
> > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > support being called in parallel mode.  Likewise ON CONFLICT ... DO
> > UPDATE calls heap_update() which doesn't support parallelism.  I'm not
> > aware of any such hazards downstream of ExecValuesScan().
> >
> > > >You're trying to
> > > > add something that bails based on second-guessing that a parallel path
> > > > would not be chosen, which I find somewhat objectionable.
> > > >
> > > > If the main goal of bailing out is to avoid doing the potentially
> > > > expensive modification safety check on the target relation, maybe we
> > > > should try to somehow make the check less expensive.  I remember
> > > > reading somewhere in the thread about caching the result of this check
> > > > in relcache, but haven't closely studied the feasibility of doing so.
> > > >
> > >
> > > There's no "second-guessing" involved here.
> > > There is no underlying way of dividing up the VALUES data of
> > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > was updated to produce a parallel-plan for the "INSERT...VALUES" case
> > > (apart from the fact that spawning off parallel workers to insert that
> > > data would almost always result in worse performance than a
> > > non-parallel plan...)
> > > The division of work for parallel workers is part of the table AM
> > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> >
> > I don't disagree that the planner would not normally assign a parallel
> > path simply to pull values out of a VALUES list mentioned in the
> > INSERT command, but deciding something based on the certainty of it in
> > an earlier planning phase seems odd to me.  Maybe that's just me
> > though.
> >
>
> I think it is more of a case where neither is a need for parallelism
> nor we want to support parallelism of it. The other possibility for
> such a check could be at some earlier phase say in standard_planner
> [1] where we are doing checks for other constructs where we don't want
> parallelism (I think the check for 'parse->hasModifyingCTE' is quite
> similar). If you see in that check as well we just assume other
> operations to be in the category of parallel-unsafe. I think we should
> rule out such cases earlier than later. Do you have better ideas than
> what Greg has done to avoid parallelism for such cases?
>
> [1] -
> standard_planner()
> {
> ..
> if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> IsUnderPostmaster &&
> parse->commandType == CMD_SELECT &&
> !parse->hasModifyingCTE &&
> max_parallel_workers_per_gather > 0 &&
> !IsParallelWorker())
> {
> /* all the cheap tests pass, so scan the query tree */
> glob->maxParallelHazard = max_parallel_hazard(parse);
> glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
> }
> else
> {
> /* skip the query tree scan, just assume it's unsafe */
> glob->maxParallelHazard = PROPARALLEL_UNSAFE;
> glob->parallelModeOK = false;
> }

Yeah, maybe having the block I was commenting on, viz.:

+   /*
+* If there is no underlying SELECT, a parallel table-modification
+* operation is not possible (nor desirable).
+*/
+   hasSubQuery = false;
+   foreach(lc, parse->rtable)
+   {
+   rte = lfirst_node(RangeTblEntry, lc);
+   if (rte->rtekind == RTE_SUBQUERY)
+   {
+   hasSubQuery = true;
+   break;
+   }
+   }
+   if (!hasSubQuery)
+   return PROPARALLEL_UNSAFE;

before the standard_planner() block you quoted might be a good idea.
So something like this:

+   /*
+* If there is no underlying SELECT, a parallel table-modification
+* operation is not possible (nor desirable).
+*/
+   rangeTablehasSubQuery = false;
+   foreach(lc, parse->rtable)
+   {
+   rte = lfirst_node(RangeTblEntry, lc);

Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Amit Kapila
On Wed, Feb 10, 2021 at 11:13 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
>
> The loss is probably due to 1) more index page splits, 2) more buffer writes 
> (table and index), and 3) internal locks for things such as relation 
> extension and page content protection.  To investigate 3), we should want 
> something like [1], which tells us the wait event statistics (wait count and 
> time for each wait event) per session or across the instance like Oracke, 
> MySQL and EDB provides.  I want to continue this in the near future.
>
>

I think we might mitigate such losses with a different patch where we
can do a bulk insert for INSERT ... SELECT, something like we are
discussing in the other thread [1]. I wonder if these performance
characteristics are due to the underlying bitmap heap scan. What are
the results if we disable the bitmap heap scan (set enable_bitmapscan
= off)? If that happens to be true, then we might also want to
consider whether we can somehow teach parallel insert to cost more in
such cases. Another thing we can try is to integrate the
parallel-insert patch with the patch on the other thread [1] and see
if that makes any difference, but I'm not sure if we want to go there
at this stage unless it is simple to try out.

[1] - 
https://www.postgresql.org/message-id/20200508072545.GA9701%40telsasoft.com

-- 
With Regards,
Amit Kapila.




Re: WIP: WAL prefetch (another approach)

2021-02-10 Thread Thomas Munro
On Thu, Feb 4, 2021 at 1:40 PM Tomas Vondra
 wrote:
> With master, it'd take ~16000 seconds to catch up. I don't have the
> exact number, because I got tired of waiting, but the estimate is likely
> accurate (judging by other tests and how regular the progress is).
>
> With WAL prefetching enabled (I bumped up the buffer to 2MB, and
> prefetch limit to 500, but that was mostly just arbitrary choice), it
> finishes in ~3200 seconds. This includes replication of the pgbench
> initialization, which took ~200 seconds and where prefetching is mostly
> useless. That's a damn pretty improvement, I guess!

Hi Tomas,

Sorry for my slow response -- I've been catching up after some
vacation time.  Thanks very much for doing all this testing work!
Those results are very good, and it's nice to see such compelling
cases even with FPI enabled.

I'm hoping to commit this in the next few weeks.  There are a few
little todos to tidy up, and I need to do some more review/testing of
the error handling and edge cases.  Any ideas on how to battle test it
are very welcome.  I'm also currently testing how it interacts with
some other patches that are floating around.  More soon.

> #recovery_prefetch = on  # whether to prefetch pages logged with FPW
> #recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW
>
> but clearly that comment is only for recovery_prefetch_fpw, the first
> GUC enables prefetching in general.

Ack, thanks.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Amit Kapila
On Wed, Feb 10, 2021 at 1:00 PM Amit Langote  wrote:
>
> On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow  wrote:
> > On Wed, Feb 10, 2021 at 2:39 PM Amit Langote  
> > wrote:
> > >
> > > > The code check that you have identified above ensures that the INSERT
> > > > has an underlying SELECT, because the planner won't (and shouldn't
> > > > anyway) generate a parallel plan for INSERT...VALUES, so there is no
> > > > point doing any parallel-safety checks in this case.
> > > > It just so happens that the problem test case uses INSERT...VALUES -
> > > > and it shouldn't have triggered the parallel-safety checks for
> > > > parallel INSERT for this case anyway, because INSERT...VALUES can't
> > > > (and shouldn't) be parallelized.
> > >
> > > AFAICS, max_parallel_hazard() path never bails from doing further
> > > safety checks based on anything other than finding a query component
> > > whose hazard level crosses context->max_interesting.
> >
> > It's parallel UNSAFE because it's not safe or even possible to have a
> > parallel plan for this.
> > (as UNSAFE is the max hazard level, no point in referencing
> > context->max_interesting).
> > And there are existing cases of bailing out and not doing further
> > safety checks (even your v15_delta.diff patch placed code - for
> > bailing out on "ON CONFLICT ... DO UPDATE" - underneath one such
> > existing case in max_parallel_hazard_walker()):
> >
> > else if (IsA(node, Query))
> > {
> > Query  *query = (Query *) node;
> >
> > /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > if (query->rowMarks != NULL)
> > {
> > context->max_hazard = PROPARALLEL_UNSAFE;
> > return true;
> > }
>
> In my understanding, the max_parallel_hazard() query tree walk is to
> find constructs that are parallel unsafe in that they call code that
> can't run in parallel mode.  For example, FOR UPDATE/SHARE on
> traditional heap AM tuples calls AssignTransactionId() which doesn't
> support being called in parallel mode.  Likewise ON CONFLICT ... DO
> UPDATE calls heap_update() which doesn't support parallelism.  I'm not
> aware of any such hazards downstream of ExecValuesScan().
>
> > >You're trying to
> > > add something that bails based on second-guessing that a parallel path
> > > would not be chosen, which I find somewhat objectionable.
> > >
> > > If the main goal of bailing out is to avoid doing the potentially
> > > expensive modification safety check on the target relation, maybe we
> > > should try to somehow make the check less expensive.  I remember
> > > reading somewhere in the thread about caching the result of this check
> > > in relcache, but haven't closely studied the feasibility of doing so.
> > >
> >
> > There's no "second-guessing" involved here.
> > There is no underlying way of dividing up the VALUES data of
> > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > was updated to produce a parallel-plan for the "INSERT...VALUES" case
> > (apart from the fact that spawning off parallel workers to insert that
> > data would almost always result in worse performance than a
> > non-parallel plan...)
> > The division of work for parallel workers is part of the table AM
> > (scan) implementation, which is not invoked for "INSERT...VALUES".
>
> I don't disagree that the planner would not normally assign a parallel
> path simply to pull values out of a VALUES list mentioned in the
> INSERT command, but deciding something based on the certainty of it in
> an earlier planning phase seems odd to me.  Maybe that's just me
> though.
>

I think it is more a case where there is neither a need for
parallelism nor a desire to support parallelism for it. The other
possibility would be to perform such a check at some earlier phase,
say in standard_planner() [1], where we already check for other
constructs for which we don't want parallelism (I think the check for
'parse->hasModifyingCTE' is quite similar). In that check, too, we
simply assume the other operations to be parallel-unsafe. I think we
should rule out such cases earlier rather than later. Do you have
better ideas than what Greg has done to avoid parallelism for such
cases? (For illustration, a hypothetical sketch of such an early
bail-out follows the code excerpt below.)

[1] -
standard_planner()
{
..
    if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
        IsUnderPostmaster &&
        parse->commandType == CMD_SELECT &&
        !parse->hasModifyingCTE &&
        max_parallel_workers_per_gather > 0 &&
        !IsParallelWorker())
    {
        /* all the cheap tests pass, so scan the query tree */
        glob->maxParallelHazard = max_parallel_hazard(parse);
        glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
    }
    else
    {
        /* skip the query tree scan, just assume it's unsafe */
        glob->maxParallelHazard = PROPARALLEL_UNSAFE;
        glob->parallelModeOK = false;
    }
..
}

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread Amit Langote
On Wed, Feb 10, 2021 at 5:03 PM tsunakawa.ta...@fujitsu.com
 wrote:
> From: Amit Langote 
> > On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow 
> > wrote:
> > > There's no "second-guessing" involved here.
> > > There is no underlying way of dividing up the VALUES data of
> > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > was updated to produce a parallel-plan for the "INSERT...VALUES" case
> > > (apart from the fact that spawning off parallel workers to insert that
> > > data would almost always result in worse performance than a
> > > non-parallel plan...)
> > > The division of work for parallel workers is part of the table AM
> > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> >
> > I don't disagree that the planner would not normally assign a parallel
> > path simply to pull values out of a VALUES list mentioned in the
> > INSERT command, but deciding something based on the certainty of it in
> > an earlier planning phase seems odd to me.  Maybe that's just me
> > though.
>
> In terms of competitiveness, Oracle does not run INSERT VALUES in parallel:
>
> https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-6626C70C-876C-47A4-8C01-9B66574062D8
>
> "The INSERT VALUES statement is never executed in parallel."
>
>
> And SQL Server doesn't either:
>
> https://docs.microsoft.com/en-us/sql/relational-databases/query-processing-architecture-guide?view=sql-server-ver15
>
> "Starting with SQL Server 2016 (13.x) and database compatibility level 130, 
> the INSERT … SELECT statement can be executed in parallel when inserting into 
> heaps or clustered columnstore indexes (CCI), and using the TABLOCK hint."

Just to be clear, I'm not suggesting that we should put effort into
making INSERT ... VALUES run in parallel.  I'm just raising my concern
about embedding the assumption in max_parallel_hazard() that it will
never make sense to do so.

Although, maybe there are other more pressing concerns to resolve, so
I will not insist too much on doing anything about this.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: repeated decoding of prepared transactions

2021-02-10 Thread Markus Wanner

On 10.02.21 07:32, Amit Kapila wrote:

On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian  wrote:

But the other side of the problem is that ,without this, if the
prepared transaction is prior to a consistent snapshot when decoding
starts/restarts, then only the "commit prepared" is sent to downstream
(as seen in the test scenario I shared above), and downstream has to
error away the commit prepared because it does not have the
corresponding prepared transaction.


I think it is not only simple error handling, it is required for
data-consistency. We need to send the transactions whose commits are
encountered after a consistent snapshot is reached.


I'm with Ashutosh here.  If a replica is properly in sync, it knows
about the prepared transactions and their gids.  Sending the
transactional changes and the PREPARE again is inconsistent.


The point of a two-phase transaction is to have two phases.  An output
plugin must have the chance to treat them as independent events.  Once
a PREPARE is confirmed, it must not be sent again, even if the
transaction is still in progress and its changes are not yet visible
on the origin node.
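
To make that concrete, here is a purely illustrative sketch; the helper
names are made up and do not correspond to actual PostgreSQL APIs.  A
subscriber that has confirmed a PREPARE remembers its gid, so a re-sent
PREPARE for the same gid is, from its point of view, an inconsistent
stream:

static void
handle_prepare(const char *gid)
{
    if (gid_already_prepared(gid))          /* hypothetical helper */
    {
        /* the upstream re-sent a PREPARE we already confirmed */
        report_inconsistent_stream(gid);    /* hypothetical helper */
        return;
    }

    apply_pending_changes(gid);             /* hypothetical helper */
    record_prepared_gid(gid);               /* hypothetical helper */
}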


Regards

Markus




Offline activation of checksums via standby switchover (was: Online checksums patch - once again)

2021-02-10 Thread Michael Banck
Hi,

On Wednesday, 10.02.2021 at 15:06 +0900, Michael Paquier wrote:
> On Tue, Feb 09, 2021 at 10:54:50AM +0200, Heikki Linnakangas wrote:
> > (I may have said this before, but) My overall high-level impression of this
> > patch is that it's really complex for a feature that you use maybe once in
> > the lifetime of a cluster. I'm happy to review but I'm not planning to
> > commit this myself. I don't object if some other committer picks this up
> > (Magnus?).
> 
> I was just looking at the latest patch set as a matter of curiosity,
> and I share that feeling.

I think this would still be a useful feature, not least for the online
deactivation: having to shut down the instance, even for just a few
seconds, is sometimes just not an option in production.

However, there is also the shoot-the-whole-database-into-WAL issue (at
least, that is what happens, AIUI), which has not been discussed that
much either. The patch allows throttling, but I think the impact on
actual production workloads is not very clear yet.

> I think that this is a lot of complication in-core for what would be a
> one-time operation, particularly knowing that there are other ways to
> do it already with the offline checksum tool, even if that is more
> costly: 
> - Involve logical replication after initializing the new instance with
> --data-checksums, or in an upgrade scenario with pg_upgrade.

Logical replication is still somewhat impractical for such a (possibly)
routine task, and I don't understand your pg_upgrade scenario; can you
expand on that a bit?

> - Involve physical replication: stop the standby cleanly, enable
> checksums on it and do a switchover.

I would like to focus on this, so I changed the subject in order not to
derail the online activation patch thread.

If this is something we support, then we should document it.

I have to admit that this possibility escaped me when we first committed
offline (de)activation; it was brought to my attention via
https://twitter.com/samokhvalov/status/1281312586219188224 and the
following discussion.

So if we think this (to recap: shut down the standby, run pg_checksums
on it, start it up again, wait until it is back in sync, then switch
over) is a safe way to activate checksums on a streaming replication
setup, then I think we should document it. However, I have only seen
some hand-waving on this so far and no deeper analysis of what could
possibly go wrong (but doesn't).

Has anybody done further work/tests on this and/or written something up
to contribute to the documentation? Or do we think this is not
appropriate to document? I think once we agree this is safe, it is no
more complicated than the rsync-the-standby-after-pg_upgrade recipe we
did document.

> Another thing we could do is to improve pg_checksums with a parallel
> mode.  The main design question would be how to distribute the I/O,
> and that would mean balancing at least across tablespaces.

Right. I thought about this a while ago, but haven't had time to work on
it so far.
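
As a thought experiment only (nothing like this exists in pg_checksums
today), the distribution could be as simple as interleaving each
tablespace's files across the workers, so that every worker gets a
roughly equal share of each tablespace's I/O:

/*
 * Hypothetical sketch, not actual pg_checksums code: assign the i-th
 * file of each tablespace to worker (i mod nworkers).  Quadratic, but
 * fine for a sketch.
 */
typedef struct FileItem
{
    const char *path;           /* data file path */
    unsigned    tablespace;     /* tablespace identifier */
    int         worker;         /* assigned worker slot */
} FileItem;

static void
assign_workers(FileItem *files, int nfiles, int nworkers)
{
    for (int i = 0; i < nfiles; i++)
    {
        int         seq = 0;

        /* position of this file within its own tablespace */
        for (int j = 0; j < i; j++)
            if (files[j].tablespace == files[i].tablespace)
                seq++;

        files[i].worker = seq % nworkers;
    }
}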


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing Directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following
provisions: https://www.credativ.de/datenschutz





Re: Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-10 Thread Michael Paquier
On Wed, Feb 10, 2021 at 12:58:05AM -0600, Justin Pryzby wrote:
> If I'm not wrong, you meant to ORDER BY attrelid::regclass::text, attnum;

Indeed, I meant that.  Thanks, Justin!
--
Michael




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-10 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow 
> wrote:
> > There's no "second-guessing" involved here.
> > There is no underlying way of dividing up the VALUES data of
> > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > was updated to produce a parallel-plan for the "INSERT...VALUES" case
> > (apart from the fact that spawning off parallel workers to insert that
> > data would almost always result in worse performance than a
> > non-parallel plan...)
> > The division of work for parallel workers is part of the table AM
> > (scan) implementation, which is not invoked for "INSERT...VALUES".
> 
> I don't disagree that the planner would not normally assign a parallel
> path simply to pull values out of a VALUES list mentioned in the
> INSERT command, but deciding something based on the certainty of it in
> an earlier planning phase seems odd to me.  Maybe that's just me
> though.

In terms of competitiveness, Oracle does not run INSERT VALUES in parallel:

https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-6626C70C-876C-47A4-8C01-9B66574062D8

"The INSERT VALUES statement is never executed in parallel."


And SQL Server doesn't either:

https://docs.microsoft.com/en-us/sql/relational-databases/query-processing-architecture-guide?view=sql-server-ver15

"Starting with SQL Server 2016 (13.x) and database compatibility level 130, the 
INSERT … SELECT statement can be executed in parallel when inserting into heaps 
or clustered columnstore indexes (CCI), and using the TABLOCK hint."


Regards
Takayuki Tsunakawa