Re: Index-only scan for "f(x)" without "x"

2020-01-22 Thread Malthe
On Thu, 23 Jan 2020 at 00:00, Tom Lane  wrote:
> The problem is that the planner's initial analysis of the query tree
> concludes that the scan of "tab" has to return "x", because it looks
> through the tree for plain Vars, and "x" is what it's going to find.
> This conclusion is in fact true for any plan that doesn't involve
> scanning a suitable index on f(x), so it's not something we can just
> dispense with.

If f(x) were actually added to the table as a virtual generated column
V, then the planner could perhaps reasonably determine that a particular
expression depends in part on V (rather than on the set of real columns
underlying the virtual generated column).

That is, the planner now knows that it needs to return V = "f(x)"
rather than "x" and so it can match the requirement to our index and
decide to use an index-only scan.
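
To make that concrete (a rough sketch only -- virtual generated columns
do not exist yet, PG 12 only has STORED, and the planner behaviour
described is likewise hypothetical; f is the immutable function from the
original example):

CREATE TABLE tab (
    x int,
    v numeric GENERATED ALWAYS AS (f(x)) VIRTUAL  -- hypothetical syntax
);
CREATE INDEX tab_v ON tab (v);

-- The planner would see the query as depending on V rather than on "x",
-- so an index-only scan on tab_v could satisfy it:
SELECT v FROM tab WHERE v < 1;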

> To back up from that and conclude that the indexscan doesn't really
> need to return "x" because every use of it is in the context "f(x)"
> seems like a pretty expensive proposition, especially once you start
> considering index expressions that are more complex than a single
> function call.  A brute-force matching search could easily be O(N^2)
> or worse in the size of the query tree, which is a lot to pay when
> there's a pretty high chance of failure.  (IOW, we have to expect
> that most queries on "tab" are in fact not going to be able to use
> that index, so we can't afford to spend a lot of planning time just
> because the index exists.)

In the approach outlined above, the set of expressions to match on
would be limited by the set of virtual generated columns defined on
the table — some sort of prefix tree that allows efficient matching on
a given expression syntax tree? There would be no cost to this on a
table that had no virtual generated columns.

The question is whether this is a reasonable take on virtual generated columns.

   ---regards




Re: Run-time pruning for ModifyTable

2020-01-22 Thread Amit Langote
Sorry, I didn't notice this email until now.

On Wed, Nov 27, 2019 at 5:17 PM Michael Paquier  wrote:
> On Tue, Nov 05, 2019 at 04:04:25PM +1300, Thomas Munro wrote:
> > On Thu, Sep 12, 2019 at 10:10 AM Alvaro Herrera
> >  wrote:
> > > Here's a rebased version of this patch (it had a trivial conflict).
> >
> > Hi, FYI partition_prune.sql currently fails (maybe something to do
> > with commit d52eaa09?):
>
> David, perhaps you did not notice that?  For now I have moved this
> patch to next CF waiting on author to look after the failure.
>
> Amit, Kato-san, both of you are marked as reviewers of this patch.
> Are you planning to look at it?

Sorry, I never managed to look at the patch closely.  As I commented
up-thread, the functionality added by this patch would be unnecessary
if we were to move forward with the other project related to UPDATE
and DELETE over inheritance trees:

https://www.postgresql.org/message-id/357.1550612935%40sss.pgh.pa.us

I had volunteered to submit a patch in that thread and even managed to
write one, but didn't get time to get it into good enough shape to post
to the list; for example, I couldn't make it handle foreign child tables.
The gist of the new approach is that ModifyTable will always have
*one* subplan, not N subplans for N target partitions as it does
currently.  Since that one subplan is the same plan one would get if
the query were a SELECT instead of an UPDATE/DELETE, it would
automatically take care of run-time pruning if needed, freeing
ModifyTable itself from having to do it.

Now, the chances of such a big overhaul of how UPDATEs of inheritance
trees are handled getting into PG 13 seem pretty thin even if I post
the patch in a few days, so perhaps it would make sense to get this
patch in so that we can give users run-time pruning for UPDATE/DELETE
in PG 13, provided the code is not very invasive.  If and when the
aforesaid overhaul takes place, that code would go away along with a
lot of other code.
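
(For context, a minimal example of the kind of statement that benefits
from run-time pruning for UPDATE/DELETE; table and partition names are
made up:)

CREATE TABLE orders (region text, amount int) PARTITION BY LIST (region);
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('eu');
CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('us');

PREPARE drop_region(text) AS DELETE FROM orders WHERE region = $1;
-- With a generic plan, the target partition is only known at execution
-- time, so it has to be pruned at run time:
EXECUTE drop_region('eu');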

Thanks,
Amit




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Michael Paquier
On Thu, Jan 23, 2020 at 12:56:41PM +0800, Craig Ringer wrote:
> We could definitely improve on that by exposing a view that integrates
> everything that holds down xmin and catalog_xmin. It'd show
> 
> * the datfrozenxid and datminmxid for the oldest database
>   * if that database is the current database, the relation(s) with the
> oldest relfrozenxid and relminmxid
>   * ... and the catalog relation(s) with the oldest relfrozenxid and
> relminmxid if greater
> * the absolute xid and xid-age positions of entries in pg_replication_slots
> * pg_stat_replication connections (joined to pg_stat_replication if
> connected) with their feedback xmin
> * pg_stat_activity backend_xid and backend_xmin for the backend(s)
> with oldest values; this may be different sets of backends
> * pg_prepared_xacts entries by oldest xid

It seems to me that what you are describing here is a set of
properties good for a monitoring tool that we don't necessarily need
to maintain in core.  There are already tools able to do that in ways
I think are better than what we could ever design, like
check_pgactivity and such.  And there are years of experience behind
that from the maintainers of such scripts and/or extensions.

The argument about Postgres getting more and more complex is true as
the code grows, but I am not really convinced that we need to make it
grow more with more layers that we think are good, because we may
end up piling up stuff which is not actually that good in the long
term.  I'd rather have the core code focus on the basics, with views
that map directly to what we have in memory and/or on disk.
--
Michael




Re: vacuum verbose detail logs are unclear; log at *start* of each stage; show allvisible/frozen/hintbits

2020-01-22 Thread Michael Paquier
On Wed, Jan 22, 2020 at 02:34:57PM +0900, Michael Paquier wrote:
> Shouldn't the last part of the sentence be "we should mark it so"
> instead of "we should so mark it"?  I would rephrase the whole as
> follows:
> "If the all-visible page is all-frozen but not marked as such yet,
> mark it as all-frozen."

Applied this one to HEAD after chewing on it a bit.
--
Michael




Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?

2020-01-22 Thread Michael Paquier
On Wed, Jan 22, 2020 at 11:18:12PM +0100, Daniel Gustafsson wrote:
> Once my plate clears up a bit I will return to this one, but feel free to mark
> it rwf for this cf.

Thanks for the update.  I have switched the patch status to returned
with feedback.
--
Michael




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Craig Ringer
On Thu, 23 Jan 2020 at 01:20, Tom Lane  wrote:
>
> Bruce Momjian  writes:
> > I think the big question is whether we want to make active prepared
> > transactions more visible to administrators, either during server start
> > or idle duration.
>
> There's already the pg_prepared_xacts view ...

I think Bruce has a point here. We shouldn't go around "resolving"
prepared xacts, but the visibility of them is a problem for users.
I've seen that myself quite enough times, even now that they cannot be
used by default.
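
(Even something as simple as the query below is often all that's needed,
but users rarely know to look for it; the age threshold is of course
arbitrary:)

SELECT gid, prepared, owner, database, age(transaction) AS xid_age
  FROM pg_prepared_xacts
 WHERE prepared < now() - interval '1 hour'
 ORDER BY prepared;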

Our monitoring and admin views are not keeping up with Pg's
complexity. Resource retention is one area where that's becoming a
usability and admin challenge. If a user has growing bloat (and has
managed to figure that out, since we don't make it easy to do that
either) or unexpected WAL retention, they may find it hard to quickly
work out why.

We could definitely improve on that by exposing a view that integrates
everything that holds down xmin and catalog_xmin. It'd show

* the datfrozenxid and datminmxid for the oldest database
  * if that database is the current database, the relation(s) with the
oldest relfrozenxid and relminmxid
  * ... and the catalog relation(s) with the oldest relfrozenxid and
relminmxid if greater
* the absolute xid and xid-age positions of entries in pg_replication_slots
* pg_stat_replication connections (joined to pg_stat_replication if
connected) with their feedback xmin
* pg_stat_activity backend_xid and backend_xmin for the backend(s)
with oldest values; this may be different sets of backends
* pg_prepared_xacts entries by oldest xid

... probably sorted by xid age.

It'd be good to expose some internal state too, which would usually
correspond to the oldest values found in the above, but is useful for
cross-checking:

* RecentGlobalXmin and RecentGlobalDataXmin to show the xmin and
catalog_xmin actually used
* procArray->replication_slot_xmin and procArray->replication_slot_catalog_xmin

I'm not sure whether WAL retention (lsn tracking) should be in the
same view or a different one, but I lean toward different.
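
As a very rough sketch, the xid side of such a view could start from a
union along these lines (not exhaustive -- it ignores catalog_xmin and
the per-database/per-relation pieces):

SELECT 'replication slot' AS holder, slot_name::text AS name,
       xmin, age(xmin) AS xmin_age
  FROM pg_replication_slots
 WHERE xmin IS NOT NULL
UNION ALL
SELECT 'backend', pid::text, backend_xmin, age(backend_xmin)
  FROM pg_stat_activity
 WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'prepared xact', gid, transaction, age(transaction)
  FROM pg_prepared_xacts
 ORDER BY xmin_age DESC;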

I already have another TODO kicking around for me to write a view that
generates a blocking locks graph, since pg_locks is really more of a
building block than a directly useful view for admins to understand
the system's state. And if that's not enough I also want to write a
decent bloat-checking view to include in the system views, since IMO
lock-blocking, bloat, and resource retention are real monitoring pain
points right now.
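
(For the lock-blocking part, pg_blocking_pids() already gets part of the
way there; a minimal sketch of the kind of join I mean:)

SELECT blocked.pid   AS blocked_pid,
       blocking.pid  AS blocking_pid,
       blocked.query AS blocked_query
  FROM pg_stat_activity AS blocked
  JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
  JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;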


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise




Re: Error message inconsistency

2020-01-22 Thread Amit Kapila
On Wed, Jan 22, 2020 at 8:48 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > I wonder if we shouldn't be using errtablecol() here instead of (in
> > addition to?) patching the errmsg() to include the table name.
>
> > Discussion: If we really like having the table names in errtable(), then
> > we should have psql display it by default, and other tools will follow
> > suit; in that case we should remove the table name from error messages,
> > or at least not add it to even more messages.
>
> > If we instead think that errtable() is there just for programmatically
> > knowing the affected table, then we should add the table name to all
> > errmsg() where relevant, as in the patch under discussion.
>
> > What do others think?
>
> I believe that the intended use of errtable() and friends is so that
> applications don't have to parse those names out of a human-friendly
> message.  We should add calls to them in cases where (a) an application
> can tell from the SQLSTATE that some particular table is involved
> and (b) it's likely that the app would wish to know which table that is.
> I don't feel a need to sprinkle every single ereport() in the backend
> with errtable(), just ones where there's a plausible use-case for the
> extra cycles that will be spent.
>
> On the other side of the coin, whether we use errtable() is not directly
> a factor in deciding what the human-friendly messages should say.
> I do find it hard to envision a case where we'd want to use errtable()
> and *not* put the table name in the error message, just because if
> applications need to know something then humans probably do too.
>

Makes sense.
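
(For what it's worth, the programmatic route already works from PL/pgSQL
without parsing the message text -- a small sketch, assuming a made-up
table "tab" with a NOT NULL column "x":)

DO $$
DECLARE
    tbl text;
    col text;
BEGIN
    INSERT INTO tab (x) VALUES (NULL);
EXCEPTION WHEN not_null_violation THEN
    GET STACKED DIAGNOSTICS tbl = TABLE_NAME, col = COLUMN_NAME;
    RAISE NOTICE 'violation involving %.%', tbl, col;
END;
$$;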


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-22 Thread Tom Lane
Arthur Zakirov  writes:
> On 2020/01/23 7:11, Tom Lane wrote:
>> Closer examination shows that the "max" argument is pretty bogus as
>> well.  It doesn't do anything except confuse the reader, because there
>> are no cases where the value passed is less than the maximum array entry
>> length, so it never acts to change seq_search's behavior.  So we should
>> just drop that behavior from seq_search, too, and redefine "max" as
>> having no purpose except to specify how much of the string to show in
>> error messages.  There's still a question of what that should be for
>> non-English cases, but at least we now have a clear idea of what we
>> need the value to do.

> Shouldn't we just show the entire remaining string instead of truncating it?

That would avoid a bunch of arbitrary decisions, for sure.
Anybody have an objection?

regards, tom lane




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Craig Ringer
On Wed, 22 Jan 2020 at 23:22, Tom Lane  wrote:
>
> Thomas Kellerer  writes:
> > Tom Lane schrieb am 22.01.2020 um 16:05:
> >> Right.  It's the XA transaction manager's job not to forget uncommitted
> >> transactions.  Reasoning as though no TM exists is not only not very
> >> relevant, but it might lead you to put in features that actually
> >> make the TM's job harder.  In particular, a timeout (or any other
> >> mechanism that leads PG to abort or commit a prepared transaction
> >> of its own accord) does that.
>
> > That's a fair point, but the reality is that not all XA transaction managers
> > do a good job with that.
>
> If you've got a crappy XA manager, you should get a better one, not
> ask us to put in features that make PG unsafe to use with well-designed
> XA managers.

Agreed. Or use some bespoke script that does the cleanup that you
think is appropriate for your particular environment and set of bugs.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise




Re: Error message inconsistency

2020-01-22 Thread Amit Kapila
On Wed, Jan 22, 2020 at 3:23 PM MBeena Emerson  wrote:
>
> > -
> > 14. src/backend/commands/tablecmds.c
> >
> >  5310 else
> >  5311 ereport(ERROR,
> >  5312 (errcode(ERRCODE_CHECK_VIOLATION),
> >  5313  errmsg("partition constraint is violated 
> > by some row")));
> >
> > Added relation name for this error.  This can be verified by below example:
> > Ex:
> > CREATE TABLE list_parted (a int,b char)PARTITION BY LIST (a);
> > CREATE TABLE part_1 (LIKE list_parted);
> > INSERT INTO part_1 VALUES (3, 'a');
> > ALTER TABLE list_parted ATTACH PARTITION part_1 FOR VALUES IN (2);
> >
> > Without patch:
> > ERROR:  partition constraint is violated by some row
> > With patch:
> > ERROR:  partition constraint "part_1" is violated by some row
>
> Here it seems as if "part_1" is the constraint name.
>

I agree.

>  It would be
> better to change it to:
>
> partition constraint is violated by some row in relation "part_1" OR
> partition constraint of relation "part_1" is violated by some row
>

+1 for the second option suggested by Beena.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-01-22 Thread Amit Kapila
On Wed, Jan 22, 2020 at 10:07 PM Alvaro Herrera
 wrote:
>
> I looked at this patchset and it seemed natural to apply 0008 next
> (adding work_mem to subscriptions).
>

I am not so sure whether we need this patch, as the exact scenario
where it can help is not very clear to me, nor has anyone explained
it.  I have raised this concern earlier as well [1].  The point
is that 'logical_decoding_work_mem' applies to the entire
ReorderBuffer on the publisher's side, so how would a parameter from a
particular subscription help with that?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1J%2B3kab6RSZrgj0YiQV1r%2BH3FWVaNjKhWvpEe5-bpZiBw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-22 Thread Arthur Zakirov

On 2020/01/23 7:11, Tom Lane wrote:

Closer examination shows that the "max" argument is pretty bogus as
well.  It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior.  So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages.  There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.


Shouldn't we just show the entire remaining string instead of truncating it?
For example I get the following output:


=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH ');
ERROR:  invalid value "ЯНВА" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for 
this field.


This behavior is reproduced without the patch though (on Postgres 12).

And the English case might confuse too I think:

=# select to_date('3 JANUZRY 1999', 'DD MONTH ');
ERROR:  invalid value "JANUZRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for 
this field.


Users might wonder what the "1" in the "JANUZRY 1" string means.

I attached a draft patch, which shows the entire remaining string. So the
query above will show the following:


=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH ');
ERROR:  invalid value "ЯНВАРЯ 1999" for "MONTH"
DETAIL:  The remaining value did not match any of the allowed values for 
this field.



The 0001 patch attached cleans up all the above complaints.  I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.


+1 on the patch.

--
Arthur
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index ef21b7e426..63e773c4c1 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -2539,17 +2539,11 @@ from_char_seq_search(int *dest, const char **src, const char *const *array,
 
 	if (len <= 0)
 	{
-		char		copy[DCH_MAX_ITEM_SIZ + 1];
-
-		/* Use multibyte-aware truncation to avoid generating a bogus string */
-		max = pg_mbcliplen(*src, strlen(*src), max);
-		strlcpy(copy, *src, max + 1);
-
 		RETURN_ERROR(ereport(ERROR,
 							 (errcode(ERRCODE_INVALID_DATETIME_FORMAT),
 							  errmsg("invalid value \"%s\" for \"%s\"",
-									 copy, node->key->name),
-							  errdetail("The given value did not match any of "
+									 *src, node->key->name),
+							  errdetail("The remaining value did not match any of "
										"the allowed values for this field.")));
 	}
 	*src += len;


Re: Improve search for missing parent downlinks in amcheck

2020-01-22 Thread Alexander Korotkov
On Fri, Jan 17, 2020 at 1:05 AM Tomas Vondra
 wrote:
> On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote:
> >On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:
> >> The revised patch seems to fix all of above.
> >
> >The latest patch is failing to apply.  Please provide a rebase.
>
> This still does not apply (per cputube). Can you provide a fixed patch?

Rebased patch is attached.  Sorry for the huge delay.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


amcheck-btree-improve-missing-parent-downlinks-check-4.patch
Description: Binary data


Re: Index Skip Scan

2020-01-22 Thread Peter Geoghegan
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan  wrote:
> On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
>  wrote:
> > As far as I know, a page split could happen at any random element in the 
> > page. One of the situations we could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1. For non-skip scans this 
> > is not a problem, as we already read all matching elements in our local 
> > buffer and we'll return those. But the skip scan currently:
> > a) checks the lo-key of the page to see if the next prefix can be found on 
> > the leaf page 1
> > b) finds out that this is actually true
> > c) does a search on the page and returns value=4 (while it should have 
> > returned value=6)
> >
> > Peter, is my understanding about the btree internals correct so far?
>
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

I want to be clear about what it means that the page doesn't have a
"low key". Let us once again start with a very simple one leaf-page
btree containing four elements: leaf page 1 = [2,4,6,8] -- just like
in Floris' original page split scenario.

Let us also say that page 1 has a left sibling page -- page 0. Page 0
happens to have a high key with the integer value 0. So you could
*kind of* claim that the "low key" of page 1 is the integer value 0
(page 1 values must be > 0) -- *not* the integer value 2 (the
so-called "low key" here is neither > 2, nor >= 2). More formally, an
invariant exists that says that all values on page 1 must be
*strictly* greater than the integer value 0. However, this formal
invariant thing is hard or impossible to rely on when we actually
reach page 1 and want to know about its lower bound -- since there is
no "low key" pivot tuple on page 1 (we can only speak of a "low key"
as an abstract concept, or something that works transitively from the
parent -- there is only a physical high key pivot tuple on page 1
itself).

Suppose further that Page 0 is now empty, apart from its "all values
on page are <= 0" high key (page 0 must have had a few negative
integer values in its tuples at some point, but not anymore). VACUUM
will delete the page, *changing the effective low key* of Page 0 in
the process. The lower bound from the shared parent page will move
lower/left as a consequence of the deletion of page 0. nbtree page
deletion makes the "keyspace move right, not left". So the "conceptual
low key" of page 1 just went down from 0 to -5 (say), without there
being any practical way of a skip scan reading page 1 noticing the
change (the left sibling of page 0, page -1, has a high key of <= -5,
say).

Not only is it possible for somebody to insert the value 1 in page 1
-- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed
to page 0 is still there -- all that page deletion changed about this
tuple is its downlink, which now points to page 1 instead of page 0.
Confusingly, page deletion removes the pivot tuple of the right
sibling page from the parent -- *not* the pivot tuple of the empty
page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in
truncated pivot tuples, and the heap TID tiebreaker column -- in
reality this would look a bit different because of those factors.

See also: amcheck's bt_right_page_check_scankey() function, which has
a huge comment that reasons about a race involving page deletion. In
general, page deletion is by far the biggest source of complexity when
reasoning about the key space.
-- 
Peter Geoghegan




Re: A rather hackish POC for alternative implementation of WITH TIES

2020-01-22 Thread Andrew Gierth
> "Alvaro" == Alvaro Herrera  writes:

 Alvaro> My own inclination is that Andrew's implementation, being more
 Alvaro> general in nature, would be the better one to have in the
 Alvaro> codebase; but we don't have a complete patch yet. Can we reach
 Alvaro> some compromise such as if Andrew's patch is not completed then
 Alvaro> we push Surafel's?

Mine needs some attention to where exactly in planning the necessary
transformation work should be done; right now the planner part is a
hack, intended to demonstrate the idea (and to let the executor changes
work) rather than actually be the final version. As I mentioned before,
some stuff does need to be pushed out to an InitPlan to make it work
without multiple-evaluation problems.

(A second opinion from another planner expert would be welcome on that
part)

I was largely holding off on doing further work hoping for some
discussion of which way we should go. If you think my approach is worth
pursuing (I haven't seriously tested the performance, but I'd expect it
to be slower than Surafel's - the price you pay for flexibility) then I
can look at it further, but figuring out the planner stuff will take
some time.

-- 
Andrew.




Re: error context for vacuum to include block number

2020-01-22 Thread Justin Pryzby
On Mon, Jan 20, 2020 at 11:11:20AM -0800, Andres Freund wrote:
> > @@ -966,8 +986,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
> >  			/* Work on all the indexes, then the heap */
> > +			/* Don't use the errcontext handler outside this function */
> > +			error_context_stack = errcallback.previous;
> >  			lazy_vacuum_all_indexes(onerel, Irel, indstats,
> >  									vacrelstats, lps, nindexes);
> > +			error_context_stack = &errcallback;
> 
> Alternatively we could push another context for each index inside
> lazy_vacuum_all_indexes(). There's been plenty bugs in indexes
> triggering problems, so that could be worthwhile.

Is the callback for index vacuum useful without a block number?

FYI, I have another patch which would add DEBUG output before each stage, which
would be just as much information, and without needing to use a callback.
It's 0004 here:

https://www.postgresql.org/message-id/20200121134934.GY26045%40telsasoft.com
@@ -1752,9 +1753,12 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 	{
 		int			idx;
 
-		for (idx = 0; idx < nindexes; idx++)
+		for (idx = 0; idx < nindexes; idx++) {
+			ereport(DEBUG1, (errmsg("\"%s\": vacuuming index",
+									RelationGetRelationName(Irel[idx]))));
 			lazy_vacuum_index(Irel[idx], [idx], vacrelstats->dead_tuples,





Re: Index Skip Scan

2020-01-22 Thread Peter Geoghegan
On Wed, Jan 22, 2020 at 1:35 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Oh, that's what you mean. Yes, I was somehow tricked by the name of this
> function and didn't notice that it checks only one boundary, so in case
> of backward scan it returns wrong result. I think in the situation
> you've describe it would actually not find any item on the current page
> and restart from the root, but nevertheless we need to check for both
> keys in _bt_scankey_within_page.

I suggest reading the nbtree README file's description of backwards
scans. Read the paragraph that begins with 'We support the notion of
an ordered "scan" of an index...'. I also suggest that you read a bit
of the stuff in the large section on page deletion. Certainly read the
paragraph that begins with 'Moving left in a backward scan is
complicated because...'.

It's important to grok why it's okay that we don't "couple" or "crab"
buffer locks as we descend the tree with Lehman & Yao's design -- we
can get away with having *no* interlock against page splits (e.g.,
pin, buffer lock) when we are "between" levels of the tree. This is
safe, since the page that we land on must still be "substantively the
same page", no matter how much time passes. That is, it must at least
cover the leftmost portion of the keyspace covered by the original
version of the page that we saw that we needed to descend to within
the parent page. The worst that can happen is that we have to recover
from a concurrent page split by moving right one or more times.
(Actually, page deletion can change the contents of a page entirely,
but that's not really an exception to the general rule -- page
deletion is careful about recycling pages that an in flight index scan
might land on.)

Lehman & Yao don't have backwards scans (or left links, or page
deletion). Unlike nbtree. This is why the usual Lehman & Yao
guarantees don't quite work with backward scans. We must therefore
compensate as described by the README file (basically, we check and
re-check for races, possibly returning to the original page when we
think that we might have overlooked something and need to make sure).
It's an exception to the general rule, you could say.


--
Peter Geoghegan




Re: Index-only scan for "f(x)" without "x"

2020-01-22 Thread Tom Lane
Malthe  writes:
> Referencing the example given in the documentation for index-only
> scans [0], we consider an index:
> CREATE INDEX tab_f_x ON tab (f(x));
> This index currently will not be used for an index-only scan for the
> following query since the planner isn't smart enough to know that "x"
> is not needed:
> SELECT f(x) FROM tab WHERE f(x) < 1;
> However, any function applied to a column for an index expression is
> required to be immutable so as far as I can tell the planner doesn't
> have to be very smart to know that the index can indeed be used for an
> index-only scan (without having "x" included).

The problem is that the planner's initial analysis of the query tree
concludes that the scan of "tab" has to return "x", because it looks
through the tree for plain Vars, and "x" is what it's going to find.
This conclusion is in fact true for any plan that doesn't involve
scanning a suitable index on f(x), so it's not something we can just
dispense with.

To back up from that and conclude that the indexscan doesn't really
need to return "x" because every use of it is in the context "f(x)"
seems like a pretty expensive proposition, especially once you start
considering index expressions that are more complex than a single
function call.  A brute-force matching search could easily be O(N^2)
or worse in the size of the query tree, which is a lot to pay when
there's a pretty high chance of failure.  (IOW, we have to expect
that most queries on "tab" are in fact not going to be able to use
that index, so we can't afford to spend a lot of planning time just
because the index exists.)

> What's required in order to move forward on these capabilities?

Some non-brute-force solution to the above problem.

regards, tom lane




Index-only scan for "f(x)" without "x"

2020-01-22 Thread Malthe
Referencing the example given in the documentation for index-only
scans [0], we consider an index:

CREATE INDEX tab_f_x ON tab (f(x));

This index currently will not be used for an index-only scan for the
following query since the planner isn't smart enough to know that "x"
is not needed:

SELECT f(x) FROM tab WHERE f(x) < 1;

However, any function applied to a column for an index expression is
required to be immutable so as far as I can tell the planner doesn't
have to be very smart to know that the index can indeed be used for an
index-only scan (without having "x" included).

One interesting use-case for this is to be able to create
space-efficient indexes for raw log data. For example, for each type
of message (which might be encoded as JSON), one could create a
partial index with the relevant fields extracted and converted into
native data types and use index-only scanning to query. This is not
particularly attractive today because the message itself would need to
be added to the index, effectively duplicating the log data.
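
For instance (a sketch only -- the field names are invented):

CREATE TABLE rawlog (msg jsonb);

CREATE INDEX rawlog_login_idx ON rawlog (
    ((msg->>'user_id')::int),
    (msg->>'session')
) WHERE msg->>'type' = 'login';

-- Ideally an index-only scan candidate, but today the planner still
-- concludes it needs "msg", which is exactly the limitation described:
SELECT (msg->>'user_id')::int, msg->>'session'
  FROM rawlog
 WHERE msg->>'type' = 'login' AND (msg->>'user_id')::int = 42;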

In the same vein, being able to add this auxiliary data (which is
basically immutable expressions on one or more columns) explicitly
using INCLUDE would make the technique actually reliable. This is not
possible right now since expressions are not supported as included
columns.

What's required in order to move forward on these capabilities?

[0] https://www.postgresql.org/docs/current/indexes-index-only-scans.html




Re: Index Skip Scan

2020-01-22 Thread Peter Geoghegan
On Wed, Jan 22, 2020 at 1:09 PM Floris Van Nee  wrote:
> The loose index scan shouldn't return a tuple twice. It should only be able 
> to skip 'further', so that shouldn't be a problem. Out of curiosity, why 
> can't index scans return the same tuple twice? Is there something in the 
> executor that isn't able to handle this?

I have no reason to believe that the executor has a problem with index
scans that return a tuple more than once, aside from the very obvious:
in general, that will often be wrong. It might not be wrong when the
scan happens to be input to a unique node anyway, or something like
that.

I'm not particularly concerned about it. Just wanted to be clear on
our assumptions for loose index scans -- if loose index scans were
allowed to return a tuple more than once, then that would at least
have to be considered in the wider context of the executor
(but apparently they're not, so no need to worry about it). This may
have been mentioned somewhere already. If it is then I must have
missed it.

-- 
Peter Geoghegan




Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?

2020-01-22 Thread Daniel Gustafsson
> On 26 Nov 2019, at 06:44, Michael Paquier  wrote:

Re this patch being in WoA state for some time [0]:

> The regression tests of contrib/test_decoding are still failing here:
> +ERROR:  could not resolve cmin/cmax of catalog tuple

This is the main thing left with this patch, and I've been unable so far to
figure it out.  I have an unscientific hunch that this patch is shaking out
something (previously unexercised) in the logical decoding code supporting
multi-inserts in the catalog.  If anyone has ideas or insights, I would love
the help.

Once my plate clears up a bit I will return to this one, but feel free to mark
it rwf for this cf.

cheers ./daniel

[0] postgr.es/m/20200121144937.n24oacjkegu4pnpe@development



Re: Allow to_date() and to_timestamp() to accept localized names

2020-01-22 Thread Tom Lane
I wrote:
> One thing I completely don't understand is why it's sufficient for
> seq_search_localized to do "initcap" semantics.  Shouldn't it cover
> the all-upper and all-lower cases as well, as the existing seq_search
> code does?  (That is, why is it okay to ignore the "type" argument?)

I took a closer look at this and decided you were probably doing that
just because the seq_search code uses initcap-style case folding to
match month and day names, relying on the assumption that the constant
arrays it's passed are cased that way.  But we shouldn't (and the patch
doesn't) assume that the localized names we'll get from pg_locale.c are
cased that way.

However ... it seems to me that the way seq_search is defined is
plenty bogus.  In the first place, it scribbles on its source string,
which is surely not okay.  You can see that in error messages that
are thrown on match failures:

regression=# select to_date('3 JANUZRY 1999', 'DD MONTH ');
ERROR:  invalid value "JanuzRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for this field.

Now, maybe somebody out there thinks this is a cute way of highlighting
how much of the string we examined, but it looks more like a bug from
where I sit.  It's mere luck that what's being clobbered is a local
string copy created by do_to_timestamp(), and not the original input
data passed to to_date().

In the second place, there's really no need at all for seq_search to
rely on any assumptions about the case state of the array it's
searching.  pg_toupper() is pretty cheap already, and we can make it
guaranteed cheap if we use pg_ascii_toupper() instead.  So I think
we ought to just remove the "type" argument altogether and have
seq_search dynamically convert all the strings it examines to upper
case (or lower case would work as well, at least for English).

> On the other hand, you probably *should* be ignoring the "max"
> argument, because AFAICS the values that are passed for that are
> specific to the English ASCII spellings of the day and month names.
> It seems very likely that they could be too small for some sets of
> non-English names.

Closer examination shows that the "max" argument is pretty bogus as
well.  It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior.  So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages.  There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

I also noted while I was looking at it that from_char_seq_search()
would truncate at "max" bytes even when dealing with multibyte input.
That has a clear risk of generating an invalidly-encoded error message.

The 0001 patch attached cleans up all the above complaints.  I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.

I'm tempted to go a bit further than 0001 does, and remove the 'max'
argument from from_char_seq_search() altogether.  Since we only need
'max' in error cases, which don't need to be super-quick, we could drop
the requirement for the caller to specify that and instead compute it
when we do need it as the largest of the constant array's string
lengths.  That'd carry over into the localized-names case in a
reasonably straightforward way (though we might need to count characters
not bytes for that to work nicely).

Because of the risk of incorrectly-encoded error messages, I'm rather
tempted to claim that these patches (or at least 0001) are a bug fix
and should be back-patched.  In any case I think we should apply this
to HEAD and then rebase the localized-names patch over it.  It makes
a whole lot more sense, IMO, for the localized-names comparison logic
to match what this is doing than what seq_search does today.

Comments?

regards, tom lane

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index ca3c48d..3ef10dc 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -317,10 +317,6 @@ static const char *const numth[] = {"st", "nd", "rd", "th", NULL};
  * Flags & Options:
  * --
  */
-#define ONE_UPPER		1		/* Name */
-#define ALL_UPPER		2		/* NAME */
-#define ALL_LOWER		3		/* name */
-
 #define MAX_MONTH_LEN	9
 #define MAX_MON_LEN		3
 #define MAX_DAY_LEN		9
@@ -1068,9 +1064,9 @@ static int	from_char_parse_int_len(int *dest, char **src, const int len,
 	FormatNode *node, bool *have_error);
 static int	from_char_parse_int(int *dest, char **src, FormatNode *node,
 bool *have_error);
-static int	seq_search(char *name, const char *const *array, int type, int 

Re: Online checksums patch - once again

2020-01-22 Thread Robert Haas
On Wed, Jan 22, 2020 at 3:28 PM Magnus Hagander  wrote:
> > I think the argument about adding catalog flags adding overhead is
> > pretty much bogus. Fixed-width fields in catalogs are pretty cheap.
>
> If that's the general view, then yeah our "cost calculations" were
> off. I guess I may have been colored by the cost of adding statistics
> counters, and had that influence the thinking. Incorrect judgement on
> that cost certainly contributed to the decision back then.

For either statistics or for pg_class, the amount of data that we have
to manage is proportional to the number of relations (which could be
big) multiplied by the data stored for each relation. But the
difference is that the stats file has to be rewritten, at least on a
per-database basis, very frequently, while pg_class goes through
shared-buffers and so doesn't provoke the same stupid
write-the-whole-darn-thing behavior. That is a pretty key difference,
IMHO.

Now, it would be nice to fix the stats system, but until we do, here we are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: A rather hackish POC for alternative implementation of WITH TIES

2020-01-22 Thread Alvaro Herrera
Hello

As this is a valuable feature, it would be good to have something happen
here.  I wouldn't like to have pg13 ship with no implementation of WITH
TIES at all.

My own inclination is that Andrew's implementation, being more general
in nature, would be the better one to have in the codebase; but we don't
have a complete patch yet.  Can we reach some compromise such as if
Andrew's patch is not completed then we push Surafel's?
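
(For reference, the feature in question is the SQL-standard WITH TIES
variant of FETCH FIRST; with a made-up table:

SELECT * FROM scores ORDER BY points DESC
FETCH FIRST 3 ROWS WITH TIES;

which returns the top three rows plus any further rows tied with the
third on "points".)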

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Fix crash in BRIN inclusion op functions, due to missing datum c

2020-01-22 Thread Alvaro Herrera
On 2020-Jan-20, Heikki Linnakangas wrote:

> That case arises at least with the range_union() function, when one of
> the arguments is an 'empty' range:
> 
> CREATE TABLE brintest (n numrange);
> CREATE INDEX brinidx ON brintest USING brin (n);
> INSERT INTO brintest VALUES ('empty');
> INSERT INTO brintest VALUES (numrange(0, 2^1000::numeric));
> INSERT INTO brintest VALUES ('(-1, 0)');
> 
> SELECT brin_desummarize_range('brinidx', 0);
> SELECT brin_summarize_range('brinidx', 0);

I noticed that this test increases line-wise coverage of
brin_inclusion.c by a few percentage points, so I added it.

Again, thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Index Skip Scan

2020-01-22 Thread Dmitry Dolgov
> On Wed, Jan 22, 2020 at 09:08:59PM +, Floris Van Nee wrote:
> > Note in particular that index scans cannot return the same index tuple 
> > twice -
> > - processing a page at a time ensures that that cannot happen.
> >
> > Can a loose index scan return the same tuple (i.e. a tuple with the same 
> > heap
> > TID) to the executor more than once?
> >
>
> The loose index scan shouldn't return a tuple twice. It should only be able 
> to skip 'further', so that shouldn't be a problem.

Yes, it shouldn't happen.




Re: Index Skip Scan

2020-01-22 Thread Dmitry Dolgov
> On Wed, Jan 22, 2020 at 05:24:43PM +, Floris Van Nee wrote:
>
> > > Anyone please correct me if I'm wrong, but I think one case where the 
> > > current patch relies on some data from the page it has locked before it 
> > > in checking this hi/lo key. I think it's possible for the following 
> > > sequence to happen. Suppose we have a very simple one leaf-page btree 
> > > containing four elements: leaf page 1 = [2,4,6,8]
> > > We do a backwards index skip scan on this and have just returned our 
> > > first tuple (8). The buffer is left pinned but unlocked. Now, someone 
> > > else comes in and inserts a tuple (value 5) into this page, but suppose 
> > > the page happens to be full. So a page split occurs. As far as I know, a 
> > > page split could happen at any random element in the page. One of the 
> > > situations we could be left with is:
> > > Leaf page 1 = [2,4]
> > > Leaf page 2 = [5,6,8]
> > > However, our scan is still pointing to leaf page 1.
> 
> > In case if we just returned a tuple, the next action would be either
> >> check the next page for another key or search down to the tree. Maybe
> 
> But it won't look at the 'next page for another key', but rather at the 'same 
> page or another key', right? In the _bt_scankey_within_page shortcut we're 
> taking, there's no stepping to a next page involved. It just locks the page 
> again that it previously also locked.

Yep, it would look only on the same page. Not sure what you mean by
"another key"; if the current key is not found within the current page
at the first stage, we restart from the root.

> > I'm missing something in your scenario, but the latter will land us on a
> > required page (we do not point to any leaf here), and before the former
> > there is a check for high/low key. Is there anything else missing?
> 
> Let me try to clarify. After we return the first tuple, so->currPos.buf is 
> pointing to page=1 in my example (it's the only page after all). We've 
> returned item=8. Then the split happens and the items get rearranged as in my 
> example. We're still pointing with so->currPos.buf to page=1, but the page 
> now contains [2,4]. The split happened to the right, so there's a page=2 with 
> [5,6,8], however the ongoing index scan is unaware of that.
> Now _bt_skip gets called to fetch the next tuple. It starts by checking 
> _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the 
> result of which will be 'true': we're comparing the skip key to the low key 
> of the page. So it thinks the next key can be found on the current page. It 
> locks the page and does a _binsrch, finding item=4 to be returned.
> 
> The problem here is that _bt_scankey_within_page mistakenly returns true, 
> thereby limiting the search to just the page that it's pointing to already.
> It may be fine to just fix this function to return the proper value (I guess 
> it'd also need to look at the high key in this example). It could also be 
> fixed by not looking at the lo/hi key of the page, but to use the local tuple 
> buffer instead. We already did a _read_page once, so if we have any matching 
> tuples on that specific page, we have them locally in the buffer already. 
> That way we never need to lock the same page twice.

Oh, that's what you mean. Yes, I was somehow tricked by the name of this
function and didn't notice that it checks only one boundary, so in case
of backward scan it returns wrong result. I think in the situation
you've describe it would actually not find any item on the current page
and restart from the root, but nevertheless we need to check for both
keys in _bt_scankey_within_page.




RE: Index Skip Scan

2020-01-22 Thread Floris Van Nee
> Note in particular that index scans cannot return the same index tuple twice -
> - processing a page at a time ensures that that cannot happen.
> 
> Can a loose index scan return the same tuple (i.e. a tuple with the same heap
> TID) to the executor more than once?
> 

The loose index scan shouldn't return a tuple twice. It should only be able to 
skip 'further', so that shouldn't be a problem. Out of curiosity, why can't 
index scans return the same tuple twice? Is there something in the executor 
that isn't able to handle this?

-Floris



Re: [PATCH] Windows port, fix some resources leaks

2020-01-22 Thread Ranier Vilela
Hi,
After reviewing the patches, building everything, and running the
regression checks for each patch, these are the ones that don't break
anything.
Not all leaks detected by Coverity are fixed.

regards,
Ranier Vilela


auth_leak.patch
Description: Binary data


logging_leaks.patch
Description: Binary data


postmaster_leak.patch
Description: Binary data


regex_leak.patch
Description: Binary data


restricted_token_leaks.patch
Description: Binary data


win32_shmem_leak.patch
Description: Binary data


Re: Online checksums patch - once again

2020-01-22 Thread Magnus Hagander
On Wed, Jan 22, 2020 at 12:20 PM Robert Haas  wrote:
>
> On Wed, Jan 22, 2020 at 2:50 PM Magnus Hagander  wrote:
> > The reasoning that led us to *not* doing that is that it's a one-off
> > operation. That along with the fact that we hope to at some point be
> > able to change the default to checksums on (and it wouldn't be
> > necessary for the transition on->off as that is very fast), it would
> > become an increasingly rare one-off operation. And by adding these
> > flags to the catalogs, everybody is paying the overhead for this
> > one-off rare operation. Another option would be to add the flag on the
> > pg_database level which would decrease the overhead, but our guess was
> > that this would also decrease the usefulness in most cases to make it
> > not worth it (most people with big databases don't have many big
> > databases in the same cluster -- it's usually just one or two, so in
> > the end the results would be more or less the same as we have now, as
> > it would have to keep re-doing the big ones)
>
> I understand, but the point for me is that the patch does not seem
> robust as written. Nobody's going to be happy if there are reasonably
> high-probability scenarios where it turns checksums part way on and
> then just stops. Now, that can probably be improved to some degree
> without adding catalog flags, but I bet it can be improved more and
> for less effort if we do add catalog flags. Maybe being able to
> survive a cluster restart without losing track of progress is not a
> hard requirement for this feature, but it certainly seems nice. And I

It's certainly nice, but that is of course a cost/benefit tradeoff
calculation. Our thinking on that was that the cost was higher than
the benefit -- which may of course be wrong, and in that case it's
better to have it changed.


> would venture to say that continuing to run without giving up if there
> happen to be no background workers available for a while IS a hard
> requirement, because that can easily happen due to normal use of
> parallel query. We do not normally commit features if, without any
> error occurring, they might just give up part way through the
> operation.

That part I agree with, but I don't think that in itself requires
per-relation level tracking.


> I think the argument about adding catalog flags adding overhead is
> pretty much bogus. Fixed-width fields in catalogs are pretty cheap.

If that's the general view, then yeah our "cost calculations" were
off. I guess I may have been colored by the cost of adding statistics
counters, and had that influence the thinking. Incorrect judgement on
that cost certainly contributed to the decision back then.

But as noted, work is being done on adding it, so let's see what that
ends up looking like in reality.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Online checksums patch - once again

2020-01-22 Thread Robert Haas
On Wed, Jan 22, 2020 at 2:50 PM Magnus Hagander  wrote:
> The reasoning that led us to *not* doing that is that it's a one-off
> operation. That along with the fact that we hope to at some point be
> able to change the default to checksums on (and it wouldn't be
> necessary for the transition on->off as that is very fast), it would
> become an increasingly rare one-off operation. And by adding these
> flags to the catalogs, everybody is paying the overhead for this
> one-off rare operation. Another option would be to add the flag on the
> pg_database level which would decrease the overhead, but our guess was
> that this would also decrease the usefulness in most cases to make it
> not worth it (most people with big databases don't have many big
> databases in the same cluster -- it's usually just one or two, so in
> the end the results would be more or less the same as we have now, as
> it would have to keep re-doing the big ones)

I understand, but the point for me is that the patch does not seem
robust as written. Nobody's going to be happy if there are reasonably
high-probability scenarios where it turns checksums part way on and
then just stops. Now, that can probably be improved to some degree
without adding catalog flags, but I bet it can be improved more and
for less effort if we do add catalog flags. Maybe being able to
survive a cluster restart without losing track of progress is not a
hard requirement for this feature, but it certainly seems nice. And I
would venture to say that continuing to run without giving up if there
happen to be no background workers available for a while IS a hard
requirement, because that can easily happen due to normal use of
parallel query. We do not normally commit features if, without any
error occurring, they might just give up part way through the
operation.

I think the argument about adding catalog flags adding overhead is
pretty much bogus. Fixed-width fields in catalogs are pretty cheap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: making the backend's json parser work in frontend code

2020-01-22 Thread Alvaro Herrera
On 2020-Jan-22, Robert Haas wrote:

> On Wed, Jan 22, 2020 at 2:26 PM Alvaro Herrera  
> wrote:
> > I'm not sure I see the point of keeping json.h split from jsonapi.h.  It
> > seems to me that you could move back all the contents from jsonapi.h
> > into json.h, and everything would work just as well.  (Evidently the
> > Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting
> > that prototype in jsonfuncs.h would work.)
> >
> > I don't really object to your 0001 patch as posted, though.
> 
> The goal is to make it possible to use the JSON parser in the
> frontend, and we can't do that if the header files that would have to
> be included on the frontend side rely on things that only work in the
> backend. As written, the patch series leaves json.h with a dependency
> on Datum, so the stuff that it leaves in jsonapi.h (which is intended
> to be the header that gets moved to src/common and included by
> frontend code) can't be merged with it.

Right, I agree with that goal, and as I said, I don't object to your
patch as posted.

> Now, we could obviously rearrange that. I don't think any of the file
> naming here is great. But I think we probably want, as far as
> possible, for the code in FOO.c to correspond to the prototypes in
> FOO.h. What I'm thinking we should work towards is:
> 
> json.c/h - support for the 'json' data type
> jsonb.c/h - support for the 'jsonb' data type
> jsonfuncs.c/h - backend code that doesn't fit in either of the above
> jsonapi.c/h - lexing/parsing code that can be used in either the
> frontend or the backend

... it would probably require more work to make this 100% attainable,
but I don't really care all that much.

> I'm not wedded to that. It just looks like the most natural thing from
> where we are now.

Let's go with it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Amcheck: do rightlink verification with lock coupling

2020-01-22 Thread Peter Geoghegan
On Fri, Jan 17, 2020 at 5:43 PM Peter Geoghegan  wrote:
> I tried to come up with a specific example of how this could be
> unsafe, but my explanation was all over the place (this could have had
> something to do with it being Friday evening). Even still, it's up to
> the patch to justify why it's safe, and that seems even more
> difficult.

I can't see a way around this problem, so I'm marking the patch rejected.

Thanks
-- 
Peter Geoghegan




Re: making the backend's json parser work in frontend code

2020-01-22 Thread Robert Haas
On Wed, Jan 22, 2020 at 2:26 PM Alvaro Herrera  wrote:
> I'm not sure I see the point of keeping json.h split from jsonapi.h.  It
> seems to me that you could move back all the contents from jsonapi.h
> into json.h, and everything would work just as well.  (Evidently the
> Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting
> that prototype in jsonfuncs.h would work.)
>
> I don't really object to your 0001 patch as posted, though.

The goal is to make it possible to use the JSON parser in the
frontend, and we can't do that if the header files that would have to
be included on the frontend side rely on things that only work in the
backend. As written, the patch series leaves json.h with a dependency
on Datum, so the stuff that it leaves in jsonapi.h (which is intended
to be the header that gets moved to src/common and included by
frontend code) can't be merged with it.

Now, we could obviously rearrange that. I don't think any of the file
naming here is great. But I think we probably want, as far as
possible, for the code in FOO.c to correspond to the prototypes in
FOO.h. What I'm thinking we should work towards is:

json.c/h - support for the 'json' data type
jsonb.c/h - support for the 'jsonb' data type
jsonfuncs.c/h - backend code that doesn't fit in either of the above
jsonapi.c/h - lexing/parsing code that can be used in either the
frontend or the backend

I'm not wedded to that. It just looks like the most natural thing from
where we are now.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Online checksums patch - once again

2020-01-22 Thread Magnus Hagander
 On Mon, Jan 20, 2020 at 12:14 PM Robert Haas  wrote:
>
> On Sat, Jan 18, 2020 at 6:18 PM Daniel Gustafsson  wrote:
> > Thanks again for reviewing (and working on the infrastructure required for 
> > this
> > patch to begin with)!  Regarding the persisting the progress; that would be 
> > a
> > really neat feature but I don't have any suggestion on how to do that safely
> > for real use-cases.
>
> Leaving to one side the question of how much work is involved, could
> we do something conceptually similar to relfrozenxid/datfrozenxid,
> i.e. use catalog state to keep track of which objects have been
> handled and which not?
>
> Very rough sketch:
>
> * set a flag indicating that checksums must be computed for all page writes
> * use barriers and other magic to make sure everyone has gotten the
> memo from the previous step
> * use new catalog fields pg_class.relhaschecksums and
> pg_database.dathaschecksums to track whether checksums are enabled
> * keep launching workers for databases where !pg_class.dathaschecksums
> until none remain
> * mark checksums as fully enabled
> * party

We did discuss this back when we started work on this (I can't
remember if it was just me and Daniel and someone else or on a list --
but that's not important atm).

The reasoning that led us to *not* doing that is that it's a one-off
operation. Given that we hope to at some point be able to change the
default to checksums on (and it wouldn't be necessary for the
transition on->off as that is very fast), it would become an
increasingly rare one-off operation. And by adding these flags to the
catalogs, everybody is paying the overhead for this one-off rare
operation. Another option would be to add the flag at the pg_database
level, which would decrease the overhead, but our guess was that this
would also decrease the usefulness enough in most cases to make it not
worth it (most people with big databases don't have many big databases
in the same cluster -- it's usually just one or two, so in the end the
results would be more or less the same as we have now, since it would
have to keep re-doing the big ones).

Unless we actually want to support running systems more or less
permanently with some tables with checksums and other tables without
checksums. But that's going to have an effect on the validation of
checksums that would generate a huge overhead (since each buffer check
would have to look up the pg_class entry).

FYI, Daniel is working on an update that will include this -- so we
can see what the actual outcome of it is in the case of complexity as
well. Should hopefully be ready soon.
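
For reference, a minimal sketch of what monitoring the sketched catalog
flags might look like, if they existed (relhaschecksums and
dathaschecksums are taken from the outline quoted above; they are not
real catalog columns today):

-- Hypothetical queries; these columns exist only in the proposal.
SELECT datname, dathaschecksums FROM pg_database ORDER BY datname;

SELECT count(*) FILTER (WHERE relhaschecksums)     AS relations_done,
       count(*) FILTER (WHERE NOT relhaschecksums) AS relations_remaining
FROM pg_class;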

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: making the backend's json parser work in frontend code

2020-01-22 Thread Alvaro Herrera
On 2020-Jan-22, Robert Haas wrote:

> Here is a new version that is, I think, much closer what I would
> consider a final form. 0001 through 0003 are as before, and unless
> somebody says pretty soon that they see a problem with those or want
> more time to review them, I'm going to commit them; David Steele has
> endorsed all three, and they seem like independently sensible
> cleanups.

I'm not sure I see the point of keeping json.h split from jsonapi.h.  It
seems to me that you could move back all the contents from jsonapi.h
into json.h, and everything would work just as well.  (Evidently the
Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting
that prototype in jsonfuncs.h would work.)

I don't really object to your 0001 patch as posted, though.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] kqueue

2020-01-22 Thread Tom Lane
I wrote:
> This just says it doesn't lock up, of course.  I've not attempted
> any performance-oriented tests.

I've now done some light performance testing -- just stuff like
pgbench -S -M prepared -c 20 -j 20 -T 60 bench

I cannot see any improvement on either FreeBSD 12 or NetBSD 8.1,
either as to net TPS or as to CPU load.  If anything, the TPS
rate is a bit lower with the patch, though I'm not sure that
that effect is above the noise level.

It's certainly possible that to see any benefit you need stress
levels above what I can manage on the small box I've got these
OSes on.  Still, it'd be nice if a performance patch could show
some improved performance, before we take any portability risks
for it.

regards, tom lane




Re: Index Skip Scan

2020-01-22 Thread Peter Geoghegan
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan  wrote:
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

Note in particular that index scans cannot return the same index tuple
twice -- processing a page at a time ensures that that cannot happen.

Can a loose index scan return the same tuple (i.e. a tuple with the
same heap TID) to the executor more than once?

-- 
Peter Geoghegan




Re: Index Skip Scan

2020-01-22 Thread Peter Geoghegan
On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
 wrote:
> Anyone please correct me if I'm wrong, but I think one case where the current 
> patch relies on some data from the page it has locked before it in checking 
> this hi/lo key. I think it's possible for the following sequence to happen. 
> Suppose we have a very simple one leaf-page btree containing four elements: 
> leaf page 1 = [2,4,6,8]
> We do a backwards index skip scan on this and have just returned our first 
> tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in 
> and inserts a tuple (value 5) into this page, but suppose the page happens to 
> be full. So a page split occurs. As far as I know, a page split could happen 
> at any random element in the page. One of the situations we could be left 
> with is:
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1. For non-skip scans this 
> is not a problem, as we already read all matching elements in our local 
> buffer and we'll return those. But the skip scan currently:
> a) checks the lo-key of the page to see if the next prefix can be found on 
> the leaf page 1
> b) finds out that this is actually true
> c) does a search on the page and returns value=4 (while it should have 
> returned value=6)
>
> Peter, is my understanding about the btree internals correct so far?

This is a good summary. This is the kind of scenario I had in mind
when I expressed a general concern about "stopping between pages".
Processing a whole page at a time is a crucial part of how
_bt_readpage() currently deals with concurrent page splits.

Holding a buffer pin on a leaf page is only effective as an interlock
against VACUUM completely removing a tuple, which could matter with
non-MVCC scans.

> Now that I look at the patch again, I fear there currently may also be such a 
> dependency in the "Advance forward but read backward"-case. It saves the 
> offset number of a tuple in a variable, then does a _bt_search (releasing the 
> lock and pin on the page). At this point, anything can happen to the tuples 
> on this page - the page may be compacted by vacuum such that the offset 
> number you have in your variable does not match the actual offset number of 
> the tuple on the page anymore. Then, at the check for (nextOffset == 
> startOffset) later, there's a possibility the offsets are different even 
> though they relate to the same tuple.

If skip scan is restricted to heapkeyspace indexes (i.e. those created
on Postgres 12+), then it might be reasonable to save an index tuple,
and relocate it within the same page using a fresh binary search that
uses a scankey derived from the same index tuple -- without unsetting
scantid/the heap TID scankey attribute. I suppose that you'll need to
"find your place again" after releasing the buffer lock on a leaf page
for a time. Also, I think that this will only be safe with MVCC scans,
because otherwise the page could be concurrently deleted by VACUUM.

-- 
Peter Geoghegan




Re: Index Skip Scan

2020-01-22 Thread Peter Geoghegan
On Tue, Jan 21, 2020 at 9:06 AM Jesper Pedersen
 wrote:
> If you apply the attached patch on master it will fail the test suite;
> did you mean something else ?

Yeah, this is exactly what I had in mind for the _bt_readpage() assertion.

As I said, it isn't a great sign that this kind of assertion is even
necessary in index access method code (code like bufmgr.c is another
matter). Usually it's just obvious that a buffer lock is held. I can't
really blame this patch for that, though. You could say the same thing
about the existing "buffer pin held" _bt_readpage() assertion. It's
good that it verifies what is actually a fragile assumption, even
though I'd prefer to not make a fragile assumption.

-- 
Peter Geoghegan




Re: Duplicate Workers entries in some EXPLAIN plans

2020-01-22 Thread Georgios Kokolatos
>> +   int num_workers = planstate->worker_instrument->num_workers;
>> +   int n;
>> +   worker_strs = (StringInfo *) palloc0(num_workers * 
>> sizeof(StringInfo));
>> +   for (n = 0; n < num_workers; n++) {
>>
>> I think C99 would be better here. Also no parenthesis needed.
>
>
> Pardon my C illiteracy, but what am I doing that's not valid C99 here?

My bad, I should have been clearer. I meant that it is preferable to use
the C99 style, which allows declaring variables in the scope where you
need them. In this case, 'n' is needed only in the for loop, so something like

for (int n = 0; n < num_workers; n++) 

is preferable. To be clear, your code was perfectly valid. It was only the
style I was referring to.

>> +   for (n = 0; n < w->num_workers; ++n)
>>
>> I think C99 would be better here.
>
>
> And here (if it's not the same problem)?

Exactly the same as above. 

>> int indent; /* current indentation level */
>> List   *grouping_stack; /* format-specific grouping state */
>> +   boolprint_workers;  /* whether current node has worker metadata 
>> */
>>
>> Hmm.. commit  introduced `hide_workers` in the struct. Having 
>> both
>> names in the struct so far apart even seems a bit confusing and sloppy. Do 
>> you
>> think it would be possible to combine or rename?
>
>
> I noticed that. I was thinking about combining them, but
> "hide_workers" seems to be about "pretend there is no worker output
> even if there is" and "print_workers" is "keep track of whether or not
> there is worker output to print". Maybe I'll rename to
> "has_worker_output"?

The rename sounds a bit better in my humble opinion. Thanks.

Re: Index Skip Scan

2020-01-22 Thread Floris Van Nee
Hi Dmitry,


> > On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote:
> >
> > Anyone please correct me if I'm wrong, but I think one case where the 
> > current patch relies on some data from the page it has locked before it in 
> > checking this hi/lo key. I think it's possible for the following sequence 
> > to happen. Suppose we have a very simple one leaf-page btree containing 
> > four elements: leaf page 1 = [2,4,6,8]
> > We do a backwards index skip scan on this and have just returned our first 
> > tuple (8). The buffer is left pinned but unlocked. Now, someone else comes 
> > in and inserts a tuple (value 5) into this page, but suppose the page 
> > happens to be full. So a page split occurs. As far as I know, a page split 
> > could happen at any random element in the page. One of the situations we 
> > could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1.

> In case we just returned a tuple, the next action would be either
> check the next page for another key or search down the tree. Maybe

But it won't look at the 'next page for another key', but rather at the 'same
page for another key', right? In the _bt_scankey_within_page shortcut we're
taking, there's no stepping to a next page involved. It just re-locks the page
that it previously locked.

> I'm missing something in your scenario, but the latter will land us on a
> required page (we do not point to any leaf here), and before the former
> there is a check for high/low key. Is there anything else missing?

Let me try to clarify. After we return the first tuple, so->currPos.buf is 
pointing to page=1 in my example (it's the only page after all). We've returned 
item=8. Then the split happens and the items get rearranged as in my example. 
We're still pointing with so->currPos.buf to page=1, but the page now contains 
[2,4]. The split happened to the right, so there's a page=2 with [5,6,8], 
however the ongoing index scan is unaware of that.
Now _bt_skip gets called to fetch the next tuple. It starts by checking 
_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the 
result of which will be 'true': we're comparing the skip key to the low key of 
the page. So it thinks the next key can be found on the current page. It locks 
the page and does a _binsrch, finding item=4 to be returned.

The problem here is that _bt_scankey_within_page mistakenly returns true, 
thereby limiting the search to just the page that it's pointing to already.
It may be fine to just fix this function to return the proper value (I guess 
it'd also need to look at the high key in this example). It could also be fixed 
by not looking at the lo/hi key of the page, but to use the local tuple buffer 
instead. We already did a _read_page once, so if we have any matching tuples on 
that specific page, we have them locally in the buffer already. That way we 
never need to lock the same page twice.
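
To make the scenario concrete, a minimal SQL sketch (the values mirror the
example above; whether the plan is actually a skip scan, and whether the
timing can be hit in practice, depends on the patch and on plan choice):

-- Session 1: a one-leaf-page index, values 2, 4, 6, 8.
CREATE TABLE t (a int);
INSERT INTO t VALUES (2), (4), (6), (8);
CREATE INDEX ON t (a);

BEGIN;
DECLARE c CURSOR FOR SELECT DISTINCT ON (a) a FROM t ORDER BY a DESC;
FETCH 1 FROM c;   -- returns 8; the leaf page stays pinned but unlocked

-- Session 2: force a split of that leaf page.
INSERT INTO t SELECT 5 FROM generate_series(1, 500);

-- Session 1 again:
FETCH 1 FROM c;   -- should return 6; per the analysis above, the
                  -- _bt_scankey_within_page shortcut can keep the scan
                  -- on the old page and return 4 instead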




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Tom Lane
Bruce Momjian  writes:
> I think the big question is whether we want to make active prepared
> transactions more visible to administrators, either during server start
> or idle duration.

There's already the pg_prepared_xacts view ...
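
For instance, something as simple as this already shows an administrator
the age of every prepared transaction:

SELECT gid, database, owner, prepared, now() - prepared AS age
FROM pg_prepared_xacts
ORDER BY prepared;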

regards, tom lane




Re: [HACKERS] kqueue

2020-01-22 Thread Tom Lane
Matteo Beccati  writes:
> On 22/01/2020 17:06, Tom Lane wrote:
>> Matteo Beccati  writes:
>>> I had a NetBSD 8.0 VM lying around and I gave the patch a spin on latest
>>> master.
>>> With the kqueue patch, a pgbench -c basically hangs the whole postgres
>>> instance. Not sure if it's a kernel issue, HyperVM issue o what, but
>>> when it hangs, I can't even kill -9 the postgres processes or get the VM
>>> to properly shutdown. The same doesn't happen, of course, with vanilla
>>> postgres.

>> I'm a bit confused about what you are testing --- the kqueue patch
>> as per this thread, or that plus the WaitLatch refactorizations in
>> the other thread you point to above?

> my bad, I tested the v14 patch attached to the email.

Thanks for clarifying.

FWIW, I can't replicate the problem here using NetBSD 8.1 amd64
on bare metal.  I tried various pgbench parameters up to "-c 20 -j 20"
(on a 4-cores-plus-hyperthreading CPU), and it seems fine.

One theory is that NetBSD fixed something since 8.0, but I trawled
their 8.1 release notes [1], and the only items mentioning kqueue
or kevent are for fixes in the pty and tun drivers, neither of which
seem relevant.  (But wait ... could your VM setup be dependent on
a tunnel network interface for outside-the-VM connectivity?  Still
hard to see the connection though.)

My guess is that what you're seeing is a VM bug.

regards, tom lane

[1] https://cdn.netbsd.org/pub/NetBSD/NetBSD-8.1/CHANGES-8.1




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Bruce Momjian
On Wed, Jan 22, 2020 at 10:22:21AM -0500, Tom Lane wrote:
> Thomas Kellerer  writes:
> > Tom Lane schrieb am 22.01.2020 um 16:05:
> >> Right.  It's the XA transaction manager's job not to forget uncommitted
> >> transactions.  Reasoning as though no TM exists is not only not very
> >> relevant, but it might lead you to put in features that actually
> >> make the TM's job harder.  In particular, a timeout (or any other
> >> mechanism that leads PG to abort or commit a prepared transaction
> >> of its own accord) does that.
> 
> > That's a fair point, but the reality is that not all XA transaction managers
> > do a good job with that.
> 
> If you've got a crappy XA manager, you should get a better one, not
> ask us to put in features that make PG unsafe to use with well-designed
> XA managers.

I think the big question is whether we want to make active prepared
transactions more visible to administrators, either during server start
or idle duration.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Duplicate Workers entries in some EXPLAIN plans

2020-01-22 Thread Maciek Sakrejda
Thanks! I'll fix the brace issues. Re: the other items:

> +   int num_workers = planstate->worker_instrument->num_workers;
> +   int n;
> +   worker_strs = (StringInfo *) palloc0(num_workers * 
> sizeof(StringInfo));
> +   for (n = 0; n < num_workers; n++) {
>
> I think C99 would be better here. Also no parenthesis needed.

Pardon my C illiteracy, but what am I doing that's not valid C99 here?

> +   /* Prepare worker general execution details */
> +   if (es->analyze && es->verbose && planstate->worker_instrument)
> +   {
> +   WorkerInstrumentation *w = planstate->worker_instrument;
> +   int n;
> +
> +   for (n = 0; n < w->num_workers; ++n)
>
> I think C99 would be better here.

And here (if it's not the same problem)?

> +   ExplainCloseGroup("Workers", "Workers", false, es);
> +   // do we have any other cleanup to do?
>
> This comment does not really explain anything. Either remove
> or rephrase. Also C style comments.

Good catch, thanks--I had put this in to remind myself (and reviewers)
about cleanup, but I don't think there's anything else to do, so I'll
just drop it.

> int indent; /* current indentation level */
> List   *grouping_stack; /* format-specific grouping state */
> +   boolprint_workers;  /* whether current node has worker metadata */
>
> Hmm.. commit  introduced `hide_workers` in the struct. Having 
> both
> names in the struct so far apart even seems a bit confusing and sloppy. Do you
> think it would be possible to combine or rename?

I noticed that. I was thinking about combining them, but
"hide_workers" seems to be about "pretend there is no worker output
even if there is" and "print_workers" is "keep track of whether or not
there is worker output to print". Maybe I'll rename to
"has_worker_output"?

> +extern void ExplainOpenWorker(StringInfo worker_str, ExplainState *es);
> +extern void ExplainCloseWorker(ExplainState *es);
> +extern void ExplainFlushWorkers(StringInfo *worker_strs, int num_workers, 
> ExplainState *es);
>
> No need to expose those, is there? I feel there should be static.

Good call, I'll update.




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-01-22 Thread Jehan-Guillaume de Rorthais
Hi,

First, it seems you did not reply to Alvaro's concerns in your new set of
patches. See:

https://www.postgresql.org/message-id/20190917195800.GA16694%40alvherre.pgsql

On Tue, 24 Dec 2019 21:26:14 +0900 (JST)
Kyotaro Horiguchi  wrote:
[...]
> > Indeed, "loosing" is a better match for this state.
> >
> > However, what's the point of this state from the admin point of view? In
> > various situation, the admin will have no time to react immediately and fix
> > whatever could help.
> > 
> > How useful is this specific state?  
> 
> If we assume "losing" segments as "lost", a segment once "lost" can
> return to "keeping" or "streaming" state. That is intuitively
> impossible. On the other hand if we assume it as "keeping", it should
> not be removed by the next checkpoint but actually it can be
> removed. The state "losing" means such a unstable state different from
> both "lost" and "keeping".

OK, indeed.

But I'm still uncomfortable with this "unstable" state. It would be better if
we could grab a stable state: either "keeping" or "lost".

> > +  Availability of WAL records claimed by this
> > +  slot. streaming, keeping,
> > 
> > Slots are keeping WALs, not WAL records. Shouldn't it be "Availability of
> > WAL files claimed by this slot"?  
> 
> I choosed "record" since a slot points a record. I'm not sure but I'm
> fine with "file". Fixed catalogs.sgml and config.sgml that way.

Thanks.

> > +  streaming means that the claimed records are
> > +  available within max_wal_size. keeping means
> > 
> > I wonder if streaming is the appropriate name here. The WALs required might
> > be available for streaming, but the slot not active, thus not "streaming".
> > What about merging with the "active" field, in the same fashion as
> > pg_stat_activity.state? We would have an enum "pg_replication_slots.state"
> > with the following states:
> > * inactive: non active slot
> > * active: activated, required WAL within max_wal_size
> > * keeping: activated, max_wal_size is exceeded but still held by replication
> >   slots or wal_keep_segments.
> > * lost: some WAL are definitely lost
> > 
> > Thoughts?  
> 
> In the first place, I realized that I was missed a point about the
> relationship between max_wal_size and max_slot_wal_keep_size
> here. Since the v15 of this patch, GetLsnAvailablity returns
> "streaming" when the restart_lsn is within max_wal_size. That behavior
> makes sense when max_slot_wal_keep_size > max_wal_size. However, in
> the contrary case, restart_lsn could be lost even it is
> withinmax_wal_size. So we would see "streaming" (or "normal") even
> though restart_lsn is already lost. That is broken.
> 
> In short, the "streaming/normal" state is useless if
> max_slot_wal_keep_size < max_wal_size.

Good catch!

> Finally I used the following wordings.
> 
> (there's no "inactive" wal_state)
> 
> * normal: required WAL within max_wal_size when max_slot_wal_keep_size
>   is larger than max_wal_size.
> * keeping: required segments are held by replication slots or
>   wal_keep_segments.
> 
> * losing: required segments are about to be removed or may be already
>   removed but streaming is not dead yet.

As I wrote, I'm still uncomfortable with this state. Maybe we should ask the
other reviewers' opinions on this.
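
To make that concrete, these states would surface in monitoring queries
along the following lines (a sketch only; the column name wal_status and
the state values are taken from the patch under review and may still
change):

SELECT slot_name, active, restart_lsn, wal_status
FROM pg_replication_slots
WHERE wal_status IN ('keeping', 'losing', 'lost');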

[...]
> >   WARNING:  some replication slots have lost required WAL segments
> >   DETAIL:  Slot slot_limit_st lost 177 segment(s)
> > 
> > I wonder if this is useful to show these messages for slots that were
> > already dead before this checkpoint?  
> 
> Makes sense. I changed KeepLogSeg so that it emits the message only on
> slot_names changes.

Thanks.

Bellow some code review.

In regard with FindOldestXLogFileSegNo(...):

> /*
>  * Return the oldest WAL segment file.
>  *
>  * The returned value is XLogGetLastRemovedSegno() + 1 when the function
>  * returns a valid value.  Otherwise this function scans over WAL files and
>  * finds the oldest segment at the first time, which could be very slow.
>  */
> XLogSegNo
> FindOldestXLogFileSegNo(void)

The comment is not clear to me. I suppose "at the first time" might better be
expressed as "if none has been removed since last startup"?

Moreover, what about patching XLogGetLastRemovedSegno() itself instead of
adding a function?

In regard with GetLsnAvailability(...):

> /*
>  * Detect availability of the record at given targetLSN.
>  *
>  * targetLSN is restart_lsn of a slot.

Wrong argument name. It is called "restart_lsn" in the function
declaration.

>  * restBytes is the pointer to uint64 variable, to store the remaining bytes
>  * until the slot goes into "losing" state.

I'm not convinced with this argument name. What about "remainingBytes"? Note
that you use remaining_bytes elsewhere in your patch.

>  * -1 is stored to restBytes if the values is useless.

What about returning a true negative value when the slot is really lost?

All in all, I feel like this function is on the fence between being generic
because of its 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-01-22 Thread Alvaro Herrera
I looked at this patchset and it seemed natural to apply 0008 next
(adding work_mem to subscriptions).  Attached is Dilip's latest version,
plus my review changes.  This will break the patch tester's logic; sorry
about that.

Which part of this change is it that sets the process's
logical_decoding_work_mem to the given value?  I was unable to figure
that out.  Is it missing, or am I just stupid?

Changes:
* the patch adds logical_decoding_work_mem SGML, but that has already
  been applied (cec2edfa7859); remove dupe.

* parse_subscription_options() comment says that it will raise an error if a
  caller does not pass the pointer for an option but option list
  specifies that option.  It does not really implement that behavior (an
  existing problem): instead, if the pointer is not passed, the option
  is ignored.  Moreover, this new patch continued to fail to handle
  things as the comment says.  I decided to implement the documented
  behavior instead; it's now inconsistent with how the other options are
  implemented.  I think we should fix the other options to behave as the
  comment says, because it's a more convenient API; if we instead opted
  to update the code comment to match the code, each caller would have
  to be checked to verify that the correct options are passed, which is
  pointless and error prone.

* the parse_subscription_options API is a mess.  I reordered the
  arguments a little bit; also change the argument layout in callers so
  that each caller is grouped more sensibly.  Also added comments to
  simplify reading the argument lists.  I think this could be fixed by
  using an ad-hoc struct to pass in and out.  Didn't get around to doing
  that, seems an unrelated potential improvement.

* trying to do own range checking in pgoutput and subscriptioncmds.c
  seems pointless and likely to get out of sync with guc.c.  Simpler is
  to call set_config_option() to verify that the argument is in range.
  (Note a further problem in the patch series: the range check in
  subscriptioncmds.c is only added in patch 0009).

* parsing integers using scanint8() seemed weird (error messages there
  do not correspond to what we want).  After a couple of false starts, I
  decided to rely on guc.c's set_config_option() followed by parse_int().
  That also has the benefit that you can give it units.

* psql \dRs+ should display the work_mem; patch failed to do that.
  Added.  Unit display is done by pg_size_pretty(), which might be
  different from what guc.c does, but I think it works OK.
  It's the first place where we use pg_size_pretty to show a memory
  limit, however.
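
As a usage sketch: logical_decoding_work_mem itself already exists, while
the per-subscription work_mem option is only what this patchset proposes,
so its exact syntax may still change:

SET logical_decoding_work_mem = '64MB';

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (work_mem = '128MB');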

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From a31b4ebd90dd7a4c94a35f2b3452258078c30e37 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 22 Jan 2020 12:44:13 -0300
Subject: [PATCH 1/2] Dilip's original

---
 doc/src/sgml/config.sgml  | 21 +
 doc/src/sgml/ref/create_subscription.sgml | 12 +
 src/backend/catalog/pg_subscription.c |  1 +
 src/backend/commands/subscriptioncmds.c   | 44 ---
 .../libpqwalreceiver/libpqwalreceiver.c   |  3 ++
 src/backend/replication/logical/worker.c  |  1 +
 src/backend/replication/pgoutput/pgoutput.c   | 30 -
 src/include/catalog/pg_subscription.h |  3 ++
 src/include/replication/walreceiver.h |  1 +
 9 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3ccacd528b..163cc77d1d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1751,6 +1751,27 @@ include_dir 'conf.d'
   
  
 
+ 
+  logical_decoding_work_mem (integer)
+  
+   logical_decoding_work_mem configuration parameter
+  
+  
+  
+   
+Specifies the maximum amount of memory to be used by logical decoding,
+before some of the decoded changes are either written to local disk.
+This limits the amount of memory used by logical streaming replication
+connections. It defaults to 64 megabytes (64MB).
+Since each replication connection only uses a single buffer of this size,
+and an installation normally doesn't have many such connections
+concurrently (as limited by max_wal_senders), it's
+safe to set this value significantly higher than work_mem,
+reducing the amount of decoded changes written to disk.
+   
+  
+ 
+
  
   max_stack_depth (integer)
   
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1a90c244fb..91790b0c95 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION subscription_name
 

+
+   
+work_mem (integer)
+
+ 
+  Limits the 

Re: [HACKERS] kqueue

2020-01-22 Thread Matteo Beccati
On 22/01/2020 17:06, Tom Lane wrote:
> Matteo Beccati  writes:
>> On 21/01/2020 02:06, Thomas Munro wrote:
>>> [1] 
>>> https://www.postgresql.org/message-id/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
> 
>> I had a NetBSD 8.0 VM lying around and I gave the patch a spin on latest
>> master.
>> With the kqueue patch, a pgbench -c basically hangs the whole postgres
>> instance. Not sure if it's a kernel issue, HyperVM issue o what, but
>> when it hangs, I can't even kill -9 the postgres processes or get the VM
>> to properly shutdown. The same doesn't happen, of course, with vanilla
>> postgres.
> 
> I'm a bit confused about what you are testing --- the kqueue patch
> as per this thread, or that plus the WaitLatch refactorizations in
> the other thread you point to above?

my bad, I tested the v14 patch attached to the email.

The quoted url was just above the patch name in the email client and
somehow my brain thought I was quoting the v14 patch name.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/




Re: [HACKERS] kqueue

2020-01-22 Thread Tom Lane
Matteo Beccati  writes:
> On 21/01/2020 02:06, Thomas Munro wrote:
>> [1] 
>> https://www.postgresql.org/message-id/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com

> I had a NetBSD 8.0 VM lying around and I gave the patch a spin on latest
> master.
> With the kqueue patch, a pgbench -c basically hangs the whole postgres
> instance. Not sure if it's a kernel issue, HyperVM issue o what, but
> when it hangs, I can't even kill -9 the postgres processes or get the VM
> to properly shutdown. The same doesn't happen, of course, with vanilla
> postgres.

I'm a bit confused about what you are testing --- the kqueue patch
as per this thread, or that plus the WaitLatch refactorizations in
the other thread you point to above?

I've gotten through check-world successfully with the v14 kqueue patch
atop yesterday's HEAD on:

* macOS Catalina 10.15.2 (current release)
* FreeBSD/amd64 12.0-RELEASE-p12
* NetBSD/amd64 8.1
* NetBSD/arm 8.99.41
* OpenBSD/amd64 6.5

(These OSes are all on bare metal, no VMs involved)

This just says it doesn't lock up, of course.  I've not attempted
any performance-oriented tests.

regards, tom lane




Re: Index Skip Scan

2020-01-22 Thread Dmitry Dolgov
> On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote:
>
> Anyone please correct me if I'm wrong, but I think one case where the current 
> patch relies on some data from the page it has locked before it in checking 
> this hi/lo key. I think it's possible for the following sequence to happen. 
> Suppose we have a very simple one leaf-page btree containing four elements: 
> leaf page 1 = [2,4,6,8]
> We do a backwards index skip scan on this and have just returned our first 
> tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in 
> and inserts a tuple (value 5) into this page, but suppose the page happens to 
> be full. So a page split occurs. As far as I know, a page split could happen 
> at any random element in the page. One of the situations we could be left 
> with is:
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1.

In case we just returned a tuple, the next action would be either
check the next page for another key or search down the tree. Maybe
I'm missing something in your scenario, but the latter will land us on a
required page (we do not point to any leaf here), and before the former
there is a check for high/low key. Is there anything else missing?

> Now that I look at the patch again, I fear there currently may also be such a 
> dependency in the "Advance forward but read backward"-case. It saves the 
> offset number of a tuple in a variable, then does a _bt_search (releasing the 
> lock and pin on the page). At this point, anything can happen to the tuples 
> on this page - the page may be compacted by vacuum such that the offset 
> number you have in your variable does not match the actual offset number of 
> the tuple on the page anymore. Then, at the check for (nextOffset == 
> startOffset) later, there's a possibility the offsets are different even 
> though they relate to the same tuple.

Interesting point. The original idea here was to check that we haven't
returned to the same position after jumping, so maybe instead of offsets
we can check the tuple we found.




Re: pgsql: walreceiver uses a temporary replication slot by default

2020-01-22 Thread Sergei Kornilov
Hello

> In short, the following things:
> - wal_receiver_create_temp_slot should be made PGC_POSTMASTER,
> similarly to primary_slot_name and primary_conninfo.
> - WalReceiverMain() should not load the parameter from the GUC context
> by itself.
> - RequestXLogStreaming(), called by the startup process, should be in
> charge of defining if a temp slot should be used or not.

I would like to cross-post here a patch with such changes that I posted in the
"allow online change primary_conninfo" thread.
This thread is more appropriate for the discussion about
wal_receiver_create_temp_slot.

PS: I posted this patch in both threads mostly to make cfbot happy.

regards, Sergei

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e07dc01e80..14992a08d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,9 +4137,7 @@ ANY num_sync ( ).
 The default is on.  The only reason to turn this off would be if the
 remote instance is currently out of available replication slots.  This
-parameter can only be set in the postgresql.conf
-file or on the server command line.  Changes only take effect when the
-WAL receiver process starts a new connection.
+parameter can only be set at server start.

   
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..55e9294ae3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -283,6 +283,7 @@ bool		StandbyModeRequested = false;
 char	   *PrimaryConnInfo = NULL;
 char	   *PrimarySlotName = NULL;
 char	   *PromoteTriggerFile = NULL;
+bool		wal_receiver_create_temp_slot = true;
 
 /* are we currently in standby mode? */
 bool		StandbyMode = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a5e85d32f3..264b544194 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -73,7 +73,6 @@
 
 
 /* GUC variables */
-bool		wal_receiver_create_temp_slot;
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
@@ -349,42 +348,23 @@ WalReceiverMain(void)
 		WalRcvFetchTimeLineHistoryFiles(startpointTLI, primaryTLI);
 
 		/*
-		 * Create temporary replication slot if no slot name is configured or
-		 * the slot from the previous run was temporary, unless
-		 * wal_receiver_create_temp_slot is disabled.  We also need to handle
-		 * the case where the previous run used a temporary slot but
-		 * wal_receiver_create_temp_slot was changed in the meantime.  In that
-		 * case, we delete the old slot name in shared memory.  (This would
+		 * Create temporary replication slot if requested.  In that
+		 * case, we update slot name in shared memory.  (This would
 		 * all be a bit easier if we just didn't copy the slot name into
 		 * shared memory, since we won't need it again later, but then we
 		 * can't see the slot name in the stats views.)
 		 */
-		if (slotname[0] == '\0' || is_temp_slot)
+		if (is_temp_slot)
 		{
-			bool		changed = false;
+			snprintf(slotname, sizeof(slotname),
+	 "pg_walreceiver_%lld",
+	 (long long int) walrcv_get_backend_pid(wrconn));
 
-			if (wal_receiver_create_temp_slot)
-			{
-snprintf(slotname, sizeof(slotname),
-		 "pg_walreceiver_%lld",
-		 (long long int) walrcv_get_backend_pid(wrconn));
-
-walrcv_create_slot(wrconn, slotname, true, 0, NULL);
-changed = true;
-			}
-			else if (slotname[0] != '\0')
-			{
-slotname[0] = '\0';
-changed = true;
-			}
+			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
 
-			if (changed)
-			{
-SpinLockAcquire(&walrcv->mutex);
-strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
-walrcv->is_temp_slot = wal_receiver_create_temp_slot;
-SpinLockRelease(&walrcv->mutex);
-			}
+			SpinLockAcquire(&walrcv->mutex);
+			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
+			SpinLockRelease(&walrcv->mutex);
 		}
 
 		/*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..a36bc83d2c 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -248,10 +248,16 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	else
 		walrcv->conninfo[0] = '\0';
 
-	if (slotname != NULL)
+	if (slotname != NULL && slotname[0] != '\0')
+	{
 		strlcpy((char *) walrcv->slotname, slotname, NAMEDATALEN);
+		walrcv->is_temp_slot = false;
+	}
 	else
+	{
 		walrcv->slotname[0] = '\0';
+		walrcv->is_temp_slot = wal_receiver_create_temp_slot;
+	}
 
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9f179a9129..c4b36eb2ad 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1994,7 +1994,7 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
-		{"wal_receiver_create_temp_slot", PGC_SIGHUP, 

Re: allow online change primary_conninfo

2020-01-22 Thread Sergei Kornilov
Hello

> Yeah, you are right. I was not paying much attention but something
> does not stick here. My understanding is that we should have the WAL
> receiver receive the value it needs to use from the startup process
> (aka via RequestXLogStreaming from xlog.c), and that we ought to make
> this new parameter PGC_POSTMASTER instead of PGC_SIGHUP. HEAD is
> inconsistent here.

Thank you!

I attached two patches:
- the first changes wal_receiver_create_temp_slot to PGC_POSTMASTER and moves
the logic to RequestXLogStreaming
- the second is based on the last published v6 version of the main patch. It
changes wal_receiver_create_temp_slot back to PGC_SIGHUP along with
primary_conninfo and primary_slot_name, and will restart the walreceiver if
needed.
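
In practice the difference looks like this (a sketch of the intended
behavior as described in this thread):

ALTER SYSTEM SET wal_receiver_create_temp_slot = off;
-- With the first patch (PGC_POSTMASTER) this only takes effect after a
-- server restart.  With the second patch (PGC_SIGHUP) a reload is enough,
-- and the running walreceiver is restarted when the value changes:
SELECT pg_reload_conf();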

Regarding the main patch: we have several possibilities for moving the
RequestXLogStreaming call. We need to choose one.
And, of course, changes in the text...

regards, Sergei

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 14992a08d7..19f36a3318 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4000,9 +4000,15 @@ ANY num_sync ( 
@@ -4017,9 +4023,13 @@ ANY num_sync ( ).
-  This parameter can only be set at server start.
+  This parameter can only be set in the postgresql.conf
+  file or on the server command line.
   This setting has no effect if primary_conninfo is not
-  set.
+  set or the server is not in standby mode.
+ 
+ 
+  The WAL receiver is restarted after an update of primary_slot_name.
  
 

@@ -4137,7 +4147,11 @@ ANY num_sync ( ).
 The default is on.  The only reason to turn this off would be if the
 remote instance is currently out of available replication slots.  This
-parameter can only be set at server start.
+parameter can only be set in the postgresql.conf
+file or on the server command line.
+   
+   
+The WAL receiver is restarted after an update of wal_receiver_create_temp_slot.

   
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55e9294ae3..2861b9f22a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -803,6 +803,12 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool lastSourceFailed = false;
 
+/*
+ * Need for restart running WalReceiver due the configuration change.
+ * Suitable only for XLOG_FROM_STREAM source
+ */
+static bool pendingWalRcvRestart = false;
+
 typedef struct XLogPageReadPrivate
 {
 	int			emode;
@@ -11818,6 +11824,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	for (;;)
 	{
 		int			oldSource = currentSource;
+		bool		startWalReceiver = false;
 
 		/*
 		 * First check if we failed to read from the current source, and
@@ -11851,54 +11858,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	if (!StandbyMode)
 		return false;
 
-	/*
-	 * If primary_conninfo is set, launch walreceiver to try
-	 * to stream the missing WAL.
-	 *
-	 * If fetching_ckpt is true, RecPtr points to the initial
-	 * checkpoint location. In that case, we use RedoStartLSN
-	 * as the streaming start position instead of RecPtr, so
-	 * that when we later jump backwards to start redo at
-	 * RedoStartLSN, we will have the logs streamed already.
-	 */
-	if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
-	{
-		XLogRecPtr	ptr;
-		TimeLineID	tli;
-
-		if (fetching_ckpt)
-		{
-			ptr = RedoStartLSN;
-			tli = ControlFile->checkPointCopy.ThisTimeLineID;
-		}
-		else
-		{
-			ptr = RecPtr;
-
-			/*
-			 * Use the record begin position to determine the
-			 * TLI, rather than the position we're reading.
-			 */
-			tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
-			if (curFileTLI > 0 && tli < curFileTLI)
-elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
-	 (uint32) (tliRecPtr >> 32),
-	 (uint32) tliRecPtr,
-	 tli, curFileTLI);
-		}
-		curFileTLI = tli;
-		RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-			 PrimarySlotName);
-		receivedUpto = 0;
-	}
-
 	/*
 	 * Move to XLOG_FROM_STREAM state in either case. We'll
 	 * get immediate failure if we didn't launch walreceiver,
 	 * and move on to the next state.
 	 */
 	currentSource = XLOG_FROM_STREAM;
+	startWalReceiver = true;
 	break;
 
 case XLOG_FROM_STREAM:
@@ -12044,7 +12010,69 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	Assert(StandbyMode);
 
 	/*
-	 * Check if WAL receiver is still active.
+	 * shutdown WAL receiver if restart is requested.
+	 */
+	if 

Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Tom Lane
Thomas Kellerer  writes:
> Tom Lane schrieb am 22.01.2020 um 16:05:
>> Right.  It's the XA transaction manager's job not to forget uncommitted
>> transactions.  Reasoning as though no TM exists is not only not very
>> relevant, but it might lead you to put in features that actually
>> make the TM's job harder.  In particular, a timeout (or any other
>> mechanism that leads PG to abort or commit a prepared transaction
>> of its own accord) does that.

> That's a fair point, but the reality is that not all XA transaction managers
> do a good job with that.

If you've got a crappy XA manager, you should get a better one, not
ask us to put in features that make PG unsafe to use with well-designed
XA managers.

regards, tom lane




Re: Error message inconsistency

2020-01-22 Thread Tom Lane
Alvaro Herrera  writes:
> I wonder if we shouldn't be using errtablecol() here instead of (in
> addition to?) patching the errmsg() to include the table name.

> Discussion: If we really like having the table names in errtable(), then
> we should have psql display it by default, and other tools will follow
> suit; in that case we should remove the table name from error messages,
> or at least not add it to even more messages.

> If we instead think that errtable() is there just for programmatically
> knowing the affected table, then we should add the table name to all
> errmsg() where relevant, as in the patch under discussion.

> What do others think?

I believe that the intended use of errtable() and friends is so that
applications don't have to parse those names out of a human-friendly
message.  We should add calls to them in cases where (a) an application
can tell from the SQLSTATE that some particular table is involved
and (b) it's likely that the app would wish to know which table that is.
I don't feel a need to sprinkle every single ereport() in the backend
with errtable(), just ones where there's a plausible use-case for the
extra cycles that will be spent.

On the other side of the coin, whether we use errtable() is not directly
a factor in deciding what the human-friendly messages should say.
I do find it hard to envision a case where we'd want to use errtable()
and *not* put the table name in the error message, just because if
applications need to know something then humans probably do too.  But
saying that we can make the messages omit info because it's available
from these program-friendly fields seems 100% wrong to me, even if one
turns a blind eye to the fact that existing client code likely won't
show those fields to users.
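
For instance, a client that asks for the verbose fields can already see
them today (a rough sketch; output abridged):

CREATE TABLE t (a int NOT NULL);
INSERT INTO t VALUES (NULL);
-- ERROR:  null value in column "a" violates not-null constraint
\errverbose
-- Re-displays the last error with VERBOSITY verbose, including the
-- SQLSTATE (23502) and the SCHEMA NAME / TABLE NAME / COLUMN NAME
-- fields that errtablecol() attaches to this error.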

regards, tom lane




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Thomas Kellerer
Tom Lane schrieb am 22.01.2020 um 16:05:
> Craig Ringer  writes:
>> So I don't really see the point of doing anything with 2PC xacts
>> within Pg proper. It's the job of the app that prepares the 2PC xacts,
>> and if that app is unable to resolve them for some reason there's no
>> generally-correct action to take without administrator action.
>
> Right.  It's the XA transaction manager's job not to forget uncommitted
> transactions.  Reasoning as though no TM exists is not only not very
> relevant, but it might lead you to put in features that actually
> make the TM's job harder.  In particular, a timeout (or any other
> mechanism that leads PG to abort or commit a prepared transaction
> of its own accord) does that.
>
> Or another way to put it: the fundamental premise of a prepared
> transaction is that it will be possible to commit it on-demand with
> extremely low chance of failure.  Designing in a reason why we'd
> fail to be able to do that would be an anti-feature.

That's a fair point, but the reality is that not all XA transaction managers
do a good job with that.

Having something on the database side that can handle that in
exceptional cases would be very welcome.

(In Oracle you sometimes can't even run DML on tables where you have orphaned
XA transactions - which is extremely annoying, because by default
only the DBA can clean that up.)

Thomas






Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Tom Lane
Craig Ringer  writes:
> So I don't really see the point of doing anything with 2PC xacts
> within Pg proper. It's the job of the app that prepares the 2PC xacts,
> and if that app is unable to resolve them for some reason there's no
> generally-correct action to take without administrator action.

Right.  It's the XA transaction manager's job not to forget uncommitted
transactions.  Reasoning as though no TM exists is not only not very
relevant, but it might lead you to put in features that actually
make the TM's job harder.  In particular, a timeout (or any other
mechanism that leads PG to abort or commit a prepared transaction
of its own accord) does that.

Or another way to put it: the fundamental premise of a prepared
transaction is that it will be possible to commit it on-demand with
extremely low chance of failure.  Designing in a reason why we'd
fail to be able to do that would be an anti-feature.

regards, tom lane




Re: We're getting close to the end of 2020-01 CF

2020-01-22 Thread Tomas Vondra

On Wed, Jan 22, 2020 at 02:09:39PM +0900, Michael Paquier wrote:
> On Tue, Jan 21, 2020 at 05:20:17PM +0100, Tomas Vondra wrote:
>> Yeah, you're right returning them with feedback seems more appropriate,
>> given the long inactivity. Plus, the CF app apparently does not allow
>> moving WoA patches to the next CF anyway.
>
> FWIW, I tend to take a base of two weeks as a sensible period of
> time as that's half the CF period when I do the classification job.

Yeah. I've only nagged about patches that have been set to WoA before
the CF began, so far.

>> Those are the patches that have been set as WoA before this CF, and have
>> not been updated since. It's quite possible the state is stale for some
>> of those patches, although I've tried to check if there were any
>> messages on the list.
>
> You need to be careful about bug fixes, as these are things that we
> don't want to lose track of.  Another thing that I noticed in the past
> is that some patches are registered as bug fixes, but they actually
> implement a new feature.  So there can be tricky cases.

Makes sense.

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: We're getting close to the end of 2020-01 CF

2020-01-22 Thread Alvaro Herrera
On 2020-Jan-22, Michael Paquier wrote:

> On Tue, Jan 21, 2020 at 05:20:17PM +0100, Tomas Vondra wrote:

> > Those are the patches that have been set as WoA before this CF, and have
> > not been updated since. It's quite possible the state is stale for some
> > of those patches, although I've tried to check if there were any
> > messages on the list.
> 
> You need to be careful about bug fixes, as these are things that we
> don't want to lose track of.

Oh yeah, I forgot to mention the exception for bug fixes. I agree these
should almost never be RwF (or in any way closed other than Committed,
really, except in, err, exceptional cases).

> Another thing that I noticed in the past
> is that some patches are registered as bug fixes, but they actually
> implement a new feature.  So there can be tricky cases.

Oh yeah, I think the CFM should exercise judgement and reclassify.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Error message inconsistency

2020-01-22 Thread Alvaro Herrera
I wonder if we shouldn't be using errtablecol() here instead of (in
addition to?) patching the errmsg() to include the table name.

Discussion: If we really like having the table names in errtable(), then
we should have psql display it by default, and other tools will follow
suit; in that case we should remove the table name from error messages,
or at least not add it to even more messages.

If we instead think that errtable() is there just for programmatically
knowing the affected table, then we should add the table name to all
errmsg() where relevant, as in the patch under discussion.

What do others think?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Duplicate Workers entries in some EXPLAIN plans

2020-01-22 Thread Georgios Kokolatos
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

> TEXT format was tricky due to its inconsistencies, but I think I have
> something working reasonably well. I added a simple test for TEXT
> format output as well, using a similar approach as the JSON format

Great!

> test, and liberally regexp_replacing away any volatile output. I
> suppose in theory we could do this for YAML, too, but I think it's
> gross enough not to be worth it, especially given the high similarity
> of all the structured outputs.

Agreed, what is in the patch suffices. Overall great work, a couple of
minor nitpicks if you allow me.

+   /* Prepare per-worker output */
+   if (es->analyze && planstate->worker_instrument) {

Style, parenthesis on its own line.

+   int num_workers = planstate->worker_instrument->num_workers;
+   int n;
+   worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo));
+   for (n = 0; n < num_workers; n++) {

I think C99 would be better here. Also no parenthesis needed.

+   worker_strs[n] = makeStringInfo();
+   }
+   }

@@ -1357,6 +1369,58 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
}

+   /* Prepare worker general execution details */
+   if (es->analyze && es->verbose && planstate->worker_instrument)
+   {
+   WorkerInstrumentation *w = planstate->worker_instrument;
+   int n;
+
+   for (n = 0; n < w->num_workers; ++n)

I think C99 would be better here.

+   {
+   Instrumentation *instrument = >instrument[n];
+   double  nloops = instrument->nloops;

-   appendStringInfoSpaces(es->str, es->indent * 2);
-   if (n > 0 || !es->hide_workers)
-   appendStringInfo(es->str, "Worker %d:  ", n);
+   if (indent)
+   {
+   appendStringInfoSpaces(es->str, es->indent * 2);
+   }

Style: No parenthesis needed


-   if (opened_group)
-   ExplainCloseGroup("Workers", "Workers", false, es);
+   /* Show worker detail */
+   if (planstate->worker_instrument) {
+   ExplainFlushWorkers(worker_strs, 
planstate->worker_instrument->num_workers, es);
}

Style: No parenthesis needed


+* just indent once, to add worker info on the next worker line.
+*/
+   if (es->str == es->root_str)
+   {
+   es->indent += es->format == EXPLAIN_FORMAT_TEXT ? 1 : 2;
+   }
+

Style: No parenthesis needed

+   ExplainCloseGroup("Workers", "Workers", false, es);
+   // do we have any other cleanup to do?

This comment does not really explain anything. Either remove
or rephrase. Also C style comments.

+   es->print_workers = false;
+}

int indent; /* current indentation level */
List   *grouping_stack; /* format-specific grouping state */
+   boolprint_workers;  /* whether current node has worker metadata */

Hmm.. commit  introduced `hide_workers` in the struct. Having both
names in the struct so far apart even seems a bit confusing and sloppy. Do you
think it would be possible to combine or rename?


+extern void ExplainOpenWorker(StringInfo worker_str, ExplainState *es);
+extern void ExplainCloseWorker(ExplainState *es);
+extern void ExplainFlushWorkers(StringInfo *worker_strs, int num_workers, 
ExplainState *es);

No need to expose those, is there? I feel there should be static.

Awaiting an answer or resolution of these comments to change the status.

//Georgios

Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Craig Ringer
On Wed, 22 Jan 2020 at 16:45, Ants Aasma  wrote:

> The intended use case of two phase transactions is ensuring atomic
> durability of transactions across multiple database systems.

Exactly. I was trying to find a good way to say this.

It doesn't make much sense to embed a 2PC resolver in Pg unless it's
an XA coordinator or similar. And generally it doesn't make sense for
the distributed transaction coordinator to reside alongside one of the
datasources being managed anyway, especially where failover and HA are
in the picture.

I *can* see it being useful, albeit rather heavyweight, to implement
an XA coordinator on top of PostgreSQL. Mostly for HA and replication
reasons. But generally you'd use postgres instances for the HA
coordinator and the DB(s) in which 2PC txns are being managed. While
you could run them in the same instance it'd probably mostly be for
toy-scale PoC/demo/testing use.

So I don't really see the point of doing anything with 2PC xacts
within Pg proper. It's the job of the app that prepares the 2PC xacts,
and if that app is unable to resolve them for some reason there's no
generally-correct action to take without administrator action.




Re: Error message inconsistency

2020-01-22 Thread MBeena Emerson
Hi Mahendra,

Thanks for working on this.

On Wed, 22 Jan 2020 at 13:26, Mahendra Singh Thalor  wrote:
>
> On Tue, 21 Jan 2020 at 10:51, Amit Kapila  wrote:
> >
> > On Mon, Jan 6, 2020 at 6:31 PM Mahendra Singh Thalor  
> > wrote:
> > >
> > > On Sat, 6 Jul 2019 at 09:53, Amit Kapila  wrote:
> > > >
> > > > On Mon, Jul 1, 2019 at 10:05 PM Alvaro Herrera 
> > > >  wrote:
> > > > >
> > > > > Do we have an actual patch here?
> > > > >
> > > >
> > > > We have a patch, but it needs some more work like finding similar
> > > > places and change all of them at the same time and then change the
> > > > tests to adapt the same.
> > > >
> > >
> > > Hi all,
> > > Based on above discussion, I tried to find out all the places where we 
> > > need to change error for "not null constraint".  As Amit Kapila pointed 
> > > out 1 place, I changed the error and adding modified patch.
> > >
> >
> > It seems you have not used the two error codes
> > (ERRCODE_NOT_NULL_VIOLATION and ERRCODE_CHECK_VIOLATION) pointed by me
> > above.
>
> Thanks Amit and Beena for reviewing patch.
>
> Yes, you are correct. I searched using error messages not error code.  That 
> was my mistake.  Now, I grepped using above error codes and found that these 
> error codes are used in 19 places. Below is the code parts of 19 places.
>
> 1. src/backend/utils/adt/domains.c
>
> 146 if (isnull)
> 147 ereport(ERROR,
> 148 (errcode(ERRCODE_NOT_NULL_VIOLATION),
> 149  errmsg("domain %s does not allow null 
> values",
> 150 
> format_type_be(my_extra->domain_type)),
> 151  errdatatype(my_extra->domain_type)));
> 152 break;
>
> I think, above error is for domain, so there is no need to add anything in 
> error message.
> -
> 2. src/backend/utils/adt/domains.c
>
> 181 if (!ExecCheck(con->check_exprstate, econtext))
> 182 ereport(ERROR,
> 183 (errcode(ERRCODE_CHECK_VIOLATION),
> 184  errmsg("value for domain %s violates 
> check constraint \"%s\"",
> 185 
> format_type_be(my_extra->domain_type),
> 186 con->name),
> 187  
> errdomainconstraint(my_extra->domain_type,
> 188  con->name)));
>
> I think the above error is for a domain, so there is no need to add anything
> to the error message.
> -
> 3. src/backend/partitioning/partbounds.c
>
> 1330 if (part_rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
> 1331 ereport(WARNING,
> 1332 (errcode(ERRCODE_CHECK_VIOLATION),
> 1333  errmsg("skipped scanning foreign table \"%s\" 
> which is a partition of default partition \"%s\"",
> 1334 RelationGetRelationName(part_rel),
> 1335 RelationGetRelationName(default_rel))));
>
> The relation name is already included in the error message.
> -
> 4. src/backend/partitioning/partbounds.c
>
> 1363 if (!ExecCheck(partqualstate, econtext))
> 1364 ereport(ERROR,
> 1365 (errcode(ERRCODE_CHECK_VIOLATION),
> 1366  errmsg("updated partition constraint for 
> default partition \"%s\" would be violated by some row",
> 1367 RelationGetRelationName(default_rel))));
>
> The relation name is already included in the error message.
> -
> 5. src/backend/executor/execPartition.c
>
> 342 ereport(ERROR,
>  343 (errcode(ERRCODE_CHECK_VIOLATION),
>  344  errmsg("no partition of relation \"%s\" found for 
> row",
>  345 RelationGetRelationName(rel)),
>  346  val_desc ?
>  347  errdetail("Partition key of the failing row 
> contains %s.",
>  348 val_desc) : 0));
>
> The relation name is already included in the error message.
> -
> 6. src/backend/executor/execMain.c
>
> 1877 ereport(ERROR,
> 1878 (errcode(ERRCODE_CHECK_VIOLATION),
> 1879  errmsg("new row for relation \"%s\" violates partition 
> constraint",
> 1880 
> RelationGetRelationName(resultRelInfo->ri_RelationDesc)),
> 
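
For context, the kind of change being discussed for the not-null case is roughly
the following sketch (the rel and att variables are placeholders, not taken from
the patch); the point is simply to name the relation alongside the column:

    ereport(ERROR,
            (errcode(ERRCODE_NOT_NULL_VIOLATION),
             errmsg("null value in column \"%s\" of relation \"%s\" violates not-null constraint",
                    NameStr(att->attname),
                    RelationGetRelationName(rel))));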

Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Ants Aasma
On Wed, 22 Jan 2020 at 09:02, Hamid Akhtar  wrote:
>
> At this stage, I'm not sure of the scale of changes this will require;
> however, I wanted to get an understanding and consensus on whether (a) this
> is something we should work on, and (b) whether an approach to implementing a
> timeout makes sense.
>
> Please feel free to share your thoughts here.

The intended use case of two phase transactions is ensuring atomic
durability of transactions across multiple database systems. This
necessarily means that there needs to be a failure tolerant agent that
ensures there is consensus about the status of the transaction and
then executes that consensus across all systems. In other words, there
needs to be a transaction manager for prepared transactions to actually
fulfil their purpose. Therefore I think that unilaterally timing out
prepared transactions is just shifting the consequences of a broken
client from availability to durability. But if durability was never a
concern, why is the client even using prepared transactions?

Citing the documentation:

> PREPARE TRANSACTION is not intended for use in applications or interactive 
> sessions. Its purpose is to allow an external transaction manager to perform 
> atomic global transactions across multiple databases or other transactional 
> resources. Unless you're writing a transaction manager, you probably 
> shouldn't be using PREPARE TRANSACTION.

Regards,
Ants Aasma
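
To make the division of labour concrete, here is a minimal, heavily simplified
libpq sketch of the happy path a two-resource transaction manager follows; the
connection strings, table, and GID are placeholders, and a real coordinator
would also persist its commit/abort decision and retry resolution after crashes,
which is exactly the part the server cannot do for it:

#include <stdbool.h>
#include <stdio.h>
#include <libpq-fe.h>

/* Execute one command; report and return false on failure. */
static bool
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);
    bool        ok = (PQresultStatus(res) == PGRES_COMMAND_OK);

    if (!ok)
        fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
    PQclear(res);
    return ok;
}

int
main(void)
{
    /* Two independent databases taking part in one global transaction. */
    PGconn     *db1 = PQconnectdb("dbname=shard1");
    PGconn     *db2 = PQconnectdb("dbname=shard2");
    bool        p1, p2;

    run(db1, "BEGIN");
    run(db1, "UPDATE accounts SET balance = balance - 10 WHERE id = 1");
    run(db2, "BEGIN");
    run(db2, "UPDATE accounts SET balance = balance + 10 WHERE id = 2");

    /* Phase 1: both participants must durably promise to commit. */
    p1 = run(db1, "PREPARE TRANSACTION 'gtx_42'");
    p2 = run(db2, "PREPARE TRANSACTION 'gtx_42'");

    /*
     * Phase 2: the coordinator decides.  If it dies between the two phases,
     * the prepared transactions sit there holding locks until the
     * coordinator (not the server) comes back and resolves them.
     */
    if (p1 && p2)
    {
        run(db1, "COMMIT PREPARED 'gtx_42'");
        run(db2, "COMMIT PREPARED 'gtx_42'");
    }
    else
    {
        /* Only roll back what actually got prepared. */
        if (p1)
            run(db1, "ROLLBACK PREPARED 'gtx_42'");
        if (p2)
            run(db2, "ROLLBACK PREPARED 'gtx_42'");
    }

    PQfinish(db1);
    PQfinish(db2);
    return 0;
}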




Re: error context for vacuum to include block number

2020-01-22 Thread Masahiko Sawada
On Wed, 22 Jan 2020 at 10:17, Justin Pryzby  wrote:
>
> On Tue, Jan 21, 2020 at 05:54:59PM -0300, Alvaro Herrera wrote:
> > > On Tue, Jan 21, 2020 at 03:11:35PM +0900, Masahiko Sawada wrote:
> > > > Some of them conflict with the current HEAD (62c9b52231). Please rebase
> > > > them.
> > >
> > > Sorry, that's due to another vacuum patch in this branch.
> > >
> > > The new patches won't conflict, except for 0005, so I renamed it for
> > > cfbot.
> > > If it's deemed to be useful, I'll make a separate branch for the others.
> >
> > I think you have to have some other patches applied before these,
> > because in the context lines for some of the hunks there are
> > double-slash comments.
>
> And I knew that, so I (re)tested that the extracted patches would apply, but it
> looks like maybe the patch checker is less smart (or more strict) than git, so
> it didn't work after all.

Thank you for updating the patches!

I'm not sure it's worth having the patches separately, but I could apply
all of them cleanly. Here are my comments on the code with all patches
applied:

1.
+   /* Used by the error callback */
+   char*relname;
+   char*relnamespace;
+   BlockNumber blkno;
+   int stage;  /* 0: scan heap; 1: vacuum heap; 2: vacuum index */
+} LVRelStats;

* The comment should be updated as we use both relname and
relnamespace for ereporting.

* We can leave blkno and stage, which are used only for error context
reporting, where they are, but put both relname and relnamespace at the
top of LVRelStats.

* The 'stage' values are missing one for index cleanup.

* Maybe we need a comment for 'blkno'.

2.
@@ -748,8 +742,31 @@ lazy_scan_heap(Relation onerel, VacuumParams
*params, LVRelStats *vacrelstats,
vacrelstats->scanned_pages = 0;
vacrelstats->tupcount_pages = 0;
vacrelstats->nonempty_pages = 0;
+
+   /* Setup error traceback support for ereport() */
+   vacrelstats->relnamespace =
get_namespace_name(RelationGetNamespace(onerel));
+   vacrelstats->relname = RelationGetRelationName(onerel);
+   vacrelstats->blkno = InvalidBlockNumber; /* Not known yet */
+   vacrelstats->stage = 0;
+
+   errcallback.callback = vacuum_error_callback;
+   errcallback.arg = (void *) vacrelstats;
+   errcallback.previous = error_context_stack;
+   error_context_stack = &errcallback;
+
vacrelstats->latestRemovedXid = InvalidTransactionId;

+   if (aggressive)
+   ereport(elevel,
+   (errmsg("aggressively vacuuming \"%s.%s\"",
+   vacrelstats->relnamespace,
+   vacrelstats->relname)));
+   else
+   ereport(elevel,
+   (errmsg("vacuuming \"%s.%s\"",
+   vacrelstats->relnamespace,
+   vacrelstats->relname)));

* It seems strange to me that only the initialization of latestRemovedXid
is done after the error callback setup.

* Maybe we can initialize relname and relnamespace in heap_vacuum_rel
rather than in lazy_scan_heap. BTW, the fields of vacrelstats are
initialized in different places: some of them in heap_vacuum_rel and
others in lazy_scan_heap. I think we can gather those that can be
initialized at that point into heap_vacuum_rel.

3.
/* Work on all the indexes, then the heap */
+   /* Don't use the errcontext handler outside this function */
+   error_context_stack = errcallback.previous;
lazy_vacuum_all_indexes(onerel, Irel, indstats,
vacrelstats, lps, nindexes);
-
/* Remove tuples from heap */
lazy_vacuum_heap(onerel, vacrelstats);
+   error_context_stack = &errcallback;

Maybe we can do like:

   /* Pop the error context stack */
   error_context_stack = errcallback.previous;

/* Work on all the indexes, then the heap */
lazy_vacuum_all_indexes(onerel, Irel, indstats,
vacrelstats, lps, nindexes);

/* Remove tuples from heap */
lazy_vacuum_heap(onerel, vacrelstats);

/* Push again the error context of heap scan */
error_context_stack = &errcallback;

4.
+   /* Setup error traceback support for ereport() */
+   /* vacrelstats->relnamespace already set */
+   /* vacrelstats->relname already set */

These comments can be merged like:

/*
 * Setup error traceback for ereport(). Both relnamespace and
 * relname are already set.
 */

5.
+   /* Setup error traceback support for ereport() */
+   vacrelstats.relnamespace = get_namespace_name(RelationGetNamespace(indrel));
+   vacrelstats.relname = RelationGetRelationName(indrel);
+   vacrelstats.blkno = InvalidBlockNumber; /* Not used */

Why do we need to initialize blkno despite not using it?

6.
+/*
+ * Error context callback for errors occurring during vacuum.
+ */
+static void
+vacuum_error_callback(void *arg)
+{
+   LVRelStats *cbarg = arg;
+
+   if (cbarg->stage == 0)
+   errcontext(_("while scanning block %u of relation \"%s.%s\""),
+   cbarg->blkno, cbarg->relnamespace, cbarg->relname);
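
For what it's worth, a minimal sketch of a callback that covers all three stages
from the struct comment above could look like this; the exact message wording
(and the index case in particular) is only an assumption, not taken from the patch:

static void
vacuum_error_callback(void *arg)
{
    LVRelStats *cbarg = arg;

    switch (cbarg->stage)
    {
        case 0:                 /* scanning heap */
            errcontext(_("while scanning block %u of relation \"%s.%s\""),
                       cbarg->blkno, cbarg->relnamespace, cbarg->relname);
            break;
        case 1:                 /* vacuuming heap */
            errcontext(_("while vacuuming block %u of relation \"%s.%s\""),
                       cbarg->blkno, cbarg->relnamespace, cbarg->relname);
            break;
        case 2:                 /* vacuuming/cleaning up indexes */
            errcontext(_("while vacuuming index \"%s.%s\""),
                       cbarg->relnamespace, cbarg->relname);
            break;
    }
}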

Re: Increase psql's password buffer size

2020-01-22 Thread Daniel Gustafsson
> On 22 Jan 2020, at 01:41, David Fetter  wrote:
> 
> On Tue, Jan 21, 2020 at 07:05:47PM +0100, David Fetter wrote:
>> On Tue, Jan 21, 2020 at 10:23:59AM -0500, Bruce Momjian wrote:
>>> On Tue, Jan 21, 2020 at 04:19:13PM +0100, David Fetter wrote:
 On Tue, Jan 21, 2020 at 10:12:52AM -0500, Bruce Momjian wrote:
> I think we should be using a macro to define the maximum length, rather
> than have 100 used in various places.
 
 It's not just 100 in some places. It's different in different places,
 which goes to your point.
 
 How about using a system that doesn't meaningfully impose a maximum
 length? The shell variable is a const char *, so why not just
 re(p)alloc as needed?
>>> 
>>> Uh, how do you know how big to make the buffer that receives the read?
>> 
>> You can start at any size, possibly even 100, and then increase the
>> size in a loop along the lines of (untested)

It doesn't seem like a terribly safe pattern to have the client decide the read
buffer size without disclosing it, and have the server resize the input buffer
to an arbitrary size as input comes in.

>   my_size *= 2;
>   buf = (char *) realloc(buf, my_size);

I know it's just example code, but using buf as the input to realloc like this
risks a memory leak when realloc fails, as the original buf pointer is
overwritten. Using a temporary pointer for the return value avoids that, which
is how pg_repalloc and repalloc do it.

cheers ./daniel
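
For illustration, the safer shape of that loop body is roughly the following
sketch (variable names are placeholders; how the out-of-memory case is reported
is up to the caller):

    char       *tmp;

    my_size *= 2;
    tmp = realloc(buf, my_size);
    if (tmp == NULL)
    {
        free(buf);              /* buf is still valid here, so release it */
        return NULL;            /* or report out-of-memory as the caller prefers */
    }
    buf = tmp;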