Re: pgbench logging broken by time logic changes

2021-07-09 Thread Thomas Munro
On Fri, Jul 9, 2021 at 5:14 PM Fabien COELHO  wrote:
> Works for me: patch applies, global and local check ok. I'm fine with it.

I hoped we were done here but I realised that your check for 1-3 log
lines will not survive the harsh environment of the build farm.
Adding sleep(2) before the final doLog() confirms that.  I had two
ideas:

1.  Give up and expect 1-180 lines.  (180s is the current timeout
tolerance used elsewhere to deal with
swamped/valgrind/swapping/microsd computers, after a few rounds of
inflation, so if you need an arbitrarily large number to avoid
buildfarm measles, that's my suggestion.)
2.  Change doLog() to give up after end_time.  But then I think you'd
need to make it 0-3 :-(

I think the logging could be tightened up to work the way you expected
in future work, though.  Perhaps we could change all logging to work
with transaction start time instead of log-writing time, which doLog()
should receive.  If you never start a transaction after end_time, then
there can never be an aggregate that begins after that, and the whole
thing becomes more deterministic.  That kind of alignment of aggregate
timing with whole-run timing could also get rid of those partial
aggregates completely.  But that's an idea for 15.
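For what it's worth, here is a minimal standalone sketch of that idea
(hypothetical names, not pgbench's actual API): key each aggregate on the
transaction start time and never log work started at or after end_time, so
an N-second run produces a fixed number of aggregate lines.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef int64_t usec_t;

/* nothing that starts at or after the deadline ever reaches the log */
static bool
should_log(usec_t tx_start, usec_t end_time)
{
	return tx_start < end_time;
}

/* the aggregate bucket is derived from when the transaction started,
 * not from when the log line happens to be written */
static usec_t
agg_bucket(usec_t tx_start, usec_t run_start, usec_t agg_interval)
{
	return run_start + ((tx_start - run_start) / agg_interval) * agg_interval;
}

int
main(void)
{
	usec_t	run_start = 0, end_time = 3000000, interval = 1000000;

	for (usec_t t = 0; t < 4000000; t += 700000)
		printf("tx_start=%lld logged=%d bucket=%lld\n",
			   (long long) t, (int) should_log(t, end_time),
			   (long long) agg_bucket(t, run_start, interval));
	return 0;
}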

So I think we should do 1 for now.  Objections or better ideas?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-09 Thread Andres Freund
Hi,

On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > Currently, the TIDs of dead tuples are stored in an array that is
> > collectively allocated at the start of lazy vacuum and TID lookup uses
> > bsearch(). There are the following challenges and limitations:
> 
> > So I prototyped a new data structure dedicated to storing dead tuples
> > during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> > The authors provide an implementation of Roaring Bitmap[3]  (Apache
> > 2.0 license). But I've implemented this idea from scratch because we
> > need to integrate it with Dynamic Shared Memory/Area to support
> > parallel vacuum and need to support ItemPointerData, 6-bytes integer
> > in total, whereas the implementation supports only 4-bytes integers.
> > Also, when it comes to vacuum, we neither need to compute the
> > intersection, the union, nor the difference between sets, but need
> > only an existence check.
> > 
> > The data structure is somewhat similar to TIDBitmap. It consists of
> > the hash table and the container area; the hash table has entries per
> > block and each block entry allocates its memory space, called a
> > container, in the container area to store its offset numbers. The
> > container area is actually an array of bytes and can be enlarged as
> > needed. In the container area, the data representation of offset
> > numbers varies depending on their cardinality. It has three container
> > types: array, bitmap, and run.
> 
> How are you thinking of implementing iteration efficiently for rtbm? The
> second heap pass needs that obviously... I think the only option would
> be to qsort the whole thing?

I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.

The radix tree implementation I have basically maps an int64 to another
int64. Each level of the radix tree stores 6 bits of the key, and uses
those 6 bits to index a 1<<6 long array leading to the next level.

My first idea was to use itemptr_encode() to convert tids into an int64
and store the lower 6 bits in the value part of the radix tree. That
turned out to work well performance wise, but awfully memory usage
wise. The problem is that we at most use 9 bits for offsets, but reserve
16 bits for it in the ItemPointerData. Which means that there's often a
lot of empty "tree levels" for those 0 bits, making it hard to get to a
decent memory usage.

The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered datastructure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
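To make the bit-compression idea concrete, here is a rough standalone sketch
(the names and the 9-bit offset width are assumptions drawn from the
description above, not the actual code): instead of itemptr_encode()'s
block<<16 | offset layout, pack the key as block<<9 | offset so the tree
never sees the guaranteed-zero levels.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 9			/* offsets fit in 9 bits, not the 16 reserved */

static uint64_t
tid_to_key(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << OFFSET_BITS) | offset;
}

static void
key_to_tid(uint64_t key, uint32_t *block, uint16_t *offset)
{
	*block = (uint32_t) (key >> OFFSET_BITS);
	*offset = (uint16_t) (key & ((1u << OFFSET_BITS) - 1));
}

int
main(void)
{
	uint32_t	blk;
	uint16_t	off;

	key_to_tid(tid_to_key(123456, 17), &blk, &off);
	printf("block=%u offset=%u\n", blk, off);
	return 0;
}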

A variation of the storage approach is to just use the block number as
the index, and store the tids as the value. Even with the absolutely
naive approach of just using a Bitmapset that reduces memory usage
substantially - at a small cost to search performance. Of course it'd be
better to use an adaptive approach like you did for rtbm, I just thought
this is good enough.
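Roughly, the naive block-keyed variant could look like this standalone sketch
(hypothetical names, a fixed-size array standing in for a Bitmapset): the key
is just the block number and the value is a per-block bitmap of dead offsets,
so a lookup is a single bit test.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_OFFSETS 512			/* comfortably above MaxHeapTuplesPerPage */

typedef struct OffsetBitmap
{
	uint64_t	words[MAX_OFFSETS / 64];
} OffsetBitmap;

static void
bitmap_set(OffsetBitmap *bm, uint16_t off)
{
	bm->words[off / 64] |= UINT64_C(1) << (off % 64);
}

static bool
bitmap_test(const OffsetBitmap *bm, uint16_t off)
{
	return (bm->words[off / 64] >> (off % 64)) & 1;
}

int
main(void)
{
	OffsetBitmap bm;

	memset(&bm, 0, sizeof(bm));
	bitmap_set(&bm, 17);
	printf("17 dead? %d, 18 dead? %d\n",
		   (int) bitmap_test(&bm, 17), (int) bitmap_test(&bm, 18));
	return 0;
}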


This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.


The reason the memory usage can be larger for sparse workloads is that
they can lead to tree nodes with only one child. As those nodes are quite
large (1<<6 pointers to further children), that can then lead to a large
increase in memory usage.

I have toyed with implementing adaptively large radix nodes as proposed
in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.

Greetings,

Andres Freund




Re: pg_dump new feature: exporting functions only. Bad or good idea ?

2021-07-09 Thread Justin Pryzby
On Fri, Jul 09, 2021 at 07:43:02PM -0400, Tom Lane wrote:
> Tomas Vondra  writes:
> > The main question I have is whether this should include procedures.
> 
> I feel a bit uncomfortable about sticking this sort of limited-purpose
> selectivity mechanism into pg_dump.  I'd rather see a general filter
> method that can select object(s) of any type.  Pavel was doing some
> work towards that awhile ago, though I think he got frustrated about
> the lack of consensus on details.

That's this: https://commitfest.postgresql.org/33/2573/

-- 
Justin




Re: pg_dump new feature: exporting functions only. Bad or good idea ?

2021-07-09 Thread Tom Lane
Tomas Vondra  writes:
> The main question I have is whether this should include procedures.

I feel a bit uncomfortable about sticking this sort of limited-purpose
selectivity mechanism into pg_dump.  I'd rather see a general filter
method that can select object(s) of any type.  Pavel was doing some
work towards that awhile ago, though I think he got frustrated about
the lack of consensus on details.  Which is a problem, but I don't
think the solution is to accrue more and more independently-designed-
and-implemented features that each solve some subset of the big problem.

regards, tom lane




Re: [PATCH] Pull general SASL framework out of SCRAM

2021-07-09 Thread Jacob Champion
On Thu, 2021-07-08 at 16:27 +0900, Michael Paquier wrote:
> I agree that this looks like an improvement in terms of the
> expectations behind a SASL mechanism, so I have done the attached to
> strengthen a bit all those checks.  However, I don't really see a
> point in back-patching any of that, as SCRAM satisfies with its
> implementation already all those conditions AFAIK.

Agreed.

> Thoughts?

LGTM, thanks!

> +  *  outputlen: The length (0 or higher) of the client response buffer,
> +  * invalid if output is NULL.

nitpick: maybe "ignored" instead of "invalid"?

--Jacob


Re: pg_dump new feature: exporting functions only. Bad or good idea ?

2021-07-09 Thread Tomas Vondra
Hi,

I took a quick look at the patch today. There was some minor bitrot
requiring a rebase, so I attach the rebased patch as v3.

The separate 0002 part contains some minor fixes - a couple of
typos/rewordings in the docs (would be good if a native speaker looked at
it, though), and a slightly reworked chunk of code from pg_dump.c. The
change is more a matter of personal preference than correctness - it
just seems simpler this way, but ymmv. And I disliked that the comment
said "If we have to export only the functions .." but the if condition
was actually the exact opposite of that.

The main question I have is whether this should include procedures. I'd
probably argue procedures should be considered different from functions
(i.e. requiring a separate --procedures-only option), because it pretty
much is meant to be a separate object type. We don't allow calling DROP
FUNCTION on a procedure, etc. It'd be silly to introduce an unnecessary
ambiguity in pg_dump and have to deal with it sometime later.

I wonder if we should allow naming a function to dump, similarly to how
--table works for tables, for example.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From 1acb9eb47277e6c3cc7d46f1fb5bef0cf669cad6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 9 Jul 2021 22:47:23 +0200
Subject: [PATCH 1/2] v3

---
 doc/src/sgml/ref/pg_dump.sgml   | 39 --
 src/bin/pg_dump/pg_backup.h |  1 +
 src/bin/pg_dump/pg_dump.c   | 45 +++--
 src/test/modules/test_pg_dump/t/001_base.pl |  9 +
 4 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index a6c0788592..fcc3fc663c 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -123,6 +123,11 @@ PostgreSQL documentation
 Table data, large objects, and sequence values are dumped.

 
+   
+This option cannot be used together with the --schema-only,
+or --functions-only option.
+   
+

 This option is similar to, but for historical reasons not identical
 to, specifying --section=data.
@@ -136,13 +141,14 @@ PostgreSQL documentation
   

 Include large objects in the dump.  This is the default behavior
-except when --schema, --table, or
---schema-only is specified.  The -b
+except when --schema, --table,
+--schema-only, or --functions-only
+is specified.  The -b
 switch is therefore only useful to add large objects to dumps
 where a specific schema or table has been requested.  Note that
 blobs are considered data and therefore will be included when
 --data-only is used, but not
-when --schema-only is.
+when --schema-only or --functions-only are.

   
  
@@ -582,6 +588,11 @@ PostgreSQL documentation
 be dumped.

 
+   
+This option cannot be used together with the --functions-only
+option.
+   
+

 
  When -t is specified, pg_dump
@@ -790,7 +801,22 @@ PostgreSQL documentation
  
 
  
-  --if-exists
+  --functions-only
+  
+   
+Dmup only the stored functions and stored procedure definitions, not data, not
+other objects definition.
+   
+
+   
+This option is the same than --schema-only but restricted
+to stored functions and stored procedures.
+   
+  
+ 
+
+ 
+ --if-exists
   

 Use conditional commands (i.e., add an IF EXISTS
@@ -819,6 +845,11 @@ PostgreSQL documentation
 The only exception is that an empty pattern is disallowed.

 
+   
+This option cannot be used together with the --schema-only,
+or --functions-only option.
+   
+

 
  When --include-foreign-data is specified,
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 3c1cd858a8..e332d06317 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -153,6 +153,7 @@ typedef struct _dumpOptions
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
 	int			column_inserts;
+	int			functions_only;
 	int			if_exists;
 	int			no_comments;
 	int			no_security_labels;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 321152151d..7837656085 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -380,6 +380,7 @@ main(int argc, char **argv)
 		{"enable-row-security", no_argument, _row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
 		{"extra-float-digits", required_argument, NULL, 8},
+		{"functions-only", no_argument, &dopt.functions_only, 1},
 		{"if-exists", no_argument, _exists, 1},
 		{"inserts", no_argument, NULL, 9},
 		{"lock-wait-timeout", 

Re: Preventing abort() and exit() calls in libpq

2021-07-09 Thread Noah Misch
On Fri, Jul 09, 2021 at 10:06:18AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sat, Jul 03, 2021 at 06:44:20PM -0400, Tom Lane wrote:
> >> That'd require buildfarm owner intervention, as well as intervention
> >> by users.  Which seems like exporting our problems onto them.  I'd
> >> really rather not go that way if we can avoid it.
> 
> > I like that goal, though we'll have to see how difficult it proves.  As of
> > today, a GNU/Linux user building against static OpenLDAP will get a failure,
> > right?  That would export work onto that user, spuriously.
> 
> As a former packager for Red Hat, my response would be "you're doing it
> wrong".  Nobody on any Linux distro should *ever* statically link code
> from one package into code from another, because they are going to create
> untold pain for themselves when (not if) the first package is updated.
> So I flat out reject that as a valid use-case.
> 
> It may be that that ethos is not so strongly baked-in on other platforms.

Packagers do face more rules than users generally.

> But I'm content to wait and see if there are complaints before rescinding
> the automatic test; and if there are, I'd prefer to deal with it by just
> backing off to running the test on Linux only.

Okay.

> > We'd get something like 95% of the value by running the test on one Windows
> > buildfarm member and one non-Windows buildfarm member.
> 
> True.  But that just brings up the point that we aren't running the test
> at all on MSVC builds right now.  I have no idea how to do that, do you?

I don't.  But coverage via non-MSVC Windows is good enough.

> > ...  A strategy not having either of those drawbacks would be to skip
> > the test if libpq.so contains a definition of libpq_unbind().
> 
> I assume you meant some OpenLDAP symbol?

Yeah, that was supposed to say ldap_unbind().




Re: when the startup process doesn't (logging startup delays)

2021-07-09 Thread Robert Haas
Hi,

I'd really like to see this enabled by default, say with a default
interval of 10 seconds. If it has to be enabled explicitly, most
people won't, but I think a lot of people would benefit from knowing
why their system is slow to start up when that sort of thing happens.
I don't see much downside to having it on by default either, since it
shouldn't be expensive. I think the GUC's units should be seconds, not
milliseconds, though.

I tried starting the server with log_startup_progress_interval=1000
and then crashing it to see what the output looked like. I got this:

2021-07-09 15:49:55.956 EDT [99033] LOG:  all server processes
terminated; reinitializing
2021-07-09 15:49:55.970 EDT [99106] LOG:  database system was
interrupted; last known up at 2021-07-09 15:48:39 EDT
2021-07-09 15:49:56.499 EDT [99106] LOG:  Data directory sync (fsync)
complete after 529.673 ms
2021-07-09 15:49:56.501 EDT [99106] LOG:  database system was not
properly shut down; automatic recovery in progress
2021-07-09 15:49:56.503 EDT [99106] LOG:  redo starts at 0/223494A8
2021-07-09 15:49:57.504 EDT [99106] LOG:  Performing crash recovery,
elapsed time: 1000.373 ms, current LSN: 0/40A3F888
2021-07-09 15:49:58.505 EDT [99106] LOG:  Performing crash recovery,
elapsed time: 2001.449 ms, current LSN: 0/41F89388
2021-07-09 15:49:59.505 EDT [99106] LOG:  Performing crash recovery,
elapsed time: 3001.602 ms, current LSN: 0/55745760
2021-07-09 15:50:00.506 EDT [99106] LOG:  Performing crash recovery,
elapsed time: 4002.677 ms, current LSN: 0/60CB9FE0
2021-07-09 15:50:01.507 EDT [99106] LOG:  Performing crash recovery,
elapsed time: 5003.808 ms, current LSN: 0/6A2BBE10
2021-07-09 15:50:02.508 EDT [99106] LOG:  Performing crash recovery,
elapsed time: 6004.916 ms, current LSN: 0/71BA3F90
2021-07-09 15:50:03.385 EDT [99106] LOG:  invalid record length at
0/76BD80F0: wanted 24, got 0
2021-07-09 15:50:03.385 EDT [99106] LOG:  Crash recovery complete
after 6881.834 ms
2021-07-09 15:50:03.385 EDT [99106] LOG:  redo done at 0/76BD80C8
system usage: CPU: user: 2.77 s, system: 3.80 s, elapsed: 6.88 s
2021-07-09 15:50:04.778 EDT [99033] LOG:  database system is ready to
accept connections

A few observations on this:

- The messages you've added are capitalized, but the ones PostgreSQL
has currently are not. You should conform to the existing style.

- The "crash recovery complete" message looks redundant with the "redo
done" message. Also, in my mind, "redo" is one particular phase of
crash recovery, so it looks really odd that "crash recovery" finishes
first and then "redo" finishes. I think some thought is needed about
the terminology here.

- I'm not clear why I get a message about the data directory fsync but
not about resetting unlogged relations. I wasn't really expecting to
get a message about things that completed in less than the configured
interval, although I find that I don't mind having it there either.
But then it seems like it should be consistent across the various
things we're timing, and resetting unlogged relations should certainly
be one of those.

- The way you've coded this has some drift. In a perfect world, I'd
get a progress report at 1000.00 ms, 2000.00 ms, 3000.00 ms, etc.
That's never going to be the case, because there's always going to be
a slightly delay in responding to the timer interrupt. However, as
you've coded it, the delay increases over time. The first "Performing
crash recovery" message is only 0.373 ms late, but the last one is 4.916
ms late. To avoid this, you want to reschedule the timer interrupt
based on the time the last one was supposed to fire, not the time it
actually did fire. (Exception: If the time it actually did fire is
before the time it was supposed to fire, then use the time it actually
did fire instead. This protects against the system clock being set
backwards.)
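In other words, something along these lines (a standalone sketch of the
rescheduling rule, not the patch's code; the microsecond values are taken
from the log above):

#include <stdio.h>

typedef long long usec_t;

/* Compute the next firing time from the previous *scheduled* time, falling
 * back to the actual firing time only if the clock appears to have gone
 * backwards. */
static usec_t
next_deadline(usec_t scheduled, usec_t fired_at, usec_t interval)
{
	usec_t	base = (fired_at < scheduled) ? fired_at : scheduled;

	return base + interval;
}

int
main(void)
{
	usec_t	interval = 1000000;		/* 1 second */
	usec_t	scheduled = 1000000;	/* first deadline */
	usec_t	fired_at[] = {1000373, 2001449, 3001602};	/* actual firings */

	for (int i = 0; i < 3; i++)
	{
		scheduled = next_deadline(scheduled, fired_at[i], interval);
		printf("next deadline: %lld\n", scheduled);	/* stays on whole seconds */
	}
	return 0;
}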

...Robert




Re: Online verification of checksums

2021-07-09 Thread Ibrar Ahmed
On Tue, Mar 9, 2021 at 10:43 PM David Steele  wrote:

> On 11/30/20 6:38 PM, David Steele wrote:
> > On 11/30/20 9:27 AM, Stephen Frost wrote:
> >> * Michael Paquier (mich...@paquier.xyz) wrote:
> >>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
>  * Magnus Hagander (mag...@hagander.net) wrote:
> > On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier
> >  wrote:
> >> But here the checksum is broken, so while the offset is something we
> >> can rely on how do you make sure that the LSN is fine?  A broken
> >> checksum could perfectly mean that the LSN is actually *not* fine if
> >> the page header got corrupted.
> 
>  Of course that could be the case, but it gets to be a smaller and
>  smaller chance by checking that the LSN read falls within reasonable
>  bounds.
> >>>
> >>> FWIW, I find that scary.
> >>
> >> There's ultimately different levels of 'scary' and the risk here that
> >> something is actually wrong following these checks strikes me as being
> >> on the same order as random bits being flipped in the page and still
> >> getting a valid checksum (which is entirely possible with our current
> >> checksum...), or maybe even less.
> >
> > I would say a lot less. First you'd need to corrupt one of the eight
> > bytes that make up the LSN (pretty likely since corruption will probably
> > affect the entire block) and then it would need to be updated to a value
> > that falls within the current backup range, a 1 in 16 million chance if
> > a terabyte of WAL is generated during the backup. Plus, the corruption
> > needs to happen during the backup since we are going to check for that,
> > and the corrupted LSN needs to be ascending, and the LSN originally read
> > needs to be within the backup range (another 1 in 16 million chance)
> > since pages written before the start backup checkpoint should not be
> torn.
> >
> > So as far as I can see there are more likely to be false negatives from
> > the checksum itself.
> >
> > It would also be easy to add a few rounds of checks, i.e. test if the
> > LSN ascends but stays in the backup LSN range N times.
> >
> > Honestly, I'm much more worried about corruption zeroing the entire
> > page. I don't know how likely that is, but I know none of our proposed
> > solutions would catch it.
> >
> > Andres, since you brought this issue up originally perhaps you'd like to
> > weigh in?
>
> I had another look at this patch and though I think my suggestions above
> would improve the patch, I have no objections to going forward as is (if
> that is the consensus) since this seems an improvement over what we have
> now.
>
> It comes down to whether you prefer false negatives or false positives.
> With the LSN checking Stephen and I advocate it is theoretically
> possible to have a false negative but the chances of the LSN ascending N
> times but staying within the backup LSN range due to corruption seems to
> be approaching zero.
>
> I think Michael's method is unlikely to throw false positives, but it
> seems at least possible that a block would be hot enough to appear
> torn N times in a row. Torn pages themselves are really easy to reproduce.
>
> If we do go forward with this method I would likely propose the
> LSN-based approach as a future improvement.
>
> Regards,
> --
> -David
> da...@pgmasters.net
>
>
>
I am changing the status to "Waiting on Author" based on the latest
comments from David Steele, and because the patch does not apply cleanly.

http://cfbot.cputube.org/patch_33_2719.log



-- 
Ibrar Ahmed


Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2021-07-09 Thread Ibrar Ahmed
On Tue, Mar 30, 2021 at 12:12 PM Paul Guo  wrote:

> On 2021/3/27, 10:23 PM, "Alvaro Herrera"  wrote:
>
> > Hmm, can you post a rebased set, where the points under discussion
> > are marked in XXX comments explaining what the issue is?  This thread is
> > long and old enough that it's pretty hard to navigate the whole thing in
> > order to find out exactly what is being questioned.
>
> OK. Attached are the rebased version that includes the change I discussed
> in my previous reply. Also added POD documentation change for
> RecursiveCopy,
> and modified the patch to use the backup_options introduced in
> 081876d75ea15c3bd2ee5ba64a794fd8ea46d794 for tablespace mapping.
>
> >I think 0004 can be pushed without further ado, since it's a clear and
> >simple fix.  0001 needs a comment about the new parameter in
> >RecursiveCopy's POD documentation.
>
> Yeah, 0004 is not risky at all. One concern seemed to be the compatibility of
> some
> WAL dump/analysis tools(?). I have no idea about this. But if we do not
> backport
> 0004 we do not seem to need to worry about this.
>
> >As I understand, this is a backpatchable bug-fix.
>
> Yes.
>
> Thanks.
>
The patch does not apply successfully:
http://cfbot.cputube.org/patch_33_2161.log

Can you please rebase the patch?


-- 
Ibrar Ahmed


Re: enable_resultcache confusion

2021-07-09 Thread Robert Haas
On Fri, Jul 9, 2021 at 11:35 AM David Rowley  wrote:
> I really like that name.
>
> I'll wait to see if anyone else wants to voice their opinion before I
> do any renaming work.

I like it, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Query about time zone patterns in to_char

2021-07-09 Thread David Zhang
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Applied the patch `v2_support_of_tzh_tzm_patterns.patch` to the `REL_14_STABLE`
branch; both `make check` and `make check-world` passed.

Re: Synchronous commit behavior during network outage

2021-07-09 Thread Andrey Borodin



> On 3 July 2021, at 23:44, Jeff Davis wrote:
> 
> On Sat, 2021-07-03 at 14:06 +0500, Andrey Borodin wrote:
>>> But until you've disabled sync rep, the primary will essentially be
>>> down for writes whether using this new feature or not. Even if you
>>> can
>>> terminate some backends to try to free space, the application will
>>> just
>>> make new connections that will get stuck the same way.
>> 
>> Surely I'm talking about terminating postmaster, not individual
>> backends. But postmaster will need to terminate each running query.
>> We surely need to have a way to stop whole instance without making
>> any single query. And I do not like kill -9 for this purpose.
> 
> kill -6 would suffice.
SIGABRT is expected to generate a core dump, isn't it? Node failover is a
somewhat expected state in an HA system.

> 
> I see the point that you don't want this to interfere with an
> administrative shutdown. But it seems like most shutdowns will need to
> escalate to SIGABRT for cases where things are going badly wrong (low
> memory, etc.) anyway. I don't see a better solution here.
In my experience SIGTERM has coped fine so far.

> I don't fully understand why you'd be concerned about cancellation but
> not concerned about similar problems with termination, but if you think
> two GUCs are important I can do that.
I think 2 GUCs is a better solution than 1 GUC disabling both cancelation and 
termination.
It would be great if some other HA tool developers would chime in.

Thanks!

Best regards, Andrey Borodin.



Re: More time spending with "delete pending"

2021-07-09 Thread Alexander Lakhin
Hello Michael,
09.07.2021 08:52, Michael Paquier wrote:
> On Thu, Jul 08, 2021 at 11:00:00PM +0300, Alexander Lakhin wrote:
>> Beside the aforementioned test I can only propose the extended patch,
>> that incorporates the undo of the changes brought by bed90759f.
>> With this patch that test is passed.
> Checked and confirmed.  It is a nice test with IPC::Run you have here.
> Making things in win32stat.c more consistent with open.c surely is
> appealing.  One thing that I'd like to introduce in this patch, and
> also mentioned upthread, is to change the stat() call in open.c to use
> microsoft_native_stat().
>
> I have let pgbench run for a couple of hours with some concurrent
> activity using genfile.c, without noticing problems.  My environment
> is not representative of everything we can find out there on Windows,
> but it brings some confidence.
Thank you! I agree with your improvement. I can't remember why I
injected 'include "port.h"' in genfile.c.
Today I've rechecked the whole chain of includes and I see that stat() is
redefined as _pgstat64() in win32_port.h, which includes <sys/stat.h>.
genfile.c includes "postgres.h" (which includes win32_port.h indirectly)
and then includes <sys/stat.h> again, but the latter include should be
ignored due to the "#pragma once" in stat.h.
So I have no objection to the removal of that include.

Best regards,
Alexander




Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-07-09 Thread Tomas Vondra
Hi,

I took a quick look on this - I'm no expert in the details of snapshots,
so take my comments with a grain of salt.

AFAICS both Greg Nancarrow and Pavel Borisov are kinda right. I think
Greg is right that the v3 patch does not seem like the right (or correct)
solution, for a couple of reasons:


1) It fixes the symptoms, not the cause. If we set TransactionXmin to a
bogus value, this only fixes it locally in SubTransGetTopmostTransaction
but I'd be surprised if other places did not have the same issue.


2) The xid/TransactionXmin comparison in the v2 fix:

  xid = xid > TransactionXmin ? xid : TransactionXmin;

seems broken, because it compares the XIDs directly, but that's not
going to work after generating enough transactions.


3) But even if this uses TransactionIdFollowsOrEquals, it seems very
strange because the "xid" comes from

  XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))

i.e. it's the raw xmin from the tuple, so why should we be setting it to
TransactionXmin? That seems pretty strange, TBH.


So yeah, I think this is due to confusion with two snapshots and failing
to consider both of them when calculating TransactionXmin.

But I think removing one of the snapshots (as the v2 does it) is rather
strange too. I very much doubt having both the transaction and active
snapshots in the parallel worker is not intentional, and Pavel may very
well be right this breaks isolation levels that use the xact snapshot
(i.e. REPEATABLE READ and SERIALIZABLE). I haven't checked, though.

So I think we need to keep both snapshots, but fix TransactionXmin to
consider both of them - I suppose ParallelWorkerMain could simply look
at the two snapshots, and use the minimum. Which is what [1] (already
linked by Pavel) talks about, although that's related to concerns about
one of the processes dying, which is not what's happening here.
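For illustration, the "use the minimum" step would look roughly like this as
a standalone sketch (not the actual fix; the point is that the comparison has
to be wraparound-aware in the spirit of TransactionIdPrecedes(), not a
plain <):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;

/* modulo-2^32 comparison, valid for normal (non-permanent) XIDs */
static int
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* TransactionXmin should be the older of the two snapshots' xmins */
static Xid
older_xmin(Xid xact_snapshot_xmin, Xid active_snapshot_xmin)
{
	return xid_precedes(xact_snapshot_xmin, active_snapshot_xmin)
		? xact_snapshot_xmin
		: active_snapshot_xmin;
}

int
main(void)
{
	/* works even across wraparound: 4294967290 precedes 5 */
	printf("%u\n", older_xmin(4294967290u, 5));
	return 0;
}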


I'm wondering what consequences this may have on production systems,
though. We've only seen this failing because of the assert, so what
happens when the build has asserts disabled?

Looking at SubTransGetTopmostTransaction, it seems the while loop ends
immediately (because it's pretty much the opposite of the assert), so we
just return the "xid" as topmost XID. But will that produce incorrect
query results, or what?



regards


[1]
https://www.postgresql.org/message-id/20150208002027.GH9201%40alap3.anarazel.de


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-09 Thread Andres Freund
Hi,

On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> Currently, the TIDs of dead tuples are stored in an array that is
> collectively allocated at the start of lazy vacuum and TID lookup uses
> bsearch(). There are the following challenges and limitations:

> So I prototyped a new data structure dedicated to storing dead tuples
> during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> The authors provide an implementation of Roaring Bitmap[3]  (Apache
> 2.0 license). But I've implemented this idea from scratch because we
> need to integrate it with Dynamic Shared Memory/Area to support
> parallel vacuum and need to support ItemPointerData, 6-bytes integer
> in total, whereas the implementation supports only 4-bytes integers.
> Also, when it comes to vacuum, we neither need to compute the
> intersection, the union, nor the difference between sets, but need
> only an existence check.
> 
> The data structure is somewhat similar to TIDBitmap. It consists of
> the hash table and the container area; the hash table has entries per
> block and each block entry allocates its memory space, called a
> container, in the container area to store its offset numbers. The
> container area is actually an array of bytes and can be enlarged as
> needed. In the container area, the data representation of offset
> numbers varies depending on their cardinality. It has three container
> types: array, bitmap, and run.

How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?

Greetings,

Andres Freund




Re: short circuit suggestion in find_hash_columns()

2021-07-09 Thread Zhihong Yu
On Fri, Jul 9, 2021 at 8:28 AM David Rowley  wrote:

> On Sat, 10 Jul 2021 at 03:15, Zhihong Yu  wrote:
> > I was looking at find_hash_columns() in nodeAgg.c
> >
> > It seems the first loop tries to determine the max column number needed,
> along with whether all columns are needed.
> >
> > The loop can be re-written as shown in the patch.
>
> This runs during ExecInitAgg().  Do you have a test case where you're
> seeing any performance gains from this change?
>
> David
>

Hi,
I made some attempts at varying the related tests but haven't seen much
difference in performance.

Let me spend more time (possibly in off hours) on this.

Cheers


Tab completion for CREATE SCHEMA AUTHORIZATION

2021-07-09 Thread Dagfinn Ilmari Mannsåker
Hi Hackers,

I just noticed there's no tab completion for CREATE SCHEMA
AUTHORIZATION, nor for anything after CREATE SCHEMA <name>.

Please find attached a patch that adds this.

- ilmari

From db02df7ea8d3a5db41268edd8c311a3631c9e9ad Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Fri, 9 Jul 2021 17:15:00 +0100
Subject: [PATCH] Add tab completion for CREATE SCHEMA

---
 src/bin/psql/tab-complete.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 0ebd5aa41a..80f6001073 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -758,6 +758,13 @@ static const SchemaQuery Query_for_list_of_collations = {
 "   FROM pg_catalog.pg_roles "\
 "  WHERE substring(pg_catalog.quote_ident(rolname),1,%d)='%s'"
 
+#define Query_for_list_of_schema_roles \
+" SELECT pg_catalog.quote_ident(rolname) "\
+"   FROM pg_catalog.pg_roles "\
+"  WHERE substring(pg_catalog.quote_ident(rolname),1,%d)='%s'"\
+" UNION ALL SELECT 'CURRENT_USER'"\
+" UNION ALL SELECT 'SESSION_USER'"
+
 #define Query_for_list_of_grant_roles \
 " SELECT pg_catalog.quote_ident(rolname) "\
 "   FROM pg_catalog.pg_roles "\
@@ -2668,6 +2675,19 @@ psql_completion(const char *text, int start, int end)
 	else if (TailMatches("AS", "ON", "SELECT|UPDATE|INSERT|DELETE", "TO"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables, NULL);
 
+/* CREATE SCHEMA [ <name> ] [ AUTHORIZATION ] */
+	else if (Matches("CREATE", "SCHEMA"))
+		COMPLETE_WITH_QUERY(Query_for_list_of_schemas
+			" UNION ALL SELECT 'AUTHORIZATION'");
+	else if (Matches("CREATE", "SCHEMA", "AUTHORIZATION"))
+		COMPLETE_WITH_QUERY(Query_for_list_of_schema_roles);
+	else if (Matches("CREATE", "SCHEMA", MatchAny, "AUTHORIZATION"))
+		COMPLETE_WITH_QUERY(Query_for_list_of_schema_roles);
+	else if (Matches("CREATE", "SCHEMA", MatchAny, "AUTHORIZATION", MatchAny))
+		COMPLETE_WITH("CREATE", "GRANT");
+	else if (Matches("CREATE", "SCHEMA", MatchAny))
+		COMPLETE_WITH("AUTHORIZATION", "CREATE", "GRANT");
+
 /* CREATE SEQUENCE --- is allowed inside CREATE SCHEMA, so use TailMatches */
 	else if (TailMatches("CREATE", "SEQUENCE", MatchAny) ||
 			 TailMatches("CREATE", "TEMP|TEMPORARY", "SEQUENCE", MatchAny))
-- 
2.30.2



What are exactly bootstrap processes, auxiliary processes, standalone backends, normal backends(user sessions)?

2021-07-09 Thread Bharath Rupireddy
Hi,

I've always had a hard time distinguishing the various types of
processes/terms used in postgres. I look at the source code every time
to understand them, yet I don't feel satisfied with my understanding.
Could any hacker (with a better idea than me) help me understand
what each different process does and how they differ from each
other? Of course, I'm clear on normal backends (user sessions) and bg
workers, but the others need a bit more understanding.

I appreciate the help.

Regards,
Bharath Rupireddy.




Re: enable_resultcache confusion

2021-07-09 Thread David Rowley
On Fri, 9 Jul 2021 at 06:00, Tom Lane  wrote:
>
> Maybe name the plan node type Memoize, and the GUC "enable_memoize"?

I really like that name.

I'll wait to see if anyone else wants to voice their opinion before I
do any renaming work.

David




Re: short circuit suggestion in find_hash_columns()

2021-07-09 Thread David Rowley
On Sat, 10 Jul 2021 at 03:15, Zhihong Yu  wrote:
> I was looking at find_hash_columns() in nodeAgg.c
>
> It seems the first loop tries to determine the max column number needed, 
> along with whether all columns are needed.
>
> The loop can be re-written as shown in the patch.

This runs during ExecInitAgg().  Do you have a test case where you're
seeing any performance gains from this change?

David




Re: Record a Bitmapset of non-pruned partitions

2021-07-09 Thread David Rowley
On Thu, 1 Jul 2021 at 17:49, Amit Langote  wrote:
> Given that you're proposing more uses for live_parts, maybe he'd be
> open to the idea.

Just to make sure the new field in the 0001 patch gets good enough
use, I've attached the patch which includes more usages of the field.

0002 adds a new field named interleaved_parts to PartitionBoundInfo
which is populated for LIST partitioned tables with any partitions
which have interleaved values, e.g FOR VALUES IN(3,5) and another
partition with FOR VALUES IN(4), the 3,5 partition is "interleaved"
around the partition for 4.

This combined with recording "live_parts" in the 0001 patch allows us
to do ordered partition scans in many more cases for LIST partitioning
and 1 more case with RANGE partitioning.

create table mclparted (a int) partition by list(a);
create table mclparted1 partition of mclparted for values in(1);
create table mclparted2 partition of mclparted for values in(2);
create table mclparted3_5 partition of mclparted for values in(3,5);
create table mclparted4 partition of mclparted for values in(4);
create index on mclparted (a);

set enable_bitmapscan=0;
set enable_sort=0;

-- ordered scan using Append
explain (costs off) select * from mclparted where a in(1,2) order by a;
   QUERY PLAN
------------------------------------------------------------------------
 Append
   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
 Index Cond: (a = ANY ('{1,2}'::integer[]))
   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
 Index Cond: (a = ANY ('{1,2}'::integer[]))

-- no ordered scan due to interleaved partition. Must use Merge Append
explain (costs off) select * from mclparted where a in(3,4) order by a;
 QUERY PLAN
------------------------------------------------------------------------
 Merge Append
   Sort Key: mclparted.a
   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5 mclparted_1
 Index Cond: (a = ANY ('{3,4}'::integer[]))
   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_2
 Index Cond: (a = ANY ('{3,4}'::integer[]))

Currently, this is a bit more strict than maybe it needs to be. I'm
disabling the optimisation if any interleaved partitions remain after
pruning, however, it would be ok to allow them providing their
interleaved partner(s) were pruned.  I think making that work might be
a bit more costly as we'd need to track all partitions that were
interleaved with each interleaved partition and ensure those were all
pruned.  As far as I can see that requires storing a Bitmapset per
interleaved partition and makes the whole thing not so cheap.  I'd
really like to keep all this stuff cheap as possible.  That's why I
ended up calculating the interleaved partitions in
create_list_bounds() rather than partitions_are_ordered().

The good news is that the code in partitions_are_ordered() became even
more simple as a result of this change. We can do ordered scan simply
when !bms_overlap(live_parts, boundinfo->interleaved_parts).
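As a toy illustration of that check (a uint64 mask standing in for a
Bitmapset; hypothetical code, not the patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* ordered partition scans are allowed when no surviving partition is
 * interleaved, i.e. !bms_overlap(live_parts, interleaved_parts) */
static bool
partitions_are_ordered(uint64_t live_parts, uint64_t interleaved_parts)
{
	return (live_parts & interleaved_parts) == 0;
}

int
main(void)
{
	/* partitions 0..3: mclparted1, mclparted2, mclparted3_5, mclparted4;
	 * mclparted3_5 (bit 2) is interleaved around mclparted4 */
	uint64_t	interleaved = 1u << 2;

	printf("a in (1,2): %d\n",
		   (int) partitions_are_ordered((1u << 0) | (1u << 1), interleaved));
	printf("a in (3,4): %d\n",
		   (int) partitions_are_ordered((1u << 2) | (1u << 3), interleaved));
	return 0;
}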

The additional case we can now allow for RANGE partition is that we
can now do ordered scan when there is a DEFAULT partition but it was
pruned. Previously we had to disable the optimisation when there was a
DEFAULT partition as we had no idea if it was pruned or not.

David


v2-0002-Allow-ordered-partition-scans-in-more-cases.patch
Description: Binary data


v2-0001-Track-non-pruned-partitions-in-RelOptInfo.patch
Description: Binary data


short circuit suggestion in find_hash_columns()

2021-07-09 Thread Zhihong Yu
Hi,
I was looking at find_hash_columns() in nodeAgg.c

It seems the first loop tries to determine the max column number needed,
along with whether all columns are needed.

The loop can be re-written as shown in the patch.

In normal cases, we don't need to perform scanDesc->natts iterations.
In the best case, the loop would terminate after two iterations.
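To illustrate the kind of short-circuit I mean, here is a rough standalone
sketch (not the actual patch): walking backwards, the first needed column
gives the maximum, and the "all columns needed" check can bail out at the
first gap, which is where the best case of two iterations comes from.

#include <stdbool.h>
#include <stdio.h>

static void
scan_needed(const bool *needed, int natts, int *max_colno, bool *all_needed)
{
	int			i = natts - 1;

	*max_colno = 0;
	*all_needed = false;

	/* walk backwards: the first needed column we find is the maximum */
	while (i >= 0 && !needed[i])
		i--;
	if (i < 0)
		return;					/* no column is needed at all */
	*max_colno = i + 1;

	/* all columns are needed only if the maximum is the last column and
	 * there is no gap below it; stop at the first column that isn't needed */
	*all_needed = (*max_colno == natts);
	while (*all_needed && --i >= 0)
	{
		if (!needed[i])
			*all_needed = false;
	}
}

int
main(void)
{
	bool		needed[] = {true, false, true, false};
	int			max_colno;
	bool		all_needed;

	scan_needed(needed, 4, &max_colno, &all_needed);
	printf("max_colno=%d all_needed=%d\n", max_colno, (int) all_needed);
	return 0;
}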

Please provide your comment.

Thanks


find-hash-col-short-circuit.patch
Description: Binary data


Re: Grammar railroad diagram

2021-07-09 Thread Domingo Alvarez Duarte

Hello Andres!

Another way that I tested, and which works, is to use
https://www.bottlecaps.de/convert/: paste the postgresql grammar there,
press "convert" and then press "view diagram".


Again, optionally add the tokens manually for a better diagram!



// Tokens from postgresql-13.3/src/include/parser/kwlist.h

ABORT_P ::= "abort"
ABSOLUTE_P ::= "absolute"
ACCESS ::= "access"
ACTION ::= "action"

...



Cheers !

On 9/7/21 4:36, Andres Freund wrote:

Hi,

On 2021-07-03 10:39:02 +0200, Domingo Alvarez Duarte wrote:

I've done an experimental tool to convert bison grammars to a kind of EBNF
understood by https://www.bottlecaps.de/rr/ui

It'd be nice if you could share that tool. The diagrams this can generate
are neat...

Greetings,

Andres Freund





Re: Query about time zone patterns in to_char

2021-07-09 Thread Tomas Vondra



On 5/20/21 8:25 PM, Bruce Momjian wrote:
> On Thu, May 20, 2021 at 12:21:12PM +0530, Nitin Jadhav wrote:
>> Thanks Suraj for reviewing the patch.
>>
>>> 1:
>>> +RESET timezone;
>>> +
>>> +
>>> CREATE TABLE TIMESTAMPTZ_TST (a int , b timestamptz);
>>>
>>> Extra line.
>>>
>>> 2:
>>> +SET timezone = '00:00';
>>> +SELECT to_char(now(), 'of') as "Of", to_char(now(), 'tzh:tzm') as 
>>> "tzh:tzm";
>>
>> I have fixed these comments.
>>
>>> I am not sure whether we should backport this or not but I don't see any
>> issues with back-patching.
> 
> Only significant fixes are backpatched, not features.
> 

Yeah, does not seem to be worth it, as there seem to be no actual
reports of issues in the field.

FWIW there seem to be quite a few other to_char differences compared
to Oracle (judging by the docs and playing with sqlfiddle). But the patch
seems fine / simple enough and non-problematic, so perhaps let's just
get it committed?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fwd: Grammar railroad diagram

2021-07-09 Thread grd

Domingo, Bruce,

sorry for the error. It was caused by my server using Tomcat's default maxPostSize of 2MB, which
is not sufficient for the diagrams generated from the Postgres grammar. I have now extended it
to 10MB.

By the way, I had already created diagrams for PostgreSQL back in 2017, please find them here:

https://cdn.rawgit.com/GuntherRademacher/1e5a275f433fdc61bc4e81e24c287d67/raw/7c8599f5d2bf8450c52750abd70bb4bc90369bf8/gram.xhtml

At the time, this answered a question on StackOverflow, but apparently that question has been
deleted since. Those diagrams were created by pasting the content of

https://raw.githubusercontent.com/postgres/postgres/master/src/backend/parser/gram.y

to

https://bottlecaps.de/convert/

then clicking "Convert" and, on the bottom of the result page, "View Diagram".

Downloading diagrams now works after maxPostSize has been extended on my side.

Best regards
Gunther


Sent: Tuesday, 6 July 2021 at 20:04
From: "Domingo Alvarez Duarte"
To: g...@gmx.net
Subject: Fwd: Grammar railroad diagram


Hello Gunther!

I've sent this to postgresql and they reported the error below.

Cheers !

-------- Forwarded Message --------
Subject: Re: Grammar railroad diagram
Date: Tue, 6 Jul 2021 12:51:54 -0400
From: Bruce Momjian
To: Domingo Alvarez Duarte
CC: pgsql-hackers@lists.postgresql.org



On Sat, Jul 3, 2021 at 10:39:02AM +0200, Domingo Alvarez Duarte wrote:

I've done an experimental tool to convert bison grammars to a kind of EBNF
understood by https://www.bottlecaps.de/rr/ui to generate railroad diagrams; see
below the converted 'postgresql-13.3/src/backend/parser/gram.y', with some
hand-made changes to allow viewing it at https://www.bottlecaps.de/rr/ui (the order
of the rules could be changed for a better view of the railroad diagrams). Copy
and paste the EBNF below on https://www.bottlecaps.de/rr/ui, tab Edit Grammar,
then switch to the tab View Diagram.



That is pretty cool. I had trouble figuring out how to get it working,
so here are the steps I used:

1. save my attachment (created by Domingo)
2. go to https://www.bottlecaps.de/rr/ui
3. select "Edit Grammar"
4. choose "Browse" at the bottom
5. select the attachment you saved in #1
6. choose "Load" at the bottom
7. select "View Diagram"

You can even click on the yellow boxes to see the sub-grammar. People
have asked for railroad diagrams in the past, and this certainly
produces them, and "Options" allows many customizations.

I tried downloading as XHTML+SVG and HTML+PNG but got an error:

HTTP Status 500 – Internal Server Error

Type Exception Report

Message The multi-part request contained parameter data (excluding
uploaded files) that exceeded the limit for maxPostSize set on the
associated connector

Description The server encountered an unexpected condition that
prevented it from fulfilling the request.

It might be nice to download this output and host it on the Postgres
website at some point.
 
--
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.












Re: Preventing abort() and exit() calls in libpq

2021-07-09 Thread Tom Lane
Noah Misch  writes:
> On Sat, Jul 03, 2021 at 06:44:20PM -0400, Tom Lane wrote:
>> That'd require buildfarm owner intervention, as well as intervention
>> by users.  Which seems like exporting our problems onto them.  I'd
>> really rather not go that way if we can avoid it.

> I like that goal, though we'll have to see how difficult it proves.  As of
> today, a GNU/Linux user building against static OpenLDAP will get a failure,
> right?  That would export work onto that user, spuriously.

As a former packager for Red Hat, my response would be "you're doing it
wrong".  Nobody on any Linux distro should *ever* statically link code
from one package into code from another, because they are going to create
untold pain for themselves when (not if) the first package is updated.
So I flat out reject that as a valid use-case.

It may be that that ethos is not so strongly baked-in on other platforms.
But I'm content to wait and see if there are complaints before rescinding
the automatic test; and if there are, I'd prefer to deal with it by just
backing off to running the test on Linux only.

> We'd get something like 95% of the value by running the test on one Windows
> buildfarm member and one non-Windows buildfarm member.

True.  But that just brings up the point that we aren't running the test
at all on MSVC builds right now.  I have no idea how to do that, do you?

> ...  A strategy not having either of those drawbacks would be to skip
> the test if libpq.so contains a definition of libpq_unbind().

I assume you meant some OpenLDAP symbol?

> If any other
> dependency contains exit calls, we'd likewise probe for one symbol of that
> library and skip the test if presence of that symbol reveals static linking.
> (That's maintenance-prone in its own way, but a maintenance-free strategy has
> not appeared.)

I'm more worried about the risk of failing to detect problems at all,
in case somebody fat-fingers things in a way that causes the test to
be skipped everywhere.

I'll keep that way in mind if we conclude that the existing way is
unworkable, but so far I don't think it is.

regards, tom lane




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-09 Thread Alvaro Herrera
On 2021-Jul-09, Amul Sul wrote:

> > On Tue, Jul 6, 2021 at 11:06 PM Tom Lane  wrote:

> > > The point of the static-inline function idea was to be cheap enough
> > > that it isn't worth worrying about this sort of risky optimization.
> > > Given that an smgr function is sure to involve some kernel calls,
> > > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > > So where I was hoping to get to is that smgr objects are *only*
> > > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > > other pointers to them across any non-smgr operations.

> Herewith attached version did the same, thanks.

I think it would be valuable to have a comment in that function to point
out what the function is there for.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."  (E. Dijkstra)




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-07-09 Thread Amul Sul
On Wed, Jul 7, 2021 at 9:44 AM Amul Sul  wrote:
>
> On Tue, Jul 6, 2021 at 11:06 PM Tom Lane  wrote:
> >
> > Amul Sul  writes:
> > > On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
> > >  wrote:
> > >> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
> > >> >member alone and there's not the previous call to
> > >> RelationGetSmgr just above. How about using a temporary variable?
> > >>
> > >> SMgrRelation srel = RelationGetSmgr(index);
> > >> smgrwrite(srel, ...);
> > >> log_newpage(srel->..);
> >
> > > Understood.  Used a temporary variable for the place where
> > > RelationGetSmgr() calls are placed too close or in a loop.
> >
> > [ squint... ]  Doesn't this risk introducing exactly the sort of
> > cache-clobber hazard we're trying to prevent?  That is, the above is
> > not safe unless you are *entirely* certain that there is not and never
> > will be any possibility of a relcache flush before you are done using
> > the temporary variable.  Otherwise it can become a dangling pointer.
> >
>
> Yeah, there will be a hazard; even if we are sure it is right now, we cannot
> guarantee future changes in any subroutine that could get called in between.
>
> > The point of the static-inline function idea was to be cheap enough
> > that it isn't worth worrying about this sort of risky optimization.
> > Given that an smgr function is sure to involve some kernel calls,
> > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > So where I was hoping to get to is that smgr objects are *only*
> > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > other pointers to them across any non-smgr operations.
> >
>
> Ok, will revert changes added in  the previous version, thanks.
>

Herewith attached version did the same, thanks.
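For readers without the attachment, here is a standalone illustration of the
rule Tom describes above (mock types and names, not PostgreSQL code): callers
never cache the pointer across operations and always go through an inline
accessor that re-opens the handle if a cache flush has cleared it.

#include <stddef.h>
#include <stdio.h>

typedef struct SMgr { int fd; } SMgr;
typedef struct Rel  { SMgr *smgr; } Rel;

static SMgr the_smgr = { 42 };

static SMgr *
smgr_open(Rel *rel)
{
	rel->smgr = &the_smgr;		/* pretend to (re)open the storage manager */
	return rel->smgr;
}

static inline SMgr *
rel_get_smgr(Rel *rel)
{
	if (rel->smgr == NULL)		/* cleared by a (simulated) relcache flush */
		smgr_open(rel);
	return rel->smgr;
}

int
main(void)
{
	Rel		rel = { NULL };

	printf("fd=%d\n", rel_get_smgr(&rel)->fd);
	rel.smgr = NULL;			/* simulate a cache invalidation */
	printf("fd=%d\n", rel_get_smgr(&rel)->fd);	/* accessor recovers */
	return 0;
}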

Regards,
Amul


v4_Add-RelationGetSmgr-inline-function.patch
Description: Binary data


Re: Support kerberos authentication for postgres_fdw

2021-07-09 Thread Tom Lane
Peifeng Qiu  writes:
> I'd like to add kerberos authentication support for postgres_fdw by adding two
> options to user mapping: krb_client_keyfile and gssencmode.

As you note, this'd have to be restricted to superusers, which makes it
seem like a pretty bad idea.  We really don't want to be in a situation
of pushing people to run day-to-day stuff as superuser.  Yeah, having
access to kerberos auth sounds good on the surface, but it seems like
it would be a net loss in security because of that.

Is there some other way?

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2021-07-09 Thread Masahiko Sawada
On Fri, Jul 9, 2021 at 3:26 PM Fujii Masao  wrote:
>
>
>
> On 2021/06/30 10:05, Masahiko Sawada wrote:
> > I've attached the new version patch that incorporates the comments
> > from Fujii-san and Ikeda-san I got so far.
>
> Thanks for updating the patches!
>
> I'm now reading 0001 and 0002 patches and wondering if we can commit them
> at first because they just provide independent basic mechanism for
> foreign transaction management.
>
> One question regarding them is; Why did we add new API only for "top" foreign
> transaction? Even with those patches, old API (CallSubXactCallbacks) is still
> being used for foreign subtransaction and xact_depth is still being managed
> in postgres_fdw layer (not PostgreSQL core). Is this intentional?

Yes, it's not needed for 2PC support, and I was also concerned about adding
complexity to the core by adding a new API for subtransactions that is
not strictly necessary for 2PC.

> As far as I read the code, keep using old API for foreign subtransaction 
> doesn't
> cause any actual bug. But it's just strange and half-baked to manage top and
> sub transaction in the differenet layer and to use old and new API for them.

That's a valid concern. I'm really not sure what we should do here, but
I guess that even if we want to support subtransactions we would have another
API dedicated to subtransaction commit and rollback.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PATCH] Automatic HASH and LIST partition creation

2021-07-09 Thread Robert Haas
On Tue, Mar 2, 2021 at 3:26 PM Justin Pryzby  wrote:
> I don't know what committers will say, but I think that "ALTER TABLE" might be
> the essential thing for this patch to support, not "CREATE".  (This is similar
> to ALTER..SET STATISTICS, which is not allowed in CREATE.)
>
> The reason is that ALTER is what's important for RANGE partitions, which need
> to be created dynamically (for example, to support time-series data
> continuously inserting data around 'now').  I assume it's sometimes also
> important for LIST.  I think this patch should handle those cases better
> before being committed, or else we risk implementing grammar and other
> user-facing interface that fails to handle what's needed into the future
> (or that's non-essential).
> Even if dynamic creation isn't implemented yet, it seems important to at least
> implement the foundation for setting the configuration to *allow* that in the
> future, in a manner that's consistent with the initial implementation for
> "static" partitions.

I don't think it's a hard requirement, but it's an interesting point.
My initial reactions to the patch are:

- I don't think it's a very good idea to support LIST and HASH but not
RANGE. We need a design that can work for all three partitioning
strategies, even if we don't have support for all of them in the
initial patch. If they CAN all be in the same patch, so much the
better.

- I am not very impressed with the syntax. CONFIGURATION is an odd
word that seems too generic for what we're talking about here. It
would be tempting to use a connecting word like WITH or USING except
that both would be ambiguous here, so we can't. MySQL and Oracle use
the keyword PARTITIONS -- which I realize isn't a keyword at all in
PostgreSQL right now -- to introduce the partition specification. DB2
uses no keyword at all; it seems you just say PARTITION BY
(mypartitioncol) (...partition specifications go here...). I think
either approach could work for us. Avoiding the extra keyword is a
plus, especially since I doubt we're likely to support the exact
syntax that Oracle and MySQL offer anyway - though if we do, then I'd
be in favor of inserting the PARTITIONS keyword so that people's SQL
can work without modification.

- We need to think a little bit about exactly what we're trying to do.
The simplest imaginable thing here would be to just give people a
place to put a bunch of partition specifications. So you can imagine
letting someone say PARTITION BY HASH (FOR VALUES WITH (MODULUS 2,
REMAINDER 0), FOR VALUES WITH (MODULUS 2, REMAINDER 1)). However, the
patch quite rightly rejects that approach in favor of the theory that,
at CREATE TABLE time, you're just going to want to give a modulus and
have the system create one partition for every possible remainder. But
that could be expressed even more compactly than what the patch does.
Instead of saying PARTITION BY HASH CONFIGURATION (MODULUS 4) we could
just let people say PARTITION BY HASH (4) or probably even PARTITION
BY HASH 4.

- For list partitioning, the patch falls back to just letting you put
a bunch of VALUES IN clauses in the CREATE TABLE statement. I don't
find something like PARTITION BY LIST CONFIGURATION (VALUES IN (1, 2),
(1, 3)) to be particularly readable. What are all the extra keywords
adding? We could just say PARTITION BY LIST ((1, 2), (1, 3)). I think
I would find that easier to remember; not sure what other people
think. As an alternative, PARTITION BY LIST VALUES IN (1, 2), (1, 3)
looks workable, too.

- What about range partitioning? This is an interesting case because
while in theory you could leave gaps between range partitions, in
practice people probably don't want to do that very often, and it
might be better to have a simpler syntax that caters to the common
case, since people can always create partitions individually if they
happen to want gaps. So you can imagine making something like
PARTITION BY RANGE ((MINVALUE), (42), (163)) mean create two
partitions, one from (MINVALUE) to (42) and the other from (42) to
(163). I think that would be pretty useful.

- Another possible separating keyword here would be INITIALLY, which
is already a parser keyword. So then you could have stuff like
PARTITION BY HASH INITIALLY 4, PARTITION BY LIST INITIALLY ((1, 2),
(1, 3)), PARTITION BY RANGE INITIALLY ((MINVALUE), (42), (163)).

- The patch doesn't document the naming convention for the
automatically created partitions, and it is worth thinking a bit about
how that is going to work. Do people want to be able to specify the
names of the partitions when they are using this syntax, or are
they happy with automatically generated names? If the latter, are they
happy with THESE automatically generated names? I guess for HASH
appending _%d where %d is the partition number is fine, but it is not
necessarily so great for LIST. If I said CREATE TABLE foo ... PARTITION BY LIST
(('en'), ('ru'), ('jp')) I think I'd be hoping to end up with
partitions named foo_en, foo_ru, 

Support kerberos authentication for postgres_fdw

2021-07-09 Thread Peifeng Qiu
Hi hackers,

I'd like to add kerberos authentication support for postgres_fdw by adding two
options to user mapping: krb_client_keyfile and gssencmode.

In the backend we have the krb_server_keyfile option to specify a keytab file to
be used by the postgres server; krb_client_keyfile does mostly the same thing.
This allows postgres_fdw (the backend process) to authenticate on behalf of a
logged-in user who is querying the foreign table. The credential is kept in
the backend process memory instead of a local file to prevent abuse by users
on the same host.

Because the backend process is accessing the filesystem of the server host, this
option should only be manipulated by a superuser. Otherwise, a normal user may
steal the identity or probe the server filesystem. This principle is the same as
for the sslcert and sslkey options in user mapping.
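
For illustration, with this patch a superuser could set up a mapping like
the following (the server name, role name and keytab path are just
placeholders):

CREATE USER MAPPING FOR app_user SERVER foreign_pg
    OPTIONS (krb_client_keyfile '/etc/postgresql/app_user.keytab',
             gssencmode 'require');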

Thoughts?

Best regards,
Peifeng

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 82aa14a65d..ba2edf0cba 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -333,6 +333,38 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
 		 entry->conn, server->servername, user->umid, user->userid);
 }
 
+
+/*
+ * Look for "krb_client_keyfile" option in user mapping.
+ * If found, set the KRB5_CLIENT_KTNAME environment just like we do
+ * in backend. Also remove this option from option list.
+ *
+ * If there are multiple principals in the keytab file, only the
+ * default(first) one will be selected.
+ */
+static void setup_kerberos_env(UserMapping *user)
+{
+	ListCell *lc;
+	DefElem*d;
+	StringInfo si = NULL;
+	foreach(lc, user->options)
+	{
+		d = (DefElem *) lfirst(lc);
+		if (strcmp(d->defname, "krb_client_keyfile") == 0)
+		{
+			si = makeStringInfo();
+			appendStringInfo(si, "KRB5_CLIENT_KTNAME=%s", defGetString(d));
+			user->options = list_delete_cell(user->options, lc);
+			putenv(si->data);
+			/* We don't want credential to be cached as file, keep it within current process.
+			 * The credential can be reused within the same session.
+			 */
+			putenv("KRB5CCNAME=MEMORY:");
+			return;
+		}
+	}
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
@@ -350,6 +382,8 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		const char **values;
 		int			n;
 
+		setup_kerberos_env(user);
+
 		/*
 		 * Construct connection params from generic options of ForeignServer
 		 * and UserMapping.  (Some of them might not be libpq options, in
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index b8561d6a3c..afb2dc9b14 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -188,7 +188,7 @@ ALTER USER MAPPING FOR public SERVER testserver1
 ALTER USER MAPPING FOR public SERVER testserver1
 	OPTIONS (ADD sslmode 'require');
 ERROR:  invalid option "sslmode"
-HINT:  Valid options in this context are: user, password, sslpassword, password_required, sslcert, sslkey
+HINT:  Valid options in this context are: user, password, sslpassword, password_required, sslcert, sslkey, krb_client_keyfile, gssencmode
 -- But we can add valid ones fine
 ALTER USER MAPPING FOR public SERVER testserver1
 	OPTIONS (ADD sslpassword 'dummy');
@@ -196,6 +196,8 @@ ALTER USER MAPPING FOR public SERVER testserver1
 -- permitted to check validation.
 ALTER USER MAPPING FOR public SERVER testserver1
 	OPTIONS (ADD sslkey 'value', ADD sslcert 'value');
+ALTER USER MAPPING FOR public SERVER testserver1
+	OPTIONS (ADD krb_client_keyfile 'value', ADD gssencmode 'value');
 ALTER FOREIGN TABLE ft1 OPTIONS (schema_name 'S 1', table_name 'T 1');
 ALTER FOREIGN TABLE ft2 OPTIONS (schema_name 'S 1', table_name 'T 1');
 ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
@@ -9298,6 +9300,13 @@ HINT:  User mappings with the sslcert or sslkey options set may only be created
 ALTER USER MAPPING FOR CURRENT_USER SERVER loopback_nopw OPTIONS (ADD sslkey 'foo.key');
 ERROR:  sslcert and sslkey are superuser-only
 HINT:  User mappings with the sslcert or sslkey options set may only be created or modified by the superuser
+-- unpriv user also cannot set krb_client_keyfile / gssencmode on the user mapping
+ALTER USER MAPPING FOR CURRENT_USER SERVER loopback_nopw OPTIONS (ADD krb_client_keyfile 'foo.ktab');
+ERROR:  krb_client_keyfile and gssencmode are superuser-only
+HINT:  User mappings with the krb_client_keyfile or gssencmode options set may only be created or modified by the superuser
+ALTER USER MAPPING FOR CURRENT_USER SERVER loopback_nopw OPTIONS (ADD gssencmode 'require');
+ERROR:  krb_client_keyfile and gssencmode are superuser-only
+HINT:  User mappings with the krb_client_keyfile or gssencmode options set may only be created or modified by the superuser
 -- We're done with the role named after a specific user and need to check the
 -- changes to the public mapping.
 


Re: Why ALTER SUBSCRIPTION ... SET (slot_name='none') requires subscription disabled?

2021-07-09 Thread Ranier Vilela
On Thu, Jul 8, 2021 at 23:50, Japin Li 
wrote:

>
> On Thu, 08 Jul 2021 at 18:17, Amit Kapila  wrote:
> > On Thu, Jul 8, 2021 at 3:43 PM Japin Li  wrote:
> >>
> >> On Thu, 08 Jul 2021 at 17:51, Amit Kapila 
> wrote:
> >> > On Wed, Jul 7, 2021 at 7:25 PM Japin Li  wrote:
> >> >>
> >> >> Hi, hackers
> >> >>
> >> >> The documentation [1] says:
> >> >>
> >> >> When dropping a subscription that is associated with a replication
> slot on the
> >> >> remote host (the normal state), DROP SUBSCRIPTION will connect to
> the remote
> >> >> host and try to drop the replication slot as part of its operation.
> This is
> >> >> necessary so that the resources allocated for the subscription on
> the remote
> >> >> host are released. If this fails, either because the remote host is
> not
> >> >> reachable or because the remote replication slot cannot be dropped
> or does not
> >> >> exist or never existed, the DROP SUBSCRIPTION command will fail. To
> proceed in
> >> >> this situation, disassociate the subscription from the replication
> slot by
> >> >> executing ALTER SUBSCRIPTION ... SET (slot_name = NONE).
> >> >>
> >> >> However, when I try this, it complains the subscription is enabled,
> this command
> >> >> requires the subscription disabled. Why we need this limitation?
> >> >>
> >> >
> >> > If we don't have this limitation then even after you have set the slot
> >> > name to none, the background apply worker corresponding to that
> >> > subscription will continue to stream changes via the previous slot.
> >> >
> >>
> >> Yeah, thanks for your explain! Should we add some comments here?
> >>
> >
> > Sure, but let's keep that as a separate HEAD-only patch.
>
> Please consider review v3 patch. v3-0001 adds slot_name verification in
> parse_subscription_options() and comments for why we need disable
> subscription
> where set slot_name to NONE. v3-0002 comes from Ranier Vilela, it reduce
> the
> overhead strlen in ReplicationSlotValidateName().
>
+1 Seems good.

regards,
Ranier Vilela


Re: ERROR: "ft1" is of the wrong type.

2021-07-09 Thread Kyotaro Horiguchi
At Fri, 9 Jul 2021 11:03:56 +0900, Michael Paquier  wrote 
in 
> On Fri, Jul 09, 2021 at 10:44:13AM +0900, Kyotaro Horiguchi wrote:
> > Sounds reasonable. So the attached are that for PG11-PG14.  11 and 12
> > shares the same patch.
> 
> How much do the regression tests published upthread in
> https://postgr.es/m/20210219.173039.609314751334535042.horikyota@gmail.com
> apply here?  Shouldn't we also have some regression tests for the new
> error cases you are adding?  I agree that we'd better avoid removing

Mmm. OK, I distributed the original regression test among the versions.

PG11, 12:

 - ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX

   Added.

 - ATT_TABLE | ATT_PARTITIONED_INDEX

   This test doesn't detect the "is of the wrong type" issue.

   The item is practically a dead one since the combination is caught
   by transformPartitionCmd before visiting ATPrepCmd, which emits a
   bit different error message for the test.

 "\"%s\" is not a partitioned table or index"

   ATPrepCmd emits an error that:

 "\"%s\" is not a table or partitioned index"

   Hmm.. somewhat funny.  Actually ATT_TABLE is a bit off here but
   there's no symbol ATT_PARTITIONED_TABLE.  Theoretically the symbol
   is needed but practically not.  I don't think we need to do more
   than that at least for these versions.  (Or we don't even need to
   add this item.)

PG13:

 - ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX

   Same to PG12.

 - ATT_TABLE | ATT_PARTITIONED_INDEX:

   This version reaches this item in ATPrepCmd because the commit
   1281a5c907 moved the parse-transform phase to the ATExec stage,
   which is visited after ATPrepCmd.

   On the other hand, when the target relation is a regular table, the
   error is missed by ATPrepCmd and control reaches the
   Exec stage. The error is finally caught by transformPartitionCmd.

   Of course this works fine but doesn't seem clean, but it is
   apparently a matter of the master branch.

 - ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX | 
ATT_FOREIGN_TABLE
   Added and works as expected.

PG14:

 - ATT_INDEX

   I noticed that this combination has been reverted by the commit
   ec48314708.

 - ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX
 - ATT_TABLE | ATT_PARTITIONED_INDEX:
 - ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX | 
ATT_FOREIGN_TABLE

   Same as PG13.

   So, PG14 and 13 share the same fix and test.

> error cases you are adding?  I agree that we'd better avoid removing
> those entries, one argument in favor of not removing any entries being
> that this could have an impact on forks.

Ok. The attached are the two patchsets for PG14-13 and PG12-11
containing the fix and the regression test.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 80fa3ae43f5697a562b2ba460a50cdedf8d53e55 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 8 Jul 2021 17:55:25 +0900
Subject: [PATCH v3 1/2] Add missing targets in ATWrongRelkindError

ATWrongRelkindError yields an ambiguous message ""tbl" is of the wrong
type" due the lack of some combinations of allowed_targets. Add the
missing items. We don't remove apparently unused entries in case
someone is using them.
---
 src/backend/commands/tablecmds.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 97a9725df7..d02e01cb20 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6029,6 +6029,12 @@ ATWrongRelkindError(Relation rel, int allowed_targets)
 		case ATT_TABLE | ATT_MATVIEW | ATT_INDEX:
 			msg = _("\"%s\" is not a table, materialized view, or index");
 			break;
+		case ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX:
+			msg = _("\"%s\" is not a table, materialized view, index, or partitioned index");
+			break;
+		case ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_PARTITIONED_INDEX | ATT_FOREIGN_TABLE:
+			msg = _("\"%s\" is not a table, materialized view, index, partitioned index, or foreign table");
+			break;
 		case ATT_TABLE | ATT_MATVIEW | ATT_FOREIGN_TABLE:
 			msg = _("\"%s\" is not a table, materialized view, or foreign table");
 			break;
@@ -6041,6 +6047,9 @@ ATWrongRelkindError(Relation rel, int allowed_targets)
 		case ATT_TABLE | ATT_MATVIEW | ATT_INDEX | ATT_FOREIGN_TABLE:
 			msg = _("\"%s\" is not a table, materialized view, index, or foreign table");
 			break;
+		case ATT_TABLE | ATT_PARTITIONED_INDEX:
+			msg = _("\"%s\" is not a table or partitioned index");
+			break;
 		case ATT_VIEW:
 			msg = _("\"%s\" is not a view");
 			break;
-- 
2.27.0

>From 3fc4f789006cb6f6239dda7e6f7c817f9da41812 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 9 Jul 2021 20:19:23 +0900
Subject: [PATCH v3 2/2] regress - Add missing targets in ATWrongRelkindError

---
 src/test/regress/expected/alter_table.out  | 7 +++
 src/test/regress/expected/foreign_data.out | 4 

Re: bugfix: when the blocksize is 32k, the function page_header of pageinspect returns negative numbers.

2021-07-09 Thread Bharath Rupireddy
On Fri, Jul 9, 2021 at 8:43 AM Quan Zongliang  wrote:
> Thanks for the comments.
> Done

Thanks for the patch. Few comments:

1) How about just adding a comment /* support for old extension
version */ before INT2OID handling?
+ case INT2OID:
+ values[3] = UInt16GetDatum(page->pd_lower);
+ break;

2) Do we ever reach the error statement elog(ERROR, "incorrect output
types");? The function is defined with either smallint or int, so I
don't think we ever reach it. Instead, an assertion would work, as
suggested by Michael.

3) Isn't this test case unstable when regression tests are run with a
different BLCKSZ setting? Or is it okay, given that some of the other
tests for pageinspect already output page_size, e.g. hash_page_stats?
+SELECT pagesize, version FROM page_header(get_raw_page('test1', 0));
+ pagesize | version
+--+-
+ 8192 |   4

4) Can we arrange pageinspect--1.8--1.9.sql  into the first line itself?
+DATA =  pageinspect--1.9--1.10.sql \
+ pageinspect--1.8--1.9.sql \

5) How about using "int4" instead of just "int", just for readability?

6) How about something like below instead of repetitive switch statements?
static inline Datum
get_page_header_attr(TupleDesc desc, int attno, int val)
{
Oid atttypid;
Datum datum;

atttypid = TupleDescAttr(desc, attno)->atttypid;
Assert(atttypid == INT2OID || atttypid == INT4OID);

if (atttypid == INT2OID)
datum = UInt16GetDatum(val);
else if (atttypid == INT4OID)
datum = Int32GetDatum(val);

return datum;
}

values[3] = get_page_header_attr(tupdesc, 3, page->pd_lower);
values[4] = get_page_header_attr(tupdesc, 4, page->pd_upper);
values[5] = get_page_header_attr(tupdesc, 5, page->pd_special);
values[6] = get_page_header_attr(tupdesc, 6, PageGetPageSize(page));

Regards,
Bharath Rupireddy.




Re: Transactions involving multiple postgres foreign servers, take 2

2021-07-09 Thread Masahiko Sawada
Sorry for the late reply.

On Tue, Jul 6, 2021 at 3:15 PM r.takahash...@fujitsu.com
 wrote:
>
> Hi,
>
>
> I'm interested in this patch and I also run the same test with Ikeda-san's 
> fxact_update.pgbench.

Thank you for testing!

> In my environment (poor spec VM), the result is following.
>
> * foreign_twophase_commit = disabled
> 363tps
>
> * foreign_twophase_commit = required (It is necessary to set -R ${RATE} as 
> Ikeda-san said)
> 13tps
>
>
> I analyzed the bottleneck using pstack and strace.
> I noticed that the open() during "COMMIT PREPARED" command is very slow.
>
> In my environment the latency of the "COMMIT PREPARED" is 16ms.
> (On the other hand, the latency of "COMMIT" and "PREPARE TRANSACTION" is 1ms)
> In the "COMMIT PREPARED" command, open() for wal segment file takes 14ms.
> Therefore, open() is the bottleneck of "COMMIT PREPARED".
> Furthermore, I noticed that the backend process almost always open the same 
> wal segment file.
>
> In the current patch, the backend process on foreign server which is 
> associated with the connection from the resolver process always run "COMMIT 
> PREPARED" command.
> Therefore, the wal segment file of the current "COMMIT PREPARED" command 
> probably be the same with the previous "COMMIT PREPARED" command.
>
> In order to improve the performance of the resolver process, I think it is 
> useful to skip closing wal segment file during the "COMMIT PREPARED" and 
> reuse file descriptor.
> Is it possible?

Not sure but it might be possible to keep holding an xlogreader for
reading PREPARE WAL records even after the transaction commit. But I
wonder how much open() for wal segment file accounts for the total
execution time of 2PC. 2PC requires 2 network round trips for each
participant. For example, if it took 500ms in total, we would not get
benefits much from the point of view of 2PC performance even if we
improved it from 14ms to 1ms.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Transactions involving multiple postgres foreign servers, take 2

2021-07-09 Thread Masahiko Sawada
Sorry for the late reply.

On Mon, Jul 5, 2021 at 3:29 PM Masahiro Ikeda  wrote:
>
>
>
> On 2021/06/30 10:05, Masahiko Sawada wrote:
> > On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda  
> > wrote:
> >>
> >> Hi Jamison-san, sawada-san,
> >>
> >> Thanks for testing!
> >>
> >> FWIF, I tested using pgbench with "--rate=" option to know the server
> >> can execute transactions with stable throughput. As sawada-san said,
> >> the latest patch resolved second phase of 2PC asynchronously. So,
> >> it's difficult to control the stable throughput without "--rate=" option.
> >>
> >> I also worried what I should do when the error happened because to increase
> >> "max_prepared_foreign_transaction" doesn't work. Since too overloading may
> >> show the error, is it better to add the case to the HINT message?
> >>
> >> BTW, if sawada-san already develop to run the resolver processes in 
> >> parallel,
> >> why don't you measure performance improvement? Although Robert-san,
> >> Tunakawa-san and so on are discussing what architecture is best, one
> >> discussion point is that there is a performance risk if adopting 
> >> asynchronous
> >> approach. If we have promising solutions, I think we can make the 
> >> discussion
> >> forward.
> >
> > Yeah, if we can asynchronously resolve the distributed transactions
> > without worrying about max_prepared_foreign_transaction error, it
> > would be good. But we will need synchronous resolution at some point.
> > I think we at least need to discuss it at this point.
> >
> > I've attached the new version patch that incorporates the comments
> > from Fujii-san and Ikeda-san I got so far. We launch a resolver
> > process per foreign server, committing prepared foreign transactions
> > on foreign servers in parallel. To get a better performance based on
> > the current architecture, we can have multiple resolver processes per
> > foreign server but it seems not easy to tune it in practice. Perhaps
> > is it better if we simply have a pool of resolver processes and we
> > assign a resolver process to the resolution of one distributed
> > transaction one by one? That way, we need to launch resolver processes
> > as many as the concurrent backends using 2PC.
>
> Thanks for updating the patches.
>
> I have tested in my local laptop and summary is the following.

Thank you for testing!

>
> (1) The latest patch(v37) can improve throughput by 1.5 times compared to v36.
>
> Although I expected it improves by 2.0 times because the workload is that one
> transaction access two remote servers... I think the reason is that the disk
> is bottleneck and I couldn't prepare disks for each postgresql servers. If I
> could, I think the performance can be improved by 2.0 times.
>
>
> (2) The latest patch(v37) throughput of foreign_twophase_commit = required is
> about 36% compared to the case if foreign_twophase_commit = disabled.
>
> Although the throughput is improved, the absolute performance is not good. It
> may be the fate of 2PC. I think the reason is that the number of WAL writes is
> much increase and, the disk writes in my laptop is the bottleneck. I want to
> know the result testing in richer environments if someone can do so.
>
>
> (3) The latest patch(v37) has no overhead if foreign_twophase_commit =
> disabled. On the contrary, the performance improved by 3%. It may be within
> the margin of error.
>
>
>
> The test detail is following.
>
> # condition
>
> * 1 coordinator and 3 foreign servers
>
> * 4 instance shared one ssd disk.
>
> * one transaction queries different two foreign servers.
>
> ``` fxact_update.pgbench
> \set id random(1, 100)
>
> \set partnum 3
> \set p1 random(1, :partnum)
> \set p2 ((:p1 + 1) % :partnum) + 1
>
> BEGIN;
> UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
> UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
> COMMIT;
> ```
>
> * pgbench generates load. I increased ${RATE} little by little until "maximum
> number of foreign transactions reached" error happens.
>
> ```
> pgbench -f fxact_update.pgbench -R ${RATE} -c 8 -j 8 -T 180
> ```
>
> * parameters
> max_prepared_transactions = 100
> max_prepared_foreign_transactions = 200
> max_foreign_transaction_resolvers = 4
>
>
> # test source code patterns
>
> 1. 2pc patches(v36) based on 6d0eb385 (foreign_twophase_commit = required).
> 2. 2pc patches(v37) based on 2595e039 (foreign_twophase_commit = required).
> 3. 2pc patches(v37) based on 2595e039 (foreign_twophase_commit = disabled).
> 4. 2595e039 without 2pc patches(v37).
>
>
> # results
>
> 1. tps = 241.8000TPS
>latency average = 10.413ms
>
> 2. tps = 359.017519 ( by 1.5 times compared to 1. by 0.36% compared to 3.)
>latency average = 15.427ms
>
> 3. tps = 987.372220 ( by 1.03% compared to 4. )
>latency average = 8.102ms
>
> 4. tps = 955.984574
>latency average = 8.368ms
>
> The disk is the bottleneck in my environment because disk util is almost 100%
> in every pattern. If disks for each instance 

Introduce pg_receivewal gzip compression tests

2021-07-09 Thread Georgios
Hi,

As suggested on a different thread [1], pg_receivewal can increase its test
coverage. There exists a non-trivial amount of code that handles gzip
compression. The current patch introduces tests that cover the creation of
gzip-compressed WAL files and the handling of gzip partial segments. Finally,
the integrity of the compressed files is verified.

I hope you find this useful.

Cheers,
//Georgios

[1] 
https://www.postgresql.org/message-id/flat/ZCm1J5vfyQ2E6dYvXz8si39HQ2gwxSZ3IpYaVgYa3lUwY88SLapx9EEnOf5uEwrddhx2twG7zYKjVeuP5MwZXCNPybtsGouDsAD1o2L_I5E%3D%40pm.me

v1-0001-Introduce-pg_receivewal-gzip-compression-tests.patch
Description: Binary data


Re: [PATCH] Automatic HASH and LIST partition creation

2021-07-09 Thread Pavel Borisov
>
> > Thank you for your review!
> > I've rebased the patch and made the changes mentioned.
> > PFA v5.
>
> I've set this back to "needs review" in CF.
>
Thanks for the attention! I did the review of this patch, and the changes
I've introduced in v5 are purely cosmetic. So I'd suppose the
ready-for-committer status should not have been changed, and I'd like to
return it to ready-for-committer. If you object to this, please say so. The
opinion of Nitin, the second reviewer, is also very much appreciated.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: "debug_invalidate_system_caches_always" is too long

2021-07-09 Thread Noah Misch
On Thu, Jul 08, 2021 at 04:34:55PM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Wed, Jul 7, 2021 at 11:17 AM Tom Lane  wrote:
> >> Fair point.  What do you think of the alternative proposals
> >> "debug_flush_caches", "debug_discard_caches", etc?
> 
> > I like debug_discard_caches best.
> 
> I can live with that.  Anyone strongly against it?

I like it.




Re: [PATCH] Automatic HASH and LIST partition creation

2021-07-09 Thread John Naylor
On Fri, Jul 9, 2021 at 6:30 AM Pavel Borisov  wrote:

> Thank you for your review!
> I've rebased the patch and made the changes mentioned.
> PFA v5.

I've set this back to "needs review" in CF.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Automatic HASH and LIST partition creation

2021-07-09 Thread Pavel Borisov
>
> I have reviewed the v4 patch. The patch does not get applied on the latest
> source. Kindly rebase.
> However I have found few comments.
>
> 1.
> > +-- must fail because of wrong configuration
> > +CREATE TABLE tbl_hash_fail (i int) PARTITION BY HASH (i)
> > +CONFIGURATION (values in (1, 2), (3, 4) default partition tbl_default);
>
> Here some of the keywords are mentioned in UPPER CASE (Ex: CREATE TABLE,
> CONFIGURATION, etc) and some are mentioned in lower case (Ex: values in,
> default partition, etc). Kindly make it common. I feel making it to UPPER
> CASE is better. Please take care of this in all the cases.
>
> 2. It is better to separate the failure cases and success cases in
> /src/test/regress/sql/create_table.sql for better readability. All the
> failure cases can be listed first and then the success cases.
>
> 3.
> > +   char *part_relname;
> > +
> > +   /*
> > +* Generate partition name in the format:
> > +* $relname_$partnum
> > +* All checks of name validity will be made afterwards in
> DefineRelation()
> > +*/
> > +   part_relname = psprintf("%s_%d", cxt->relation->relname, i);
>
> The assignment can be done directly while declaring the variable.
>
Thank you for your review!
I've rebased the patch and made the changes mentioned.
PFA v5.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v5-0001-Automatically-generate-partitions-by-LIST-and-HAS.patch
Description: Binary data


Re: Preventing abort() and exit() calls in libpq

2021-07-09 Thread Noah Misch
On Sat, Jul 03, 2021 at 06:44:20PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sat, Jul 03, 2021 at 10:45:59AM -0400, Tom Lane wrote:
> >> What I'm now thinking about is restricting the test to only be run on
> >> platforms where use of foo.a libraries is deprecated, so that we can
> >> be pretty sure that we won't hit this situation.  Even if we only
> >> run the test on Linux, that'd be plenty to catch any mistakes.
> 
> > Hmm.  Static libraries are the rarer case on both AIX and Linux, but I'm not
> > aware of a relevant deprecation on either platform.  If it comes this to, 
> > I'd
> > be more inclined to control the Makefile rule with an environment variable
> > (e.g. ENFORCE_LIBC_CALL_RESTRICTIONS) instead of reacting to the platform.
> 
> That'd require buildfarm owner intervention, as well as intervention
> by users.  Which seems like exporting our problems onto them.  I'd
> really rather not go that way if we can avoid it.

I like that goal, though we'll have to see how difficult it proves.  As of
today, a GNU/Linux user building against static OpenLDAP will get a failure,
right?  That would export work onto that user, spuriously.  Since the non-AIX
user count dwarfs the AIX user count, expect a user complaint from non-AIX
first.

We'd get something like 95% of the value by running the test on one Windows
buildfarm member and one non-Windows buildfarm member.  If you did gate the
check on an environment variable, there would be no need to angle for broad
adoption.  Still, I agree avoiding that configuration step is nice, all else
being equal.  A strategy not having either of those drawbacks would be to skip
the test if libpq.so contains a definition of libpq_unbind().  If any other
dependency contains exit calls, we'd likewise probe for one symbol of that
library and skip the test if presence of that symbol reveals static linking.
(That's maintenance-prone in its own way, but a maintenance-free strategy has
not appeared.)




RE: [HACKERS] logical decoding of two-phase transactions

2021-07-09 Thread tanghy.f...@fujitsu.com
On Friday, July 9, 2021 2:56 PM Ajin Cherian wrote:
> 
> On Fri, Jul 9, 2021 at 9:13 AM Peter Smith  wrote:
> 
> > I tried the v95-0001 patch.
> >
> > - The patch applied cleanly and all build / testing was OK.
> > - The documentation also builds OK.
> > - I checked all v95-0001 / v93-0001 differences and found no problems.
> > - Furthermore, I noted that v95-0001 patch is passing the cfbot [1].
> >
> > So this patch LGTM.
> >
> 
> Applied, reviewed and tested the patch.
> Also ran a 5 level cascaded standby setup running a modified pgbench
> that does two phase commits and it ran fine.
> Did some testing using empty transactions and no issues found
> The patch looks good to me.

I did some cross version tests on patch v95 (publisher is PG14 and subscriber 
is PG15, or publisher is PG15 and subscriber is PG14; set two_phase option to 
on or off/default). It worked as expected, data could be replicated correctly.

Besides, I tested some scenarios using synchronized replication, it worked fine 
in my cases.

So this patch LGTM.

Regards
Tang


Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-09 Thread Masahiko Sawada
On Thu, Jul 8, 2021 at 10:40 PM Hannu Krosing  wrote:
>
> Very nice results.
>
> I have been working on the same problem but a bit different solution -
> a mix of binary search for (sub)pages and 32-bit bitmaps for
> tid-in-page.
>
> Even with current allocation heuristics (allocate 291 tids per page)
> it initially allocate much less space, instead of current 291*6=1746
> bytes per page it needs to allocate 80 bytes.
>
> Also it can be laid out so that it is friendly to parallel SIMD
> searches doing up to 8 tid lookups in parallel.

Interesting.

>
> That said, for allocating the tid array, the best solution is to
> postpone it as much as possible and to do the initial collection into
> a file, which
>
> 1) postpones the memory allocation to the beginning of index cleanups
>
> 2) lets you select the correct size and structure as you know more
> about the distribution at that time
>
> 3) do the first heap pass in one go and then advance frozenxmin
> *before* index cleanup

I think we have to do index vacuuming before heap vacuuming (2nd heap
pass). So do you mean that it advances relfrozenxid of pg_class before
both index vacuuming and heap vacuuming?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-09 Thread Masahiko Sawada
On Thu, Jul 8, 2021 at 7:51 AM Peter Geoghegan  wrote:
>
> On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan  wrote:
> > I wonder how much it would help to break up that loop into two loops.
> > Make the callback into a batch operation that generates state that
> > describes what to do with each and every index tuple on the leaf page.
> > The first loop would build a list of TIDs, then you'd call into
> > vacuumlazy.c and get it to process the TIDs, and finally the second
> > loop would physically delete the TIDs that need to be deleted. This
> > would mean that there would be only one call per leaf page per
> > btbulkdelete(). This would reduce the number of calls to the callback
> > by at least 100x, and maybe more than 1000x.
>
> Maybe for something like rtbm.c (which is inspired by Roaring
> bitmaps), you would really want to use an "intersection" operation for
> this. The TIDs that we need to physically delete from the leaf page
> inside btvacuumpage() are the intersection of two bitmaps: our bitmap
> of all TIDs on the leaf page, and our bitmap of all TIDs that need to
> be deleting by the ongoing btbulkdelete() call.

Agreed. In such a batch operation, what we need to do here is to
compute the intersection of two bitmaps.

>
> Obviously the typical case is that most TIDs in the index do *not* get
> deleted -- needing to delete more than ~20% of all TIDs in the index
> will be rare. Ideally it would be very cheap to figure out that a TID
> does not need to be deleted at all. Something a little like a negative
> cache (but not a true negative cache). This is a little bit like how
> hash joins can be made faster by adding a Bloom filter -- most hash
> probes don't need to join a tuple in the real world, and we can make
> these hash probes even faster by using a Bloom filter as a negative
> cache.

Agreed.

>
> If you had the list of TIDs from a leaf page sorted for batch
> processing, and if you had roaring bitmap style "chunks" with
> "container" metadata stored in the data structure, you could then use
> merging/intersection -- that has some of the same advantages. I think
> that this would be a lot more efficient than having one binary search
> per TID. Most TIDs from the leaf page can be skipped over very
> quickly, in large groups. It's very rare for VACUUM to need to delete
> TIDs from completely random heap table blocks in the real world (some
> kind of pattern is much more common).
>
> When this merging process finds 1 TID that might really be deletable
> then it's probably going to find much more than 1 -- better to make
> that cache miss take care of all of the TIDs together. Also seems like
> the CPU could do some clever prefetching with this approach -- it
> could prefetch TIDs where the initial chunk metadata is insufficient
> to eliminate them early -- these are the groups of TIDs that will have
> many TIDs that we actually need to delete. ISTM that improving
> temporal locality through batching could matter a lot here.

That's a promising approach.

In rtbm, a pair of one hash entry and one container is used per
block. Therefore, if there is no dead tuple in the block, we can skip
the TID from the leaf page just by checking the hash table. If there is
a hash entry, meaning the block has at least one dead tuple, we
can look up the offset of the TID from the leaf page in the container.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Grammar railroad diagram

2021-07-09 Thread Domingo Alvarez Duarte

Hello Andres !

There are two ways to achieve it:

-1 I just added the bison grammar in CocoR format here 
https://github.com/mingodad/CocoR-CPP in the examples folder.


-2 I created a small extension to Bison to do the same and published 
the patch here 
https://github.com/mingodad/bison/commit/da84329ebe5f4bc111ef34b2d46088b655a217f3 
(bison -e yourgrammar.y)


And optionally, to get the best railroad diagram, we need to add the 
"Tokens" manually.


Cheers !

On 9/7/21 4:36, Andres Freund wrote:

Hi,

On 2021-07-03 10:39:02 +0200, Domingo Alvarez Duarte wrote:

I've done a experimental tool to convert bison grammars to a kind of EBNF
understood by https://www.bottlecaps.de/rr/ui

It'd be nice if you could share that tool. The diagrams this can generate
are neat...

Greetings,

Andres Freund





Re: Incorrect usage of strtol, atoi for non-numeric junk inputs

2021-07-09 Thread Kyotaro Horiguchi
Thank you for the comments.

At Fri, 9 Jul 2021 10:29:07 +0900, Michael Paquier  wrote 
in 
> On Thu, Jul 08, 2021 at 05:30:23PM +0900, Kyotaro Horiguchi wrote:
> > [1] is trying to expose pg_strtoint16/32 to frontend, but I don't see
> > much point in doing that in conjunction with [2] or this thread. Since
> > the integral parameter values of pg-commands are in int, which the
> > exising function strtoint() is sufficient to read.  So even [2] itself
> > doesn't need to utilize [1].
> 
> It sounds sensible from here to just use strtoint(), some strtol(),
> son strtod() and call it a day as these are already available.

Thanks.

> > -wait_seconds = atoi(optarg);
> > +errno = 0;
> > +wait_seconds = strtoint(optarg, &endptr, 10);
> > +if (*endptr || errno == ERANGE || wait_seconds < 0)
> > +{
> > +pg_log_error("invalid timeout \"%s\"", optarg);
> > +exit(1);
> > +}
> > [ ... ]
> > -killproc = atol(argv[++optind]);
> > +errno = 0;
> > +killproc = strtol(argv[++optind], &endptr, 10);
> > +if (*endptr || errno == ERANGE || killproc < 0)
> > +{
> > +pg_log_error("invalid process ID \"%s\"", 
> > argv[optind]);
> > +exit(1);
> > +}
> 
> Er, wait.  We've actually allowed negative values for pg_ctl
> --timeout or the subcommand kill!?

For killproc, a leading minus sign suggests that it is a command line
option. Since pg_ctl doesn't have such an option, such a value is
recognized as an invalid *option*, even with the additional check.  The
additional check is useless in that sense. My intention is just to
make the restriction explicit, so I won't feel sad even if it should be
removed.

$ pg_ctl kill HUP -12345
pg_ctl: invalid option -- '1'

--timeout accepts values less than 1, which cause the command to
end with a timeout error before checking for the ending state. Not
destructive, but useless as a behavior.  We have two choices here. One
is simply prohibiting zero or negative timeouts, and the other is
allowing zero as a timeout and giving it the same meaning as
--no-wait. The former is the strictly correct behavior but the latter is
casual and convenient. I took the former way in this version.

$ pg_ctl -w -t 0 start
pg_ctl: error: invalid timeout value "0", use --no-wait to finish without 
waiting

The same message is shown for non-decimal values, but that should not matter.

> >  case 'j':
> > -user_opts.jobs = atoi(optarg);
> > +errno = 0;
> > +user_opts.jobs = strtoint(optarg, , 10);
> > +/**/
> > +if (*endptr || errno == ERANGE)
> > +pg_fatal("invalid number of jobs %s\n", optarg);
> > +
> >  break;
> 
> This one in pg_upgrade is incomplete.  Perhaps the missing comment
> should tell that negative job values are checked later on?

Zero or negative job numbers mean non-parallel mode and don't lead to
an error.  If we don't need to preserve that behavior (I personally
don't think it is either useful or that changing it would break someone's
existing usage), it is sensible to do the range check here.

I noticed that I overlooked PGCTLTIMEOUT.

The attached:

- disallows less-than-one numbers of jobs in pg_upgrade
- disallows a less-than-one timeout in pg_ctl
- uses strtoint for PGCTLTIMEOUT in pg_ctl

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 0818de16f9febea3d90ae0404b4f5b3643f6cbe0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 8 Jul 2021 15:08:01 +0900
Subject: [PATCH v2 1/2] Be strict in numeric parameters on command line

Some numeric command line parameters are tolerant of valid values
followed by garbage like "123xyz".  Be strict to reject such invalid
values. Do the same for psql meta command parameters.
---
 src/bin/pg_amcheck/pg_amcheck.c|  6 ++-
 src/bin/pg_basebackup/pg_basebackup.c  | 13 +++--
 src/bin/pg_basebackup/pg_receivewal.c  | 18 +--
 src/bin/pg_basebackup/pg_recvlogical.c | 17 +--
 src/bin/pg_checksums/pg_checksums.c|  7 ++-
 src/bin/pg_ctl/pg_ctl.c| 18 ++-
 src/bin/pg_dump/pg_dump.c  | 39 ---
 src/bin/pg_dump/pg_restore.c   | 17 ---
 src/bin/pg_upgrade/option.c| 21 ++--
 src/bin/pgbench/pgbench.c  | 66 --
 src/bin/psql/command.c | 52 ++--
 src/bin/scripts/reindexdb.c| 10 ++--
 src/bin/scripts/vacuumdb.c | 23 +
 13 files changed, 219 insertions(+), 88 deletions(-)

diff --git a/src/bin/pg_amcheck/pg_amcheck.c b/src/bin/pg_amcheck/pg_amcheck.c
index 4bde16fb4b..71a82f9b75 100644
--- a/src/bin/pg_amcheck/pg_amcheck.c
+++ b/src/bin/pg_amcheck/pg_amcheck.c

Re: Teach pg_receivewal to use lz4 compression

2021-07-09 Thread gkokolatos



‐‐‐ Original Message ‐‐‐

On Friday, July 9th, 2021 at 04:49, Michael Paquier  wrote:

> On Thu, Jul 08, 2021 at 02:18:40PM +, gkokola...@pm.me wrote:
>
> > please find v2 of the patch which tries to address the commends
> >
> > received so far.
>
> Thanks!
>
> > Michael Paquier wrote:
> >
> > > -   system_or_bail('lz4', '-t', $lz4_wals[0]);
> > >
> > > I think that you should just drop this part of the test. The only
> > >
> > > part of LZ4 that we require to be present when Postgres is built with
> > >
> > > --with-lz4 is its library liblz4. Commands associated to it may not
> > >
> > > be around, causing this test to fail. The test checking that one .lz4
> > >
> > > file has been created is good to have. It may be worth adding a test
> > >
> > > with a .lz4.partial segment generated and --endpos pointing to a LSN
> > >
> > > that does not finish the segment that gets switched.
> >
> > I humbly disagree with the need for the test. It is rather easily possible
> >
> > to generate a file that can not be decoded, thus becoming useless. Having 
> > the
> >
> > test will provide some guarantee for the fact. In the current patch, there
> >
> > is code to find out if the program lz4 is available in the system. If it is
> >
> > not available, then that specific test is skipped. The rest remains as it
> >
> > were. Also `system_or_bail` is not used anymore in favour of the 
> > `system_log`
> >
> > so that the test counted and the execution of tests continues upon failure.
>
> Check. I can see what you are using in the new patch. We could live
>
> with that.

Great. Thank you.

>
> > > It seems to me that you are missing some logic in
> > >
> > > FindStreamingStart() to handle LZ4-compressed segments, in relation
> > >
> > > with IsCompressXLogFileName() and IsPartialCompressXLogFileName().
> >
> > Very correct. The logic is now added. Given the lz4 api, one has to fill
> >
> > in the uncompressed size during creation time. Then one can read the
> >
> > headers and verify the expectations.
>
> Yeah, I read that as well with lz4 --list and the kind. That's weird
>
> compared to how ZLIB gives an easy access to it. We may want to do an
>
> effort and tell about more lz4 --content-size/--list, telling that we
>
> add the size in the segment compressed because we have to and LZ4 does
>
> not do it by default?

I am afraid I do not follow. In the patch we do add the uncompressed size,
and then the uncompressed size is checked against a known value, WalSegSz.
What would the compressed size be checked against?

>
> > > Should we have more tests for ZLIB, while on it? That seems like a
> > >
> > > good addition as long as we can skip the tests conditionally when
> > >
> > > that's not supported.
> >
> > Agreed. Please allow me to provide a distinct patch for this code.
>
> Thanks. Looking forward to seeing it. That may be better on a
>
> separate thread, even if I digressed in this thread :)

Thank you for the comments. I will sent in a separate thread.

>
> > > I think we can somehow use "acceleration" parameter of lz4 compression
> > >
> > > to map on compression level, It is not direct mapping but
> > >
> > > can't we create some internal mapping instead of completely ignoring
> > >
> > > this option for lz4, or we can provide another option for lz4?
> >
> > We can, though I am not in favour of doing so. There is seemingly
> >
> > little benefit for added complexity.
>
> Agreed.
>
> > > What I think is important for the user when it comes to this
> > >
> > > option is the consistency of its naming across all the tools
> > >
> > > supporting it. pg_dump and pg_basebackup, where we could plug LZ4,
> > >
> > > already use most of the short options you could use for pg_receivewal,
> > >
> > > having only a long one gives a bit more flexibility.
> >
> > Done.
>
> * http://www.zlib.org/rfc-gzip.html.
>
> -   -   For lz4 compressed segments
>
> */
>
> This comment is incomplete.

It is. I will fix.

>
> +#define IsLZ4CompressXLogFileName(fname) \
> +    (strlen(fname) == XLOG_FNAME_LEN + strlen(".lz4") && \
> +     strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN && \
> +     strcmp((fname) + XLOG_FNAME_LEN, ".lz4") == 0)
>
> +#define IsLZ4PartialCompressXLogFileName(fname) \
> +    (strlen(fname) == XLOG_FNAME_LEN + strlen(".lz4.partial") && \
> +     strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN && \
> +     strcmp((fname) + XLOG_FNAME_LEN, ".lz4.partial") == 0)
>
> This is getting complicated. Would it be better to change this stuff
>
> and switch to a routine that checks if a segment has a valid name, is
>
> partial, and the type of compression that applied to it? It seems to
>
> me that we should group iszlibcompress and islz4compress together with
>
> the options available through compression_method.

I agree with you. I will refactor.


> -   if (compresslevel != 0)
> -   {
> - if (compression_method == COMPRESSION_NONE)
>

psql tab auto-complete for CREATE PUBLICATION

2021-07-09 Thread Peter Smith
I found that the psql tab auto-complete was not working for some cases
of CREATE PUBLICATION [1].

CREATE PUBLICATION name
[ FOR TABLE [ ONLY ] table_name [ * ] [, ...]
  | FOR ALL TABLES ]
[ WITH ( publication_parameter [= value] [, ... ] ) ]

~~~

For example, the following scenarios did not work as I was expecting:

"create publication mypub for all tables " TAB --> expected complete
with "WITH ("

"create publication mypub for all ta" TAB --> expected complete with "TABLES"

"create publication mypub for all tables w" TAB --> expected complete
with "WITH ("

"create publication mypub for table mytable " TAB --> expected
complete with "WITH ("

~~~

PSA a small patch which seems to improve at least the
aforementioned cases.

Now results are:

"create publication mypub for all tables " TAB --> "create publication
mypub for all tables WITH ( "

"create publication mypub for all ta" TAB --> "create publication
mypub for all tables "

"create publication mypub for all tables w" TAB --> "create
publication mypub for all tables with ( "

"create publication mypub for table mytable " TAB --> "create
publication mypub for table mytable WITH ( "

--
[1] https://www.postgresql.org/docs/devel/sql-createpublication.html

Kind Regards,
Peter Smith
Fujitsu Australia


v1-0001-more-auto-complete-for-CREATE-PUBLICATION.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-09 Thread Masahiko Sawada
On Fri, Jul 9, 2021 at 2:37 PM Andres Freund  wrote:
>
> Hi,
>
> On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
> > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > > 1. Don't allocate more than 1GB. There was a discussion to eliminate
> > > this limitation by using MemoryContextAllocHuge() but there were
> > > concerns about point 2[1].
> > >
> > > 2. Allocate the whole memory space at once.
> > >
> > > 3. Slow lookup performance (O(logN)).
> > >
> > > I’ve done some experiments in this area and would like to share the
> > > results and discuss ideas.
> >
> > Yea, this is a serious issue.
> >
> >
> > 3) could possibly be addressed to a decent degree without changing the
> > fundamental datastructure too much. There's some sizable and trivial
> > wins by just changing vac_cmp_itemptr() to compare int64s and by using
> > an open coded bsearch().
>
> Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
> machine.
>
> Another thing I just noticed is that you didn't include the build times for 
> the
> datastructures. They are lower than the lookups currently, but it does seem
> like a relevant thing to measure as well. E.g. for #1 I see the following 
> build
> times
>
> array     24.943 ms
> tbm      206.456 ms
> intset    93.575 ms
> vtbm     134.315 ms
> rtbm     145.964 ms
>
> that's a significant range...

Good point. I got similar results when measuring on my machine:

array 57.987 ms
tbm 297.720 ms
intset 113.796 ms
vtbm 165.268 ms
rtbm 199.658 ms

>
> Randomizing the lookup order (using a random shuffle in
> generate_index_tuples()) changes the benchmark results for #1 significantly:
>
>           shuffled time    unshuffled time
> array      6551.726 ms       6478.554 ms
> intset    67590.879 ms      10815.810 ms
> rtbm      17992.487 ms       2518.492 ms
> tbm         364.917 ms        360.128 ms
> vtbm      12227.884 ms       1288.123 ms

I believe that in your test, tbm_reaped() actually always returned
true. That could explain why tbm was very fast in both cases. Since
TIDBitmap in core doesn't support the existence check, tbm_reaped()
in bdbench.c always returns true. I added a patch to the repository to
add existence-check support to TIDBitmap, although it assumes the bitmap
is never lossy.

That being said, I'm surprised that rtbm is slower than array even in
the unshuffled case. I've also measured the shuffled cases and got
different results. To be clear, I used the prepare() SQL function to
prepare both virtual dead tuples and index tuples, loaded them with the
attach_dead_tuples() SQL function, and executed the bench() SQL function
for each data structure. Here are the results:

          shuffled time    unshuffled time
array      88899.513 ms      12616.521 ms
intset     73476.055 ms      10063.405 ms
rtbm       22264.671 ms       2073.171 ms
tbm        10285.092 ms       1417.312 ms
vtbm       14488.581 ms       1240.666 ms

>
> FWIW, I get an assertion failure when using an assertion build:
>
> #2  0x561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 
> "found", errorType=0x7f9115a88d11 "FailedAssertion",
> fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at 
> /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
> #3  0x7f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, 
> offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
> #4  0x7f9115a8363d in load_rtbm (rtbm=0x561806293280, 
> itemptrs=0x7f908a203050, nitems=1000) at bdbench.c:618
> #5  0x7f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 
> , nitems=1000, minblk=2139062143, maxblk=2139062143, 
> maxoff=32639)
> at bdbench.c:587
> #6  0x7f9115a83837 in attach (lvtt=0x7f9115a8c300 , 
> nitems=1000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
> at bdbench.c:658
> #7  0x7f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at 
> bdbench.c:873
>
> I assume you just inverted the Assert(found) assertion?

Right. Fixed it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-07-09 Thread David Rowley
On Thu, 8 Jul 2021 at 05:44, David Christensen
 wrote:
> Enclosed is the patch to change the return type to numeric, as well as one 
> for expanding units to
> add PB and EB.

I ended up not changing the return type of pg_size_bytes().

> I figured that PB and EB are probably good enough additions at this point, so 
> we can debate whether
> to add the others.

Per Tom's concern both with changing the return type of
pg_size_bytes() and his and my concern about going too far adding more
units, I've adjusted your patch to only add petabytes and pushed it.
The maximum range of BIGINT is only 8 exabytes, so the BIGINT version
would never show in exabytes anyway. It would still be measuring in
petabytes at the 64-bit range limit.
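
To illustrate, even the largest possible bigint input still comes out in
petabytes; a quick sketch (the exact rounded figure isn't the point):

SELECT pg_size_pretty(9223372036854775807::bigint);
-- roughly 8192 PB, still reported in petabytes rather than exabytes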

After a bit of searching, I found reports estimating the entire
stored digital data on Earth as of 2020 to be 59 zettabytes, or about
0.06 yottabytes.  I feel like we've gone far enough by adding
petabytes today. Maybe that's worth revisiting in a couple of storage
generations. After we're done there, we can start working on the LSN
wraparound code.

David




Re: [HACKERS] logical decoding of two-phase transactions

2021-07-09 Thread Ajin Cherian
On Fri, Jul 9, 2021 at 9:13 AM Peter Smith  wrote:

> I tried the v95-0001 patch.
>
> - The patch applied cleanly and all build / testing was OK.
> - The documentation also builds OK.
> - I checked all v95-0001 / v93-0001 differences and found no problems.
> - Furthermore, I noted that v95-0001 patch is passing the cfbot [1].
>
> So this patch LGTM.
>

Applied, reviewed and tested the patch.
Also ran a 5 level cascaded standby setup running a modified pgbench
that does two phase commits and it ran fine.
Did some testing using empty transactions and no issues found
The patch looks good to me.

regards,
Ajin Cherian




Re: Added schema level support for publication.

2021-07-09 Thread Greg Nancarrow
On Fri, Jul 9, 2021 at 1:28 PM houzj.f...@fujitsu.com
 wrote:
>
> Currently, postgres caches publication actions info in the
> RelationData::rd_pubactions, but after applying the patch, it seems
> rd_pubactions is not initialized when using schema level publication.
>
> It cound result in some unexpected behaviour when checking if command can be
> executed with current replica identity.
>

While testing this patch, I'm finding that for a FOR SCHEMA
publication, UPDATEs and DELETEs on tables belonging to that schema
are not getting replicated (but INSERTs and TRUNCATEs are).
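
A minimal sketch of the scenario, assuming the patch's FOR SCHEMA syntax
and with placeholder names (the subscriber side is set up in the usual
way and omitted here):

CREATE SCHEMA sch1;
CREATE TABLE sch1.t1 (a int PRIMARY KEY);
CREATE PUBLICATION pub_sch1 FOR SCHEMA sch1;
INSERT INTO sch1.t1 VALUES (1);         -- gets replicated
UPDATE sch1.t1 SET a = 2 WHERE a = 1;   -- does not get replicated
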
Could this be related to the issues that Hou-san has identified?

Regards,
Greg Nancarrow
Fujitsu Australia




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-09 Thread Masahiko Sawada
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund  wrote:
>
> Hi,
>
>
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > 1. Don't allocate more than 1GB. There was a discussion to eliminate
> > this limitation by using MemoryContextAllocHuge() but there were
> > concerns about point 2[1].
> >
> > 2. Allocate the whole memory space at once.
> >
> > 3. Slow lookup performance (O(logN)).
> >
> > I’ve done some experiments in this area and would like to share the
> > results and discuss ideas.
>
> Yea, this is a serious issue.
>
>
> 3) could possibly be addressed to a decent degree without changing the
> fundamental datastructure too much. There's some sizable and trivial
> wins by just changing vac_cmp_itemptr() to compare int64s and by using
> an open coded bsearch().
>
> The big problem with bsearch isn't imo the O(log(n)) complexity - it's
> that it has abominably bad cache locality. And that can be addressed:
> https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
>
> Imo 2) isn't really that hard a problem to improve, even if we were to
> stay with the current bsearch approach. Reallocation with an aggressive
> growth factor or such isn't that bad.
>
>
> That's not to say we ought to stay with binary search...
>
>
>
> > Problems Solutions
> > ===
> >
> > Firstly, I've considered using existing data structures:
> > IntegerSet(src/backend/lib/integerset.c)  and
> > TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
> > only either point 2 or 3. IntegerSet uses lower memory thanks to
> > simple-8b encoding but is slow at lookup, still O(logN), since it’s a
> > tree structure. On the other hand, TIDBitmap has a good lookup
> > performance, O(1), but could unnecessarily use larger memory in some
> > cases since it always allocates the space for bitmap enough to store
> > all possible offsets. With 8kB blocks, the maximum number of line
> > pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
> > bitmap is 40 bytes long and we always need 46 bytes in total per block
> > including other meta information.
>
> Imo tidbitmap isn't particularly good, even in the current use cases -
> it's constraining in what we can store (a problem for other AMs), not
> actually that dense, the lossy mode doesn't choose what information to
> loose well etc.
>
> It'd be nice if we came up with a datastructure that could also replace
> the bitmap scan cases.

Agreed.

>
>
> > The data structure is somewhat similar to TIDBitmap. It consists of
> > the hash table and the container area; the hash table has entries per
> > block and each block entry allocates its memory space, called a
> > container, in the container area to store its offset numbers. The
> > container area is actually an array of bytes and can be enlarged as
> > needed. In the container area, the data representation of offset
> > numbers varies depending on their cardinality. It has three container
> > types: array, bitmap, and run.
>
> Not a huge fan of encoding this much knowledge about the tid layout...
>
>
> > For example, if there are two dead tuples at offset 1 and 150, it uses
> > the array container that has an array of two 2-byte integers
> > representing 1 and 150, using 4 bytes in total. If we used the bitmap
> > container in this case, we would need 20 bytes instead. On the other
> > hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
> > uses the run container that has an array of 2-byte integers. The first
> > value in each pair represents a starting offset number, whereas the
> > second value represents its length. Therefore, in this case, the run
> > container uses only 4 bytes in total. Finally, if there are dead
> > tuples at every other offset from 1 to 100, it uses the bitmap
> > container that has an uncompressed bitmap, using 13 bytes. We need
> > another 16 bytes per block entry for hash table entry.
> >
> > The lookup complexity of a bitmap container is O(1) whereas the one of
> > an array and a run container is O(N) or O(logN), but since the number of
> > elements in those two containers should not be large, it would not be a
> > problem.
>
> Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
> many tuples have been deleted. In cases where the "run" storage is
> cheaper (e.g. because there's high offset numbers due to HOT pruning),
> we could end up regularly scanning a few hundred entries for a
> match. That's not cheap anymore.

With 8kB blocks, the maximum size of a bitmap container is 37 bytes.
IOW, the other two container types are always smaller than 37 bytes.
Since the run container uses 4 bytes per run, the number of runs in a
run container can never be more than 9. Even with 32kB blocks, we don't
have more than 37 runs. So I think N is small enough in this case.
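
To make the constant factor concrete, here is a hypothetical sketch (the
names and exact layout are mine, not necessarily the PoC's) of a membership
test on a run container stored as (start, length) pairs of 2-byte offset
numbers; with at most ~9 runs the scan touches only a handful of entries
that fit within a single cache line:

#include <stdbool.h>
#include <stdint.h>

typedef struct RunPair
{
	uint16_t	start;			/* first offset number in the run */
	uint16_t	length;			/* number of consecutive offsets */
} RunPair;

static bool
run_container_lookup(const RunPair *runs, int nruns, uint16_t offnum)
{
	for (int i = 0; i < nruns; i++)
	{
		if (offnum < runs[i].start)
			return false;		/* runs are sorted by start; we went past it */
		if (offnum < runs[i].start + runs[i].length)
			return true;
	}
	return false;
}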

>
>
> > Evaluation
> > 
> >
> > Before implementing this idea and integrating it with lazy vacuum
> > code, I've implemented a benchmark tool dedicated to evaluating
> > lazy_tid_reaped() 

Re: Transactions involving multiple postgres foreign servers, take 2

2021-07-09 Thread Fujii Masao




On 2021/06/30 10:05, Masahiko Sawada wrote:

I've attached the new version patch that incorporates the comments
from Fujii-san and Ikeda-san I got so far.


Thanks for updating the patches!

I'm now reading the 0001 and 0002 patches and wondering if we can commit
them first, because they just provide an independent basic mechanism for
foreign transaction management.

One question regarding them: why did we add the new API only for "top" foreign
transactions? Even with those patches, the old API (CallSubXactCallbacks) is
still being used for foreign subtransactions, and xact_depth is still managed
in the postgres_fdw layer (not PostgreSQL core). Is this intentional?
Sorry if this was already discussed before.
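
For reference, the "old API" here is the core subtransaction callback
mechanism: an FDW registers a callback via RegisterSubXactCallback() and the
core invokes it through CallSubXactCallbacks() at each subtransaction event.
A much-simplified sketch of that shape follows (the function names below are
made up; postgres_fdw's real pgfdw_subxact_callback() does considerably more,
such as issuing RELEASE SAVEPOINT / ROLLBACK TO SAVEPOINT on the remote
connections):

#include "postgres.h"

#include "access/xact.h"

static void
my_fdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
						SubTransactionId parentSubid, void *arg)
{
	switch (event)
	{
		case SUBXACT_EVENT_PRE_COMMIT_SUB:
			/* e.g. send RELEASE SAVEPOINT to the open remote connections */
			break;
		case SUBXACT_EVENT_ABORT_SUB:
			/* e.g. send ROLLBACK TO SAVEPOINT to the remote connections */
			break;
		default:
			break;
	}
}

static void
my_fdw_register_callbacks(void)
{
	/* registered once, e.g. when the first remote connection is opened */
	RegisterSubXactCallback(my_fdw_subxact_callback, NULL);
}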

As far as I read the code, keeping the old API for foreign subtransactions
doesn't cause any actual bug. But it's just strange and half-baked to manage
top and sub transactions in different layers and to use the old and new APIs
for them.

OTOH, I'm afraid that adding a new (non-essential) API for foreign
subtransactions might increase the code complexity unnecessarily.

Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Outdated comments about proc->sem in lwlock.c

2021-07-09 Thread Thomas Munro
On Thu, Jul 8, 2021 at 8:48 AM Daniel Gustafsson  wrote:
> > On 3 Jun 2021, at 04:07, Thomas Munro  wrote:
> > Here's a patch to remove the misleading comments.
>
> While not an expert in the area; reading the referenced commit and the code
> with the now removed comments, I think this is correct.

Thanks!  I made the comments slightly more uniform and pushed.




Re: when the startup process doesn't (logging startup delays)

2021-07-09 Thread Amul Sul
A few comments on the v4 patch:

@@ -7351,6 +7363,8 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
LSN_FORMAT_ARGS(ReadRecPtr))));

+   InitStartupProgress();
+
/*
 * main redo apply loop
 */
@@ -7358,6 +7372,8 @@ StartupXLOG(void)
{
bool        switchedTLI = false;

+   LogStartupProgress(RECOVERY_IN_PROGRESS, NULL);
+
 #ifdef WAL_DEBUG
if (XLOG_DEBUG ||
(rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
@@ -7569,6 +7585,8 @@ StartupXLOG(void)
 * end of main redo apply loop
 */

+   CloseStartupProgress(RECOVERY_END);

I am not sure I am getting the code flow correctly. Judging by its name,
CloseStartupProgress() seems to correspond to InitStartupProgress(), but what
it does is similar to LogStartupProgress(). I think it should be renamed to be
in line with LogStartupProgress(), IMO.
---

+
+   /* Return if any invalid operation */
+   if (operation >= SYNCFS_END)
+   return;

+   /* Return if any invalid operation */
+   if (operation < SYNCFS_END)
+   return;
+

This part should be an assertion; it's the developer's responsibility to call
it correctly.
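
Something along these lines, as a minimal sketch (assuming the patch's enum
layout where the *_IN_PROGRESS codes come before SYNCFS_END, and guessing the
signature from the call sites in the hunks above):

void
LogStartupProgress(StartupProcessOp operation, const char *path)
{
	/* a bogus operation code is a programming error, not a runtime condition */
	Assert(operation < SYNCFS_END);

	/* ... existing interval bookkeeping and ereport() calls ... */
}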
---

+/*
+ * Codes of the operations performed during startup process
+ */
+typedef enum StartupProcessOp
+{
+   /* Codes for in-progress operations */
+   SYNCFS_IN_PROGRESS,
+   FSYNC_IN_PROGRESS,
+   RECOVERY_IN_PROGRESS,
+   RESET_UNLOGGED_REL_IN_PROGRESS,
+   /* Codes for end of operations */
+   SYNCFS_END,
+   FSYNC_END,
+   RECOVERY_END,
+   RESET_UNLOGGED_REL_END
+} StartupProcessOp;
+

Since we do have separate calls for the in-progress operation and the end
operation, a single enum value per operation would have been enough. If we do
this, then I think we should remove get_startup_process_operation_string() and
move the messages into the respective functions.
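
A rough sketch of that variant (the names and exact set of codes below are
assumed, not taken from the patch); the in-progress/end distinction is then
carried by which logging function is called, and each function owns its own
message wording:

typedef enum StartupProcessOp
{
	STARTUP_OP_SYNCFS,
	STARTUP_OP_FSYNC,
	STARTUP_OP_RECOVERY,
	STARTUP_OP_RESET_UNLOGGED_REL
} StartupProcessOp;

/* logs the "... in progress" style message, throttled by the interval */
extern void LogStartupProgress(StartupProcessOp op, const char *path);

/* logs the corresponding "... done" message */
extern void LogStartupProgressComplete(StartupProcessOp op);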
---

Also, with your patch "make check-world" has a few failures; kindly check that.

Regards,
Amul


On Mon, Jun 21, 2021 at 12:06 PM Nitin Jadhav
 wrote:
>
> > What is DUMMY about ?  If you just want to separate the "start" from "end",
> > you could write:
> >
> > /* codes for start of operations */
> > FSYNC_IN_PROGRESS
> > SYNCFS_IN_PROGRESS
> > ...
> > /* codes for end of operations */
> > FSYNC_END
> > SYNCFS_END
> > ...
>
> That was by mistake and I have corrected it in the attached patch.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Thu, Jun 17, 2021 at 6:22 PM Justin Pryzby  wrote:
> >
> > + * Codes of the operations performed during startup process
> > + */
> > +typedef enum StartupProcessOp
> > +{
> > +   SYNCFS_IN_PROGRESS,
> > +   FSYNC_IN_PROGRESS,
> > +   RECOVERY_IN_PROGRESS,
> > +   RESET_UNLOGGED_REL_IN_PROGRESS,
> > +   DUMMY,
> > +   SYNCFS_END,
> > +   FSYNC_END,
> > +   RECOVERY_END,
> > +   RESET_UNLOGGED_REL_END
> > +} StartupProcessOp;
> >
> > What is DUMMY about ?  If you just want to separate the "start" from "end",
> > you could write:
> >
> > /* codes for start of operations */
> > FSYNC_IN_PROGRESS
> > SYNCFS_IN_PROGRESS
> > ...
> > /* codes for end of operations */
> > FSYNC_END
> > SYNCFS_END
> > ...
> >
> > Or group them together like:
> >
> > FSYNC_IN_PROGRESS,
> > FSYNC_END,
> > SYNCFS_IN_PROGRESS,
> > SYNCFS_END,
> > RECOVERY_IN_PROGRESS,
> > RECOVERY_END,
> > RESET_UNLOGGED_REL_IN_PROGRESS,
> > RESET_UNLOGGED_REL_END,
> >
> > --
> > Justin