Re: Incorrect snapshots while promoting hot standby node when 2PC is used

2021-05-03 Thread Andrey Borodin



> On 3 May 2021, at 23:10, Andres Freund wrote:
> 
> Hi,
> 
> On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
>> I'm investigating a somewhat resemblant case.
>> We have an OLTP sharded installation where shards are almost always under 
>> rebalancing. Data movement is implemented with 2pc.
>> Switchover happens quite often due to datacenter drills. The installation is 
>> running on PostgreSQL 12.6.
> 
> If you still have the data it would be useful if you could check if the
> LSNs of the corrupted pages are LSNs from shortly after standby
> promotion/switchover?
That's a neat idea; I'll check if I can restore a backup with corruptions.
I have a test cluster with corruptions, but it has undergone tens of 
switchovers...

>> Or, perhaps, it looks more like a hardware problem? Data_checksums are
>> on, but a few years ago we observed ssd firmware that was losing
>> updates, but passing checksums. I'm sure that we would benefit from
>> having separate relation fork for checksums or LSNs.
> 
> Right - checksums are "page local". They can only detect if a page is
> corrupted, not if e.g. an older version of the page (with correct
> checksum) has been restored. While there are schemes to have stronger
> error detection properties, they do come with substantial overhead (at
> least the ones I can think of right now).

We can have a PTRACK-like fork with page LSNs. It can be flushed on checkpoint 
and restored from WAL on crash. So we can always detect a stale page version. 
Like LSN-track :) We will have much faster rewind and delta-backups for free.
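
As a rough illustration of the detection this would enable (everything here is
hypothetical -- no such fork exists today, and lsn_fork_get_expected_lsn() is
an invented name):

static bool
page_version_is_stale(Page page, BlockNumber blkno)
{
	/* Hypothetical: look up the last LSN we ever wrote for this block
	 * from the separate LSN-tracking fork. */
	XLogRecPtr	expected = lsn_fork_get_expected_lsn(blkno);

	/*
	 * If the page on disk carries an older LSN than that, an update was
	 * lost even though the per-page checksum still verifies.
	 */
	return PageGetLSN(page) < expected;
}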

Though I don't think it's worth the effort until we at least checksum CLOG.

Thanks!

Best regards, Andrey Borodin.



Re: Replication slot stats misgivings

2021-05-03 Thread vignesh C
On Tue, May 4, 2021 at 9:48 AM Masahiko Sawada  wrote:
>
> On Mon, May 3, 2021 at 10:21 PM Amit Kapila  wrote:
> >
> > On Mon, May 3, 2021 at 5:48 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, May 3, 2021 at 2:27 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > >
> > > > > > > I am not sure if any of these alternatives are a good idea. What 
> > > > > > > do
> > > > > > > you think? Do you have any other ideas for this?
> > > > > >
> > > > > > I've been considering some ideas but haven't come up with a good one
> > > > > > yet. It’s just an idea and not tested but how about having
> > > > > > CreateDecodingContext() register a before_shmem_exit() callback with
> > > > > > the decoding context to ensure that we send slot stats even on
> > > > > > interruption. And FreeDecodingContext() cancels the callback.
> > > > > >
> > > > >
> > > > > Is it a good idea to send stats while exiting and rely on the same? I
> > > > > think before_shmem_exit is mostly used for the cleanup purpose so not
> > > > > sure if we can rely on it for this purpose. I think we can't be sure
> > > > > that in all cases we will send all stats, so maybe Vignesh's patch is
> > > > > sufficient to avoid losing stats in cases where we would have sent a
> > > > > large amount of data.
> > > > >
> > > >
> > > > Sawada-San, any thoughts on this point?
> > >
> > > before_shmem_exit is mostly used for cleanup purposes, but how about
> > > on_shmem_exit()? pgstats relies on that to send stats on
> > > interruption. See pgstat_shutdown_hook().
> > >
> >
> > Yeah, that is worth trying. Would you like to give it a try?
>
> Yes.
>
> In this approach, I think we will need to have a static pointer in
> logical.c pointing to LogicalDecodingContext that we’re using. At
> StartupDecodingContext(), we set the pointer to the just created
> LogicalDecodingContext and register the callback so that we can refer
> to the LogicalDecodingContext on that callback. And at
> FreeDecodingContext(), we reset the pointer to NULL (however, since
> FreeDecodingContext() is not called when an error happens we would
> need to ensure resetting it somehow). But, after more thought, if we
> have the static pointer in logical.c it would be better to have
> a global function that sends slot stats based on the
> LogicalDecodingContext pointed by the static pointer and can be called
> by ReplicationSlotRelease(). That way, we don’t need to worry about
> erroring out cases as well as interruption cases, although we need to
> have a new static pointer.
>
> I've attached a quick-hacked patch. I also incorporated the change
> that calls UpdateDecodingStats() at FreeDecodingContext() so that we
> can send slot stats also in the case where we spilled/streamed changes
> but finished without commit/abort/prepare record.
>
> >  I think
> > it still might not cover the cases where we error out in the backend
> > while decoding via APIs because at that time we won't exit, maybe for
> > that we can consider Vignesh's patch.
>
> Agreed. It seems to me that the approach of the attached patch is
> better than the approach using on_shmem_exit(). So if we want to avoid
> having the new static pointer and function for this purpose we can
> consider Vignesh’s patch.
>

I'm ok with using either my patch or Sawada-san's patch. I had the same
doubt about whether we should have a static variable, as pointed out by
Sawada-san. Apart from that I had one minor comment: this comment needs
to be corrected ("andu sed to sent"):
+/*
+ * Pointing to the currently-used logical decoding context andu sed to sent
+ * slot statistics on releasing slots.
+ */
+static LogicalDecodingContext *MyLogicalDecodingContext = NULL;
+

Regards,
Vignesh




Re: AlterSubscription_refresh "wrconn" wrong variable?

2021-05-03 Thread Peter Smith
On Tue, May 4, 2021 at 2:31 PM Andres Freund  wrote:
>
> Hi,
>
> On 2021-05-04 09:29:42 +1000, Peter Smith wrote:
> > While reviewing some logical replication code I stumbled across a
> > variable usage that looks suspicious to me.
>
> > Note that the AlterSubscription_refresh function (unlike other
> > functions in the subscriptioncmds.c) is using the global variable
> > "wrconn" instead of a local stack variable of the same name. I was
> > unable to think of any good reason why it would be deliberately doing
> > this, so my guess is that it is simply an accidental mistake that has
> > gone unnoticed because the compiler was silently equally happy just
> > using the global var.
>
> > Apparently, this is not causing any reported problems because it seems
> > like the code has been this way for ~4 years [1].
>
> This sounded vaguely familiar. After a bit of searching I found that's
> because I debugged a crash related to it:
> https://www.postgresql.org/message-id/2020215820.qihhrz7fayu6myfi%40alap3.anarazel.de
>

Oh! No wonder it sounded familiar.

It looks like I've just re-discovered the identical problem 5 months
after your post.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: AlterSubscription_refresh "wrconn" wrong variable?

2021-05-03 Thread Peter Smith
On Tue, May 4, 2021 at 1:56 PM Bharath Rupireddy
 wrote:
>
> On Tue, May 4, 2021 at 5:00 AM Peter Smith  wrote:
> >
> > While reviewing some logical replication code I stumbled across a
> > variable usage that looks suspicious to me.
> >
> > Note that the AlterSubscription_refresh function (unlike other
> > functions in the subscriptioncmds.c) is using the global variable
> > "wrconn" instead of a local stack variable of the same name. I was
> > unable to think of any good reason why it would be deliberately doing
> > this, so my guess is that it is simply an accidental mistake that has
> > gone unnoticed because the compiler was silently equally happy just
> > using the global var.
> >
> > Apparently, this is not causing any reported problems because it seems
> > like the code has been this way for ~4 years [1].
> >
> > Even so, it doesn't look intentional to me and I felt that there may
> > be unknown consequences (e.g. resource leakage?) of just blatting over
> > the global var. So, PSA a small patch to make this
> > AlterSubscription_refresh function use a stack variable consistent
> > with the other nearby functions.
> >
> > Thoughts?
>
> +1. It looks like the global variable wrconn defined/declared in
> worker_internal.h/worker.c, is for logical apply/table sync worker and
> it doesn't make sense to use it for CREATE/ALTER subscription refresh
> code that runs on a backend. And I couldn't think of any unknown
> consequences/resource leakage, because that global variable is being
> used by different processes which have their own copy.
>
> And, the patch basically looks good to me, except a bit of rewording
> the commit message to something like "Use local variable wrconn in
> AlterSubscription_refresh instead of a global variable with the same
> name which is meant to be used for logical apply/table sync worker.
> Having the wrconn global variable in AlterSubscription_refresh doesn't
> cause any real issue as such but it keeps the code in
> subscriptioncmds.c inconsistent with other functions which use a local
> variable named wrconn." or some other better wording?
>
> Regression tests passed on my dev system with the patch.
>

Thanks for your feedback.

I can post another patch (or same patch with an improved commit
comment) later, but I will just wait a day first in case there is more
information to say about it. e.g. my suspicion that there would be
"consequences" seems to have come to fruition after all [1] although I
never would have thought of that tricky trigger / refresh scenario.

--
[1] 
https://www.postgresql.org/message-id/20210504043149.vg4w66vuh4qjrbph%40alap3.anarazel.de

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Andres Freund
Hi,

On 2021-04-30 11:51:07 -0700, Peter Geoghegan wrote:
> I think that it's reasonable to impose some cost on index AMs here,
> but that needs to be bounded sensibly and unambiguously. For example,
> it would probably be okay if you had either 6 byte or 8 byte TIDs, but
> no other variations. You could require index AMs (the subset of index
> AMs that are ever able to store 8 byte TIDs) to directly encode which
> width they're dealing with at the level of each IndexTuple. That would
> create some problems for nbtree deduplication, especially in boundary
> cases, but ISTM that you can manage the complexity by sensibly
> restricting how the TIDs work across the board.

> For example, the TIDs should always work like unsigned integers -- the
> table AM must be willing to work with that restriction.

Isn't that more a question of the encoding than the concrete representation?


> You'd then have posting lists tuples in nbtree whose TIDs were all
> either 6 bytes or 8 bytes wide, with a mix of each possible (though
> not particularly likely) on the same leaf page. Say when you have a
> table that exceeds the current MaxBlockNumber restrictions. It would
> be relatively straightforward for nbtree deduplication to simply
> refuse to mix 6 byte and 8 byte datums together to avoid complexity in
> boundary cases. The deduplication pass logic has the flexibility that
> this requires already.

Which nbtree cases do you think would have an easier time supporting
switching between 6 or 8 byte tids than supporting fully variable width
tids?  Given that IndexTupleData already is variable-width, it's not
clear to me why supporting two distinct sizes would be harder than a
fully variable size?  I assume it's things like BTDedupState->htids?



> > What's wrong with varlena headers? It would end up being a 1-byte
> > header in practically every case, and no variable-width representation
> > can do without a length word of some sort. I'm not saying varlena is
> > as efficient as some new design could hypothetically be, but it
> > doesn't seem like it'd be a big enough problem to stress about. If you
> > used a variable-width representation for integers, you might actually
> > save bytes in a lot of cases. An awful lot of the TIDs people store in
> > practice probably contain several zero bytes, and if we make them
> > wider, that's going to be even more true.
> 
> Maybe all of this is true, and maybe it works out to be the best path
> forward in the long term, all things considered. But whether or not
> that's true is crucially dependent on what real practical table AMs
> (of which there will only ever be a tiny number) actually need to do.
> Why should we assume that the table AM cannot accept some
> restrictions? What good does it do to legalistically define the
> problem as a problem for index AMs to solve?

I don't think anybody is arguing that AMs cannot accept any restrictions? I do
think it's pretty clear that it's not entirely obvious what the concrete set
of proper restrictions would be, such that we won't end up needing to
re-evaluate the limits in a few years.

If you add to that the fact that variable-width tids will often end up
considerably smaller than our current tids, it's not obvious why we should use
bitspace somewhere to indicate an 8 byte tid instead of a variable-width
tid?

Greetings,

Andres Freund




Re: AlterSubscription_refresh "wrconn" wrong variable?

2021-05-03 Thread Andres Freund
Hi,

On 2021-05-04 09:29:42 +1000, Peter Smith wrote:
> While reviewing some logical replication code I stumbled across a
> variable usage that looks suspicious to me.

> Note that the AlterSubscription_refresh function (unlike other
> functions in the subscriptioncmds.c) is using the global variable
> "wrconn" instead of a local stack variable of the same name. I was
> unable to think of any good reason why it would be deliberately doing
> this, so my guess is that it is simply an accidental mistake that has
> gone unnoticed because the compiler was silently equally happy just
> using the global var.

> Apparently, this is not causing any reported problems because it seems
> like the code has been this way for ~4 years [1].

This sounded vaguely familiar. After a bit of searching I found that's
because I debugged a crash related to it:
https://www.postgresql.org/message-id/2020215820.qihhrz7fayu6myfi%40alap3.anarazel.de

Peter?

Greetings,

Andres Freund




Re: .ready and .done files considered harmful

2021-05-03 Thread Andres Freund
Hi,

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
> I have two possible ideas for addressing this; perhaps other people
> will have further suggestions. A relatively non-invasive fix would be
> to teach pgarch.c how to increment a WAL file name. After archiving
> segment N, check using stat() whether there's an .ready file for
> segment N+1. If so, do that one next. If not, then fall back to
> performing a full directory scan.

Hm. I wonder if it'd not be better to determine multiple files to be
archived in one readdir() pass?


> As far as I can see, this is just cheap insurance. If archiving is
> keeping up, the extra stat() won't matter much. If it's not, this will
> save more system calls than it costs. Since during normal operation it
> shouldn't really be possible for files to show up in pg_wal out of
> order, I don't really see a scenario where this changes the behavior,
> either. If there are gaps in the sequence at startup time, this will
> cope with it exactly the same as we do now, except with a better
> chance of finishing before I retire.

There's definitely gaps in practice :(. Due to the massive performance
issues with archiving there are several tools that archive multiple
files as part of one archive command invocation (and mark the additional
archived files as .done immediately).


> However, that's still pretty wasteful. Every time we have to wait for
> the next file to be ready for archiving, we'll basically fall back to
> repeatedly scanning the whole directory, waiting for it to show up.

Hm. That seems like it's only an issue because .done and .ready are in
the same directory? Otherwise the directory would be empty while we're
waiting for the next file to be ready to be archived. I hate that that's
a thing but given the serial nature of archiving, with high per-call
overhead, I don't think it'd be ok to just break that without a
replacement :(.


> But perhaps we could work around this by allowing pgarch.c to access
> shared memory, in which case it could examine the current timeline
> whenever it wants, and probably also whatever LSNs it needs to know
> what's safe to archive.

FWIW, the shared memory stats patch implies doing that, since the
archiver reports stats.


> If we did that, could we just get rid of the .ready and .done files
> altogether? Are they just a really expensive IPC mechanism to avoid a
> shared memory connection, or is there some more fundamental reason why
> we need them?

What kind of shared memory mechanism are you thinking of? Due to
timelines and history files I don't think simple position counters would
be quite enough.

I think the aforementioned "batching" archive commands are part of the
problem :(.



> And is there any good reason why the archiver shouldn't be connected
> to shared memory? It is certainly nice to avoid having more processes
> connected to shared memory than necessary, but the current scheme is
> so inefficient that I think we end up worse off.

I think there is no fundamental reason for avoiding shared memory in the
archiver. I guess there's a minor robustness advantage, because the
forked shell to start the archive command won't be attached to shared
memory. But that's only until the child exec()s the archive command.

There is some minor performance advantage as well: not having to process
the often large and contended memory mapping for shared_buffers is
probably measurable - but swamped by the cost of needing to actually
archive the segment.


My only "concern" with doing anything around this is that I think the
whole approach of archive_command is just hopelessly broken, with even
just halfway busy servers only able to keep up archiving if they muck
around with postgres internal data during archive command execution. Add
to that how hard it is to write a robust archive command (e.g. the one
in our docs still suggests test ! -f && cp, which means that copy
failing in the middle yields an incomplete archive)...

While I don't think it's all that hard to design a replacement, it's
however likely still more work than addressing the O(n^2) issue, so ...

Greetings,

Andres Freund




Re: Replication slot stats misgivings

2021-05-03 Thread Masahiko Sawada
On Mon, May 3, 2021 at 10:21 PM Amit Kapila  wrote:
>
> On Mon, May 3, 2021 at 5:48 PM Masahiko Sawada  wrote:
> >
> > On Mon, May 3, 2021 at 2:27 PM Amit Kapila  wrote:
> > >
> > > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila  
> > > wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > > I am not sure if any of these alternatives are a good idea. What do
> > > > > > you think? Do you have any other ideas for this?
> > > > >
> > > > > I've been considering some ideas but haven't come up with a good one
> > > > > yet. It’s just an idea and not tested but how about having
> > > > > CreateDecodingContext() register a before_shmem_exit() callback with the
> > > > > decoding context to ensure that we send slot stats even on
> > > > > interruption. And FreeDecodingContext() cancels the callback.
> > > > >
> > > >
> > > > Is it a good idea to send stats while exiting and rely on the same? I
> > > > think before_shmem_exit is mostly used for the cleanup purpose so not
> > > > sure if we can rely on it for this purpose. I think we can't be sure
> > > > that in all cases we will send all stats, so maybe Vignesh's patch is
> > > > sufficient to avoid losing stats in cases where we would have sent a
> > > > large amount of data.
> > > >
> > >
> > > Sawada-San, any thoughts on this point?
> >
> > before_shmem_exit is mostly used for cleanup purposes, but how about
> > on_shmem_exit()? pgstats relies on that to send stats on
> > interruption. See pgstat_shutdown_hook().
> >
>
> Yeah, that is worth trying. Would you like to give it a try?

Yes.

In this approach, I think we will need to have a static pointer in
logical.c pointing to LogicalDecodingContext that we’re using. At
StartupDecodingContext(), we set the pointer to the just created
LogicalDecodingContext and register the callback so that we can refer
to the LogicalDecodingContext on that callback. And at
FreeDecodingContext(), we reset the pointer to NULL (however, since
FreeDecodingContext() is not called when an error happens we would
need to ensure resetting it somehow). But, after more thought, if we
have the static pointer in logical.c it would be better to have
a global function that sends slot stats based on the
LogicalDecodingContext pointed by the static pointer and can be called
by ReplicationSlotRelease(). That way, we don’t need to worry about
erroring out cases as well as interruption cases, although we need to
have a new static pointer.

I've attached a quick-hacked patch. I also incorporated the change
that calls UpdateDecodingStats() at FreeDecodingContext() so that we
can send slot stats also in the case where we spilled/streamed changes
but finished without commit/abort/prepare record.

>  I think
> it still might not cover the cases where we error out in the backend
> while decoding via APIs because at that time we won't exit, maybe for
> that we can consider Vignesh's patch.

Agreed. It seems to me that the approach of the attached patch is
better than the approach using on_shmem_exit(). So if we want to avoid
having the new static pointer and function for this purpose we can
consider Vignesh’s patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 00543ede45..f32a2da565 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -51,6 +51,12 @@ typedef struct LogicalErrorCallbackState
 	XLogRecPtr	report_location;
 } LogicalErrorCallbackState;
 
+/*
+ * Pointing to the currently-used logical decoding context andu sed to sent
+ * slot statistics on releasing slots.
+ */
+static LogicalDecodingContext *MyLogicalDecodingContext = NULL;
+
 /* wrappers around output plugin callbacks */
 static void output_plugin_error_callback(void *arg);
 static void startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -290,6 +296,13 @@ StartupDecodingContext(List *output_plugin_options,
 
 	MemoryContextSwitchTo(old_context);
 
+	/*
+	 * Keep holding the decoding context until freeing the decoding context
+	 * or releasing the logical slot.
+	 */
+	Assert(MyLogicalDecodingContext == NULL);
+	MyLogicalDecodingContext = ctx;
+
 	return ctx;
 }
 
@@ -626,10 +639,12 @@ FreeDecodingContext(LogicalDecodingContext *ctx)
 	if (ctx->callbacks.shutdown_cb != NULL)
 		shutdown_cb_wrapper(ctx);
 
+	UpdateDecodingStats(ctx);
 	ReorderBufferFree(ctx->reorder);
 	FreeSnapshotBuilder(ctx->snapshot_builder);
 	XLogReaderFree(ctx->reader);
 	MemoryContextDelete(ctx->context);
+	MyLogicalDecodingContext = NULL;
 }
 
 /*
@@ -1811,3 +1826,17 @@ UpdateDecodingStats(LogicalDecodingContext *ctx)
 	rb->totalTxns = 0;
 	rb->totalBytes = 0;
 }
+
+/*
+ * The function called at releasing a logical replication slot, sending 
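
Purely as an illustration of the release-time hook described above -- this is
an assumption on my part, not the remainder of the attached patch (which is cut
off here), and the function name is invented -- it might look roughly like:

void
LogicalDecodingReleaseSlotStats(void)
{
	/* Send any pending slot statistics for the context still in use;
	 * intended to be called from ReplicationSlotRelease(). */
	if (MyLogicalDecodingContext != NULL)
		UpdateDecodingStats(MyLogicalDecodingContext);
}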

Re: AlterSubscription_refresh "wrconn" wrong variable?

2021-05-03 Thread Bharath Rupireddy
On Tue, May 4, 2021 at 5:00 AM Peter Smith  wrote:
>
> While reviewing some logical replication code I stumbled across a
> variable usage that looks suspicious to me.
>
> Note that the AlterSubscription_refresh function (unlike other
> functions in the subscriptioncmds.c) is using the global variable
> "wrconn" instead of a local stack variable of the same name. I was
> unable to think of any good reason why it would be deliberately doing
> this, so my guess is that it is simply an accidental mistake that has
> gone unnoticed because the compiler was silently equally happy just
> using the global var.
>
> Apparently, this is not causing any reported problems because it seems
> like the code has been this way for ~4 years [1].
>
> Even so, it doesn't look intentional to me and I felt that there may
> be unknown consequences (e.g. resource leakage?) of just blatting over
> the global var. So, PSA a small patch to make this
> AlterSubscription_refresh function use a stack variable consistent
> with the other nearby functions.
>
> Thoughts?

+1. It looks like the global variable wrconn defined/declared in
worker_internal.h/worker.c, is for logical apply/table sync worker and
it doesn't make sense to use it for CREATE/ALTER subscription refresh
code that runs on a backend. And I couldn't think of any unknown
consequences/resource leakage, because that global variable is being
used by different processes which have their own copy.

And, the patch basically looks good to me, except a bit of rewording
the commit message to something like "Use local variable wrconn in
AlterSubscription_refresh instead of a global variable with the same
name which is meant to be used for logical apply/table sync worker.
Having the wrconn global variable in AlterSubscription_refresh doesn't
cause any real issue as such but it keeps the code in
subscriptioncmds.c inconsistent with other functions which use a local
variable named wrconn." or some other better wording?

Regression tests passed on my dev system with the patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Jeff Davis
On Mon, 2021-05-03 at 18:12 -0700, Peter Geoghegan wrote:
> But look at the details: tidbitmap.c uses MaxHeapTuplesPerPage as its
> MAX_TUPLES_PER_PAGE, which seems like a problem -- that's 291 with
> default BLCKSZ. I doubt that that restriction is something that you
> can afford to live with, even just for the time being.

Oh, you're right. I missed that MaxHeapTuplesPerPage was an order of
magnitude smaller.

> I don't see why that's necessarily a problem. Why, in general, should
> every table AM be able to support every index AM?

I didn't propose that every table AM needs to support every index type,
just that we should do something or at least document something. It's
pretty frustrating to have to fall back to manually managing the
indexes for dozens or hundreds of partitions when you make use of
multiple table AMs.

We might be conflating support for index AMs with support for features
like bitmap scans. If a certain kind of index fails at CREATE INDEX
time, that's painful for the partitioning case. But here it's more like
the CREATE INDEX would succeed but it would just never be used, which
is a different kind of frustrating.

Whatever we do or don't do, we should try to avoid surprises. I expect
table AMs to be used heavily with partitioning.

Regards,
Jeff Davis






Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 5:15 PM Jeff Davis  wrote:
> I don't see why in-core changes are a strict requirement. It doesn't
> make too much difference if a lossy TID doesn't correspond exactly to
> the columnar layout -- it should be fine as long as there's locality,
> right?

But look at the details: tidbitmap.c uses MaxHeapTuplesPerPage as its
MAX_TUPLES_PER_PAGE, which seems like a problem -- that's 291 with
default BLCKSZ. I doubt that that restriction is something that you
can afford to live with, even just for the time being.

> > It seems senseless to *require* table AMs to support something like a
> > bitmap scan.
>
> I am not yet convinced that it's "senseless", but it is optional and
> there's probably a reason that it's not required.

I mean it's senseless to require it in the general case.

> We still need to address the fact that two features have had a minor
> collision: indexes on a partitioned table and table AMs that don't
> necessarily support all index types. It's not good to just throw an
> error, because we could be forcing the user to manually manage the
> indexes on hundreds of partitions just because some tables have a
> different AM and it doesn't support the index type.

I don't see why that's necessarily a problem. Why, in general, should
every table AM be able to support every index AM?

I find it puzzling that nobody can find one single thing that the
table AM interface *can't* do. What are the chances that the
abstraction really is perfect?

> > I don't think it's a coincidence that GIN is the index AM
> > that looks like it presents at least 2 problems for the columnar
> > table
> > AM. To me this suggests that this will need a much higher level
> > discussion.
>
> One problem is that ginpostinglist.c restricts the use of offset
> numbers higher than MaxOffsetNumber - 1. At best, that's a confusing
> and unnecessary off-by-one error that we happen to be stuck with
> because it affects the on-disk format. Now that I'm past that
> particular confusion, I can live with a workaround until we do
> something better.
>
> What is the other problem with GIN?

I just meant the tidbitmap.c stuff, and so on. There is really one big
problem: GIN leverages the fact that bitmap scans are all that it
supports in many different ways. The reality is that it was designed
to work with heapam -- that's how it evolved. It seems rather unlikely
that problems are confined to this ginpostinglist.c representational
issue -- which is very surface-level. The only way to figure it out is
to try to make it work and see what happens, though, so perhaps it
isn't worth discussing any further until that happens.

--
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Jeff Davis
On Mon, 2021-05-03 at 15:07 -0700, Peter Geoghegan wrote:
> Sure, but it either makes sense for the columnar table AM to support
> bitmap scans (or some analogous type of scan that works only slightly
> differently) or it doesn't. It's not at all clear which it is right
> now.

It makes sense for my columnar table AM -- there's TID locality.

> If it makes sense then it will of course be necessary to describe
> what
> "bitmap scan" actually means with the columnar storage table AM (plus
> you'll still need to make some in-core changes to places like
> tidbitmap.c). OTOH if it doesn't make sense then that's that -- it's
> going to be a bit annoying in the partitioning scenario you describe,
> but some things are bound to be *inherently* impossible, so it can't
> be
> helped.

I don't see why in-core changes are a strict requirement. It doesn't
make too much difference if a lossy TID doesn't correspond exactly to
the columnar layout -- it should be fine as long as there's locality,
right?

> It seems senseless to *require* table AMs to support something like a
> bitmap scan. 

I am not yet convinced that it's "senseless", but it is optional and
there's probably a reason that it's not required.

We still need to address the fact that two features have had a minor
collision: indexes on a partitioned table and table AMs that don't
necessarily support all index types. It's not good to just throw an
error, because we could be forcing the user to manually manage the
indexes on hundreds of partitions just because some tables have a
different AM and it doesn't support the index type.

We probably want to do something about that, but as far as I can tell,
it's not a problem for columnar right now.

> I don't think it's a coincidence that GIN is the index AM
> that looks like it presents at least 2 problems for the columnar
> table
> AM. To me this suggests that this will need a much higher level
> discussion.

One problem is that ginpostinglist.c restricts the use of offset
numbers higher than MaxOffsetNumber - 1. At best, that's a confusing
and unnecessary off-by-one error that we happen to be stuck with
because it affects the on-disk format. Now that I'm past that
particular confusion, I can live with a workaround until we do
something better.

What is the other problem with GIN?

Regards,
Jeff Davis






AlterSubscription_refresh "wrconn" wrong variable?

2021-05-03 Thread Peter Smith
While reviewing some logical replication code I stumbled across a
variable usage that looks suspicious to me.

Note that the AlterSubscription_refresh function (unlike other
functions in the subscriptioncmds.c) is using the global variable
"wrconn" instead of a local stack variable of the same name. I was
unable to think of any good reason why it would be deliberately doing
this, so my guess is that it is simply an accidental mistake that has
gone unnoticed because the compiler was silently equally happy just
using the global var.

Apparently, this is not causing any reported problems because it seems
like the code has been this way for ~4 years [1].

Even so, it doesn't look intentional to me and I felt that there may
be unknown consequences (e.g. resource leakage?) of just blatting over
the global var. So, PSA a small patch to make this
AlterSubscription_refresh function use a stack variable consistent
with the other nearby functions.
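
For illustration, the essence of the change is just to give the function its
own connection variable, the same way the neighbouring functions do (a sketch,
not the attached patch itself; error paths omitted):

static void
AlterSubscription_refresh(Subscription *sub, bool copy_data)
{
	WalReceiverConn *wrconn = NULL;	/* local, no longer the worker's global */

	/* ... connect, fetch the remote table list, refresh, then ... */
	walrcv_disconnect(wrconn);
}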

Thoughts?

--
[1] 
https://github.com/postgres/postgres/commit/7c4f52409a8c7d85ed169bbbc1f6092274d03920#

Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Fix-wrconn.-Use-stack-variable.patch
Description: Binary data


Re: Performance Evaluation of Result Cache by using TPC-DS

2021-05-03 Thread David Rowley
Thanks for doing further analysis on this.

On Mon, 26 Apr 2021 at 20:31, Yuya Watari  wrote:
> Thank you for running experiments on your machine and I really
> appreciate your deep analysis.
>
> Your results are very interesting. In 5 queries, the result cache is
> cheaper but slower. Especially, in query 88, although the cost with
> result cache is cheaper, it has 34.23% degradation in query execution
> time. This is big regression.

That's certainly one side of it.   On the other side, it's pretty
important to also note that in 4 of 23 queries the result cache plan
executed faster but the planner costed it as more expensive.

I'm not saying the costing is perfect, but what I am saying is, as you
noted above, in 5 of 23 queries the result cache was cheaper and
slower, and, as I just noted, in 4 of 23 queries, result cache was
more expensive and faster.  We know that costing is never going to be
a perfect representation of what the execution time will be. However,
in these examples, we've just happened to get quite a good balance. If
we add a penalty to result cache then it'll just subtract from one
problem group and add to the other.

Overall, in my tests execution was 1.15% faster with result cache
enabled than it was without.

I could maybe get on board with adding a small fixed cost penalty. I'm
not sure exactly what it would be, maybe a cpu_tuple_cost instead of a
cpu_operator_cost and count it in for forming/deforming cached tuples.
I think the patch you wrote to add the resultcache_cost_factor is only
suitable for running experiments with.

The bigger concerns I have with the costing are around what happens
when an n_distinct estimate is far too low on one of the join columns.
I think it is more likely to be concerns like that one which would
cause us to default enable_resultcache to off.

David




Re: Simplify backend terminate and wait logic in postgres_fdw test

2021-05-03 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Apr 13, 2021 at 04:39:58PM +0900, Michael Paquier wrote:
>> Looks fine to me.  Let's wait a bit first to see if Fujii-san has any
>> objections to this cleanup as that's his code originally, from
>> 32a9c0bd.

> And hearing nothing, done.  The tests of postgres_fdw are getting much
> faster for me now, from basically 6s to 4s (actually that's roughly
> 1.8s of gain as pg_wait_until_termination waits at least 100ms,
> twice), so that's a nice gain.

The buildfarm is showing that one of these test queries is not stable
under CLOBBER_CACHE_ALWAYS:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-05-01%2007%3A44%3A47

of which the relevant part is:

diff -U3 
/home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out
 
/home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out
--- 
/home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out
   2021-05-01 03:44:45.022300613 -0400
+++ 
/home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out
2021-05-03 09:11:24.051379288 -0400
@@ -9215,8 +9215,7 @@
WHERE application_name = 'fdw_retry_check';
  pg_terminate_backend 
 ----------------------
- t
-(1 row)
+(0 rows)
 
 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.

I can reproduce that locally by setting

alter system set debug_invalidate_system_caches_always = 1;

and running "make installcheck" in contrib/postgres_fdw.
(It takes a good long time to run the whole test script
though, so you might want to see if running just these few
queries is enough.)

There's no evidence of distress in the postmaster log,
so I suspect this might just be a timing instability,
e.g. remote process already gone before local process
looks.  If so, it's probably hopeless to make this
test stable as-is.  Perhaps we should just take it out.

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 2:03 PM Jeff Davis  wrote:
> Another point: the idea of supporting only some kinds of indexes
> doesn't mix well with partitioning. If you declare an index on the
> parent, we should do something reasonable if one partition's table AM
> doesn't support that index AM.

Sure, but it either makes sense for the columnar table AM to support
bitmap scans (or some analogous type of scan that works only slightly
differently) or it doesn't. It's not at all clear which it is right now.

If it makes sense then it will of course be necessary to describe what
"bitmap scan" actually means with the columnar storage table AM (plus
you'll still need to make some in-core changes to places like
tidbitmap.c). OTOH if it doesn't make sense then that's that -- it's
going to be a bit annoying in the partitioning scenario you describe,
but some things are bound to be *inherently* impossible, so it can't be
helped.

It seems senseless to *require* table AMs to support something like a
bitmap scan. I don't think it's a coincidence that GIN is the index AM
that looks like it presents at least 2 problems for the columnar table
AM. To me this suggests that this will need a much higher level
discussion.


--
Peter Geoghegan




Re: Avoiding smgrimmedsync() during nbtree index builds

2021-05-03 Thread Melanie Plageman
So, I've written a patch which avoids doing the immediate fsync for
index builds either by using shared buffers or by queueing sync requests
for the checkpointer. If a checkpoint starts during the index build and
the backend is not using shared buffers for the index build, it will
need to do the fsync.

The reviewer will notice that _bt_load() extends the index relation for
the metapage before beginning the actual load of leaf pages but does not
actually write the metapage until the end of the index build. When using
shared buffers, it was difficult to create block 0 of the index after
creating all of the other blocks, as the block number is assigned inside
of ReadBuffer_common(), and it doesn't really work with the current
bufmgr API to extend a relation with a caller-specified block number.

I am not entirely sure of the correctness of doing an smgrextend() (when
not using shared buffers) without writing any WAL. However, the metapage
contents are not written until after WAL logging them later in
_bt_blwritepage(), so, perhaps it is okay?

I am also not fond of the change to the signature of _bt_uppershutdown()
that this implementation forces. Now, I must pass the shared buffer
(when using shared buffers) that I've reserved (pinned and locked) for
the metapage and, if not using shared buffers, the page I've allocated
for the metapage, before doing the index build to _bt_uppershutdown()
after doing the rest of the index build. I don't know that it seems
incorrect -- more that it feels a bit messy (and inefficient) to hold
onto that shared buffer or memory for the duration of the index build,
during which I have no intention of doing anything with that buffer or
memory. However, the alternative I devised was to change
ReadBuffer_common() or to add a new ReadBufferExtended() mode which
indicated that the caller would specify the block number and whether or
not it was an extend, which also didn't seem right.

For the extensions of the index done during index build, I use
ReadBufferExtended() directly instead of _bt_getbuf() for a few reasons.
I thought (am not sure) that I don't need to do
LockRelationForExtension() during index build. Also, I decided to use
RBM_ZERO_AND_LOCK mode so that I had an exclusive lock on the buffer
content instead of doing _bt_lockbuf() (which is what _bt_getbuf()
does). And, most of the places I added the call to ReadBufferExtended(),
the non-shared buffer code path is already initializing the page, so it
made more sense to just share that codepath.
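
For example, the extension step looks roughly like this (a sketch, not the
patch itself; "wstate" stands in for the nbtsort.c build state and error
handling is omitted):

	Buffer		buf;
	Page		page;

	/* Extend the index with a new, zeroed, exclusively locked buffer. */
	buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
							 RBM_ZERO_AND_LOCK, NULL);
	page = BufferGetPage(buf);
	_bt_pageinit(page, BufferGetPageSize(buf));

	/* ... fill the page, then ... */
	MarkBufferDirty(buf);
	UnlockReleaseBuffer(buf);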

I considered whether or not it made sense to add a new btree utility
function which calls ReadBufferExtended() in this way, however, I wasn't
sure how much that would buy me. The other place it might be able to be
used is btvacuumpage(), but that case is different enough that I'm not
even sure what the function would be called -- basically it would just
be an alternative to _bt_getbuf() for a couple of somewhat unrelated edge
cases.

On Thu, Jan 21, 2021 at 5:51 PM Andres Freund  wrote:
>
> Hi,
>
> On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > On 21/01/2021 22:36, Andres Freund wrote:
> > > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > > that the attached script takes 2m17.223s with the smgrimmedsync and
> > > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > > the former case, CPU bound in the latter.
> > >
> > > Creating lots of tables with indexes (directly or indirectly through
> > > relations having a toast table) is pretty common, particularly after the
> > > introduction of partitioning.
> > >

Moving index builds of indexes which would fit in shared buffers back
into shared buffers has the benefit of eliminating the need to write
them out and fsync them if they will be subsequently used and thus read
right back into shared buffers. This avoids some of the unnecessary
fsyncs Andres is talking about here as well as avoiding some of the
extra IO required to write them and then read them into shared buffers.

I have dummy criteria for whether or not to use shared buffers (if the
number of tuples to be indexed is > 1000). I am considering using a
threshold of some percentage of the size of shared buffers as the
actual criteria for determining where to do the index build.
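
A sketch of such a criterion (the 10% figure and the function name are
placeholders, not what the patch currently does):

static bool
index_build_fits_in_shared_buffers(BlockNumber estimated_index_blocks)
{
	/* Use shared buffers only if the index should stay well below
	 * shared_buffers; the 10% threshold is a placeholder. */
	return estimated_index_blocks < (BlockNumber) (NBuffers * 0.10);
}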

> > >
> > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > requests, the potential problems that I can see are:
> > >
> > > 1) redo point falls between the log_newpage() and the
> > > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > 2) redo point falls between write() and register_dirty_segment()
> > >
> > > But both of these are fine in the context of initially filling a newly
> > > created relfilenode, as far as I can tell? Otherwise the current
> > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> >
> > Hmm. If the redo point falls between write() and the
> > register_dirty_segment(), and the checkpointer finishes t

Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Tom Lane
Andres Freund  writes:
> On 2021-05-03 16:20:43 -0400, Tom Lane wrote:
>> Hmm, by that argument, any unexpected child PID in reaper() ought to be
>> grounds for a restart, regardless of its exit code.  Which'd be fine by
>> me.  I'm on board with being more restrictive about this, not less so.

> Are there any holes / races that could lead to this "legitimately"
> happening? To me the signal blocking looks like it should prevent that?

If it did happen it would imply a bug in the postmaster's child-process
bookkeeping.

(Or, I guess, some preloaded module deciding that launching its own
children was OK, whether or not it could find out whether they
succeeded.)

> I'm a bit worried that we'd find some harmless corner cases and end up
> adding a new instability. So personally I'd be inclined to just make it a
> warning, but ...

Well, I wouldn't recommend adding such a check in released branches,
but I'd be in favor of changing it in HEAD (or waiting till v15
opens).

Meanwhile, it seems like we both thought of complaining if the
postmaster's PID is 1.  I'm not quite sure if there are any
portability hazards from that, but on the whole it sounds like
a good way to detect badly-configured containers.
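
A minimal sketch of such a check, next to the existing can't-run-as-root test
in PostmasterMain() (the wording, and whether it should be a WARNING or a hard
error, are open questions):

	if (getpid() == 1)
		ereport(WARNING,
				(errmsg("postmaster is running as process ID 1 (init)"),
				 errhint("Run PostgreSQL under a minimal init process inside the container.")));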

regards, tom lane




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> Alvaro Herrera  writes:
>> On 2021-May-03, Andres Freund wrote:
>>> The issue turns out to be that postgres was in a container, with pid
>>> namespaces enabled. Because postgres was run directly in the container,
>>> without a parent process inside, it thus becomes pid 1. Which mostly
>>> works without a problem. Until, as the case here with the archive
>>> command, a sub-sub process exits while it still has a child. Then that
>>> child gets re-parented to postmaster (as init).
>
>> Hah .. interesting.  I think we should definitely make this work, since
>> containerized stuff is going to become more and more prevalent.
>
> How would we make it "work"?  The postmaster can't possibly be expected
> to know the right thing to do with unexpected children.
>
>> I guess we can do that in older releases, but do we really need it?  As
>> I understand, the only thing we need to do is verify that the dying PID
>> is a backend PID, and not cause a crash cycle if it isn't.

> Maybe we should put in a startup-time check, analogous to the
> can't-run-as-root test, that the postmaster mustn't be PID 1.

Given that a number of minimal `init`s already exist specifically for
the case of running a single application in a container, I don't think
Postgres should reinvent that wheel.  A quick eyeball of the output of
`apt search container init` on a Debian Bullseye system reveals at
least four:

 - https://github.com/Yelp/dumb-init
 - https://github.com/krallin/tini
 - https://github.com/fpco/pid1
 - https://github.com/openSUSE/catatonit

The first one also explains why there's more to being PID 1 than just
handling reparented children.

- ilmari
-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
 the consequences of."  -- Skud's Meta-Law




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Jeff Davis
On Fri, 2021-04-30 at 10:55 -0700, Jeff Davis wrote:
> On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote:
> > ISTM that would be up to the index AM.  We'd need some interlocks
> > on
> > which index AMs could be used with which table AMs in any case, I
> > think.
> 
> I'm not sure why? It seems like we should be able to come up with
> something that's generic enough.

Another point: the idea of supporting only some kinds of indexes
doesn't mix well with partitioning. If you declare an index on the
parent, we should do something reasonable if one partition's table AM
doesn't support that index AM.

Regards,
Jeff Davis






.ready and .done files considered harmful

2021-05-03 Thread Robert Haas
I and various colleagues of mine have from time to time encountered
systems that got a bit behind on WAL archiving, because the
archive_command started failing and nobody noticed right away.
Ideally, people should have monitoring for this and put it to rights
immediately, but some people don't. If those people happen to have a
relatively small pg_wal partition, they will likely become aware of
the issue when it fills up and takes down the server, but some users
provision disk space pretty generously and therefore nothing compels
them to notice the issue until they fill it up. In at least one case,
on a system that was actually generating a reasonable amount of WAL,
this took in excess of six months.

As you might imagine, pg_wal can get fairly large in such scenarios,
but the user is generally less concerned with solving that problem
than they are with getting the system back up. It is doubtless true
that the user would prefer to shrink the disk usage down to something
more reasonable over time, but on the facts as presented, it can't
really be an urgent issue for them. What they really need is just to free
up a little disk space somehow or other and then get archiving running
fast enough to keep up with future WAL generation. Regrettably, the
archiver cannot do this, not even if you set archive_command =
/bin/true, because the archiver will barely ever actually run the
archive_command. Instead, it will spend virtually all of its time
calling readdir(), because for some reason it feels a need to make a
complete scan of the archive_status directory before archiving a WAL
file, and then it has to make another scan before archiving the next
one.

Someone - and it's probably for the best that the identity of that
person remains unknown to me - came up with a clever solution to this
problem, which is now used almost as a matter of routine whenever this
comes up. You just run pg_archivecleanup on your pg_wal directory, and
then remove all the corresponding .ready files and call it a day. I
haven't scrutinized the code for pg_archivecleanup, but evidently it
avoids needing O(n^2) time for this and therefore can clean up the
whole directory in something like the amount of time the archiver
would take to deal with a single file. While this seems to be quite an
effective procedure and I have not yet heard any user complaints, it
seems disturbingly error-prone, and honestly shouldn't ever be
necessary. The issue here is only that pgarch.c acts as though after
archiving 000000010000000000000001, 000000010000000000000002, and then
000000010000000000000003, we have no idea what file we might need to
archive next. Could it, perhaps, be 000000010000000000000004? Only a
full directory scan will tell us the answer!

I have two possible ideas for addressing this; perhaps other people
will have further suggestions. A relatively non-invasive fix would be
to teach pgarch.c how to increment a WAL file name. After archiving
segment N, check using stat() whether there's an .ready file for
segment N+1. If so, do that one next. If not, then fall back to
performing a full directory scan. As far as I can see, this is just
cheap insurance. If archiving is keeping up, the extra stat() won't
matter much. If it's not, this will save more system calls than it
costs. Since during normal operation it shouldn't really be possible
for files to show up in pg_wal out of order, I don't really see a
scenario where this changes the behavior, either. If there are gaps in
the sequence at startup time, this will cope with it exactly the same
as we do now, except with a better chance of finishing before I
retire.
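
To make the idea concrete, here is a standalone sketch of the
increment-and-stat step (not pgarch.c code; it assumes the default 16MB
segment size, i.e. 256 segments per "xlog id", and the usual
pg_wal/archive_status layout, and the names are illustrative only):

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

#define SEGMENTS_PER_XLOGID 256		/* 4GB / 16MB default segment size */

/*
 * Given the segment file name we just archived, compute the next segment's
 * name and check whether its .ready file already exists -- one stat() call
 * instead of a full readdir() pass.  On any failure the caller falls back
 * to scanning the directory as today.
 */
static bool
next_segment_is_ready(const char *archive_status_dir, const char *last_seg)
{
	unsigned int tli, log, seg;
	char		next_name[64];
	char		ready_path[1024];
	struct stat st;

	/* Segment names are three 8-digit hex fields: timeline, log, seg. */
	if (sscanf(last_seg, "%8X%8X%8X", &tli, &log, &seg) != 3)
		return false;

	/* Increment the segment number, carrying into the "log" field. */
	if (++seg >= SEGMENTS_PER_XLOGID)
	{
		seg = 0;
		log++;
	}

	snprintf(next_name, sizeof(next_name), "%08X%08X%08X", tli, log, seg);
	snprintf(ready_path, sizeof(ready_path), "%s/%s.ready",
			 archive_status_dir, next_name);

	return stat(ready_path, &st) == 0;
}

int
main(void)
{
	if (next_segment_is_ready("pg_wal/archive_status",
							  "000000010000000000000003"))
		printf("archive 000000010000000000000004 next\n");
	else
		printf("fall back to a full directory scan\n");
	return 0;
}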

However, that's still pretty wasteful. Every time we have to wait for
the next file to be ready for archiving, we'll basically fall back to
repeatedly scanning the whole directory, waiting for it to show up.
And I think that we can't get around that by just using stat() to look
for the appearance of the file we expect to see, because it's possible
that we might be doing all of this on a standby which then gets
promoted, or some upstream primary gets promoted, and WAL files start
appearing on a different timeline, making our prediction of what the
next filename will be incorrect. But perhaps we could work around this
by allowing pgarch.c to access shared memory, in which case it could
examine the current timeline whenever it wants, and probably also
whatever LSNs it needs to know what's safe to archive. If we did that,
could we just get rid of the .ready and .done files altogether? Are
they just a really expensive IPC mechanism to avoid a shared memory
connection, or is there some more fundamental reason why we need them?
And is there any good reason why the archiver shouldn't be connected
to shared memory? It is certainly nice to avoid having more processes
connected to shared memory than necessary, but the current scheme is
so inefficient that I think we end up worse off.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Andres Freund
Hi,

On 2021-05-03 16:20:43 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
> >> And who's to say that ignoring unexpected child deaths is okay,
> >> anyway?  We could hardly be sure that the dead process hadn't been
> >> connected to shared memory.
> 
> > I don't think checking the exit status of unexpected children to see
> > whether we should crash-restart out of that concern is meaningful: We
> > don't know that the child didn't do anything bad with shared memory when
> > they exited with exit(1), instead of exit(2).
> 
> Hmm, by that argument, any unexpected child PID in reaper() ought to be
> grounds for a restart, regardless of its exit code.  Which'd be fine by
> me.  I'm on board with being more restrictive about this, not less so.

Are there any holes / races that could lead to this "legitimately"
happening? To me the signal blocking looks like it should prevent that?

I'm a bit worried that we'd find some harmless corner cases and end up
adding a new instability. So personally I'd be inclined to just make it a
warning, but ...

Greetings,

Andres Freund




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Andrew Dunstan


On 5/3/21 3:07 PM, Andres Freund wrote:
> Hi,
>
> A colleague debugged an issue where their postgres was occasionally
> crash-restarting under load.
>
> The cause turned out to be that a relatively complex archive_command was
> used, which could in some rare circumstances have a bash subshell
> pipeline not succeed.  It wasn't at all obvious why that'd cause a crash
> though - the archive command handles the error.
>
> The issue turns out to be that postgres was in a container, with pid
> namespaces enabled. Because postgres was run directly in the container,
> without a parent process inside, it thus becomes pid 1. Which mostly
> works without a problem. Until, as the case here with the archive
> command, a sub-sub process exits while it still has a child. Then that
> child gets re-parented to postmaster (as init).
>
> Such a child is likely to have exited not just with 0 or 1, but
> something else. As the pid won't match anything in reaper(), we'll go to
> CleanupBackend(). Where any exit status but 0/1 will unconditionally
> trigger a restart:
>
>   if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
>   {
>   HandleChildCrash(pid, exitstatus, _("server process"));
>   return;
>   }
>
>
> This kind of thing is pretty hard to debug, because it's not easy to
> even figure out what the "crashing" pid belonged to.
>
> I wonder if we should work a bit harder to try to identify whether an
> exiting process was a "server process" before identifying it as such?
>
> And perhaps we ought to warn about postgres running as "init" unless we
> make that robust?
>

Hmm, my initial reaction was if we detect very early on we're PID 1 then
fork and do all our work in the child, and in the parent just wait until
there are no more children. Not sure if that's feasible but I thought
I'd throw it out there.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Alvaro Herrera
On 2021-May-03, Andres Freund wrote:

> Using / for a single statically linked binary that e.g. just serves a
> bunch of hardcoded files is one thing. Putting actual data in / for
> something like postgres another.

Yeah, I just had a word with them and I had misunderstood what they were
doing.  They were attempting something completely insane and pointless,
so I'm going to leave it at that.

> I think there's a few more special cases when running as init, other
> than reparenting. E.g. I think the default signal handlers are
> different, the kernel kills the process in fewer cases etc. I am not
> opposed to adding support for it, but I think it'd need a bit of care.

Ok, we can leave that as future development then.

> Given that we probably shouldn't just break things in a minor release by
> refusing to run as 1, a warning seems to be the easiest thing for now?

WFM.

-- 
Álvaro Herrera   Valdivia, Chile
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Andres Freund
Hi,

On 2021-05-03 15:25:53 -0400, Alvaro Herrera wrote:
> I also heard a story where things ran into trouble (I didn't get the
> whole story of *what* was the problem with that) because the datadir is /.
> I know -- nobody in their right mind would put the datadir in / -- but
> apparently in the container world that's not something as stupid as it
> sounds.  That's of course not related to what you describe here
> code-wise, but the underlying reason is the same.

It still seems pretty insane in the container world too. Postgres needs
shared libraries (even if you managed to link postgres itself
statically, something we do not support). Postgres needs to write to the
data directory. Putting shared libraries inside the data directory seems
like a bad idea.

Using / for a single statically linked binary that e.g. just serves a
bunch of hardcoded files is one thing. Putting actual data in / for
something like postgres another.


> > I wonder if we should work a bit harder to try to identify whether an
> > exiting process was a "server process" before identifying it as such?
> 
> Well, we've never made any effort there because it just wasn't possible.
> Nobody ever had postmaster also be init .. until containers.  Let's fix
> it.

> > And perhaps we ought to warn about postgres running as "init" unless we
> > make that robust?
> 
> I guess we can do that in older releases, but do we really need it?  As
> I understand, the only thing we need to do is verify that the dying PID
> is a backend PID, and not cause a crash cycle if it isn't.

I think there's a few more special cases when running as init, other
than reparenting. E.g. I think the default signal handlers are
different, the kernel kills the process in fewer cases etc. I am not
opposed to adding support for it, but I think it'd need a bit of care.

Given that we probably shouldn't just break things in a minor release by
refusing to run as 1, a warning seems to be the easiest thing for now?

Greetings,

Andres Freund




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Tom Lane
Andres Freund  writes:
> On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
>> And who's to say that ignoring unexpected child deaths is okay,
>> anyway?  We could hardly be sure that the dead process hadn't been
>> connected to shared memory.

> I don't think checking the exit status of unexpected children to see
> whether we should crash-restart out of that concern is meaningful: We
> don't know that the child didn't do anything bad with shared memory when
> they exited with exit(1), instead of exit(2).

Hmm, by that argument, any unexpected child PID in reaper() ought to be
grounds for a restart, regardless of its exit code.  Which'd be fine by
me.  I'm on board with being more restrictive about this, not less so.

> Do you feel the same about having different logging between the "known"
> and "unknown" child processes?

No objection to logging such cases more clearly, for sure.

regards, tom lane




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Andres Freund
Hi,

On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > On 2021-May-03, Andres Freund wrote:
> >> The issue turns out to be that postgres was in a container, with pid
> >> namespaces enabled. Because postgres was run directly in the container,
> >> without a parent process inside, it thus becomes pid 1. Which mostly
> >> works without a problem. Until, as the case here with the archive
> >> command, a sub-sub process exits while it still has a child. Then that
> >> child gets re-parented to postmaster (as init).
>
> > Hah .. interesting.  I think we should definitely make this work, since
> > containerized stuff is going to become more and more prevalent.
>
> How would we make it "work"?  The postmaster can't possibly be expected
> to know the right thing to do with unexpected children.

Not saying that we should, but we could check if we're pid 1 / init, and
just warn about children we don't know anything about. Which we could
detect by iterating over BackendList/BackgroundWorkerList before
crash-restarting in CleanupBackend().  Then we'd not lose reliability
in the "normal" case, while avoiding spurious crash-restarts in the
container case.
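
Something like this, roughly (completely untested; it leans on postmaster.c's
BackendList/Backend and LogChildExit(), and BackgroundWorkerList would need
the same treatment):

    static bool
    IsKnownChildPid(int pid)
    {
        dlist_iter  iter;

        dlist_foreach(iter, &BackendList)
        {
            Backend    *bp = dlist_container(Backend, elem, iter.cur);

            if (bp->pid == pid)
                return true;
        }
        /* a similar loop over BackgroundWorkerList would go here */
        return false;
    }

    /* ... and in CleanupBackend(), before treating the exit as a crash: */
    if (getpid() == 1 && !IsKnownChildPid(pid))
    {
        LogChildExit(LOG, _("untracked child process"), pid, exitstatus);
        return;
    }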

I'm not quite sure I buy the reliability argument, tbh: The additional
process exits we see as pid 1 are after all process exits that we'd not
see if we weren't pid 1. And if we're not pid 1 then there really should
never be any "unexpected children" - we know what processes postmaster
itself forked after all. So where would unexpected children come from,
except reparenting?


> And who's to say that ignoring unexpected child deaths is okay,
> anyway?  We could hardly be sure that the dead process hadn't been
> connected to shared memory.

I don't think checking the exit status of unexpected children to see
whether we should crash-restart out of that concern is meaningful: We
don't know that the child didn't do anything bad with shared memory when
they exited with exit(1), instead of exit(2).


Random thought: I wonder if we ought to set madvise(MADV_DONTFORK) on
shared memory in postmaster children, where available. Then we could be
fairly certain that there aren't processes we don't know about that are
attached to shared memory (unless there's some nasty
shared_preload_library forking early during backend startup - but that's
hard to get excited about).
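
The call itself would be trivial where it exists -- something like this
(Linux-only sketch; "base" and "size" stand in for wherever we track the
segment's address and length):

    #include <sys/mman.h>

    static void
    shmem_forbid_inherit(void *base, size_t size)
    {
    #ifdef MADV_DONTFORK
        if (madvise(base, size, MADV_DONTFORK) != 0)
            elog(LOG, "madvise(MADV_DONTFORK) failed: %m");
    #endif
    }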


> > I guess we can do that in older releases, but do we really need it?  As
> > I understand, the only thing we need to do is verify that the dying PID
> > is a backend PID, and not cause a crash cycle if it isn't.
>
> I think that'd be a net reduction in reliability, not an improvement.
> In most scenarios it'd do little except mask bugs.

Do you feel the same about having different logging between the "known"
and "unknown" child processes?


Personally I don't think it's of utmost importance to support running as
pid 1. But we should at least print useful log messages about what
processes exited...


Greetings,

Andres Freund




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Alvaro Herrera
On 2021-May-03, Tom Lane wrote:

> Alvaro Herrera  writes:
> > I also heard a story where things ran into trouble (I didn't get the
> > whole story of *what* was the problem with that) because the datadir is /.
> 
> BTW, as far as that goes, I think the general recommendation is that
> the datadir shouldn't be a mount point, because bad things happen if
> you mount or unmount the drive while the postmaster is up.  I could
> see enforcing that, if we could find a reasonably platform-independent
> way to do it.

/ is not a mount point; it's just that the container system binds (?)
some different directory as / for the process to run in. I suppose it
must be similar to chrooting to /, but I'm not sure if it's exactly
that.

> (Of course, / can't be unmounted, so I wonder exactly what bad thing
> happened in that story.)

It's not related to unmounting.  I'll try to get the details.

-- 
Álvaro Herrera   Valdivia, Chile




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 12:06 PM Matthias van de Meent
 wrote:
> One could relatively easily disable bitmap scans on the table AM by
> not installing the relevant bitmap support functions on the registered
> TableAM structure, and thus not touch that problem.

I have no idea how much it'll hurt things if the column store table AM
supports no analogue of bitmap scans.

> Some indexes will
> then never be accessed due to the bitmap scan requirement of their
> IndexAM (gin, brin, bloom, to name a few), and as such won't make
> sense to create on that table, but that's about it I think.

Right. More formally: if this restriction is accepted by a table AM
(say the column store table AM), then any index AM with amgettuple set
to NULL cannot ever be used (it should probably be treated as an error
condition at CREATE INDEX time).
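
Roughly this kind of check at CREATE INDEX time (a sketch only -- the field
names are real IndexAmRoutine/TableAmRoutine members, but the surrounding
code and error wording are made up):

    if (indexRelation->rd_indam->amgettuple == NULL &&
        heapRelation->rd_tableam->scan_bitmap_next_block == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("index access method can only be used via bitmap scans, "
                        "which this table access method does not support")));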

If this really is the best path forward (again, no idea if that's
true) then that would conveniently make it pretty easy to solve the
GIN posting list issue raised by Jeff. It just wouldn't matter -- GIN
indexes cannot be used with the column store anyway.

-- 
Peter Geoghegan




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Tom Lane
Alvaro Herrera  writes:
> I also heard a story where things ran into trouble (I didn't get the
> whole story of *what* was the problem with that) because the datadir is /.

BTW, as far as that goes, I think the general recommendation is that
the datadir shouldn't be a mount point, because bad things happen if
you mount or unmount the drive while the postmaster is up.  I could
see enforcing that, if we could find a reasonably platform-independent
way to do it.

(Of course, / can't be unmounted, so I wonder exactly what bad thing
happened in that story.)

regards, tom lane




Re: Regex performance regression induced by match-all code

2021-05-03 Thread Joel Jacobson
On Mon, May 3, 2021, at 21:38, Tom Lane wrote:
> "Joel Jacobson" mailto:joel%40compiler.org>> writes:
> > On Sun, May 2, 2021, at 18:53, Tom Lane wrote:
> >> fix-exponential-cost-of-checkmatchall-2.patch
> 
> > Successfully tested.
> 
> Again, thanks for checking!

You're welcome, thanks for coding!

/Joel

Re: Regex performance regression induced by match-all code

2021-05-03 Thread Tom Lane
"Joel Jacobson"  writes:
> On Sun, May 2, 2021, at 18:53, Tom Lane wrote:
>> fix-exponential-cost-of-checkmatchall-2.patch

> Successfully tested.

Again, thanks for checking!

regards, tom lane




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-May-03, Andres Freund wrote:
>> The issue turns out to be that postgres was in a container, with pid
>> namespaces enabled. Because postgres was run directly in the container,
>> without a parent process inside, it thus becomes pid 1. Which mostly
>> works without a problem. Until, as the case here with the archive
>> command, a sub-sub process exits while it still has a child. Then that
>> child gets re-parented to postmaster (as init).

> Hah .. interesting.  I think we should definitely make this work, since
> containerized stuff is going to become more and more prevalent.

How would we make it "work"?  The postmaster can't possibly be expected
to know the right thing to do with unexpected children.

> I guess we can do that in older releases, but do we really need it?  As
> I understand, the only thing we need to do is verify that the dying PID
> is a backend PID, and not cause a crash cycle if it isn't.

I think that'd be a net reduction in reliability, not an improvement.
In most scenarios it'd do little except mask bugs.  And who's to say
that ignoring unexpected child deaths is okay, anyway?  We could hardly
be sure that the dead process hadn't been connected to shared memory.

Maybe we should put in a startup-time check, analogous to the
can't-run-as-root test, that the postmaster mustn't be PID 1.

regards, tom lane




Re: Regex performance regression induced by match-all code

2021-05-03 Thread Joel Jacobson
On Sun, May 2, 2021, at 18:53, Tom Lane wrote:
> fix-exponential-cost-of-checkmatchall-2.patch

Successfully tested.

SELECT
  is_match <> (subject ~ pattern),
  captured IS DISTINCT FROM regexp_match(subject, pattern, flags),
  COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2;
 ?column? | ?column? |  count
----------+----------+---------
 f        | f        | 3253889
(1 row)

Time: 94149.542 ms (01:34.150)
Time: 91565.305 ms (01:31.565)

SELECT regexp_matches('', '(.|){20}','');
 regexp_matches
----------------
 {""}
(1 row)

Time: 0.541 ms

SELECT regexp_matches('', '(.|){25}','');
 regexp_matches
----------------
 {""}
(1 row)

Time: 0.724 ms

SELECT regexp_matches('', '(.|){27}','');
 regexp_matches
----------------
 {""}
(1 row)

Time: 0.782 ms

/Joel

Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Alvaro Herrera
On 2021-May-03, Andres Freund wrote:

> The issue turns out to be that postgres was in a container, with pid
> namespaces enabled. Because postgres was run directly in the container,
> without a parent process inside, it thus becomes pid 1. Which mostly
> works without a problem. Until, as the case here with the archive
> command, a sub-sub process exits while it still has a child. Then that
> child gets re-parented to postmaster (as init).

Hah .. interesting.  I think we should definitely make this work, since
containerized stuff is going to become more and more prevalent.

I also heard a story where things ran into trouble (I didn't get the
whole story of *what* was the problem with that) because the datadir is /.
I know -- nobody in their right mind would put the datadir in / -- but
apparently in the container world that's not something as stupid as it
sounds.  That's of course not related to what you describe here
code-wise, but the underlying reason is the same.

> I wonder if we should work a bit harder to try to identify whether an
> exiting process was a "server process" before identifying it as such?

Well, we've never made any effort there because it just wasn't possible.
Nobody ever had postmaster also be init .. until containers.  Let's fix
it.

> And perhaps we ought to warn about postgres running as "init" unless we
> make that robust?

I guess we can do that in older releases, but do we really need it?  As
I understand, the only thing we need to do is verify that the dying PID
is a backend PID, and not cause a crash cycle if it isn't.

-- 
Álvaro Herrera   Valdivia, Chile




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, May 3, 2021 at 2:48 PM Tom Lane  wrote:
> > I'm still of the opinion that slicing and dicing this at the per-GUC
> > level is a huge waste of effort.  Just invent one role that lets
> > grantees set any GUC, document it as being superuser-equivalent,
> > and be done.
> 
> If you want to grant someone full superuser rights, you can do that
> already. The trick is what to do when you want someone to be able to
> administer the cluster in a meaningful way without giving them full
> superuser rights.

I would suggest that both are useful, but the one-big-hammer does
nothing to answer the use-case which was brought up on this particular
thread (which is also certainly not the first time this has been
desired).  Instead, I would imagine that there would be a set of
predefined roles for the capabilities and then we might have another
role which is akin to 'pg_monitor' but is 'pg_admin' which is GRANT'd a
bunch of those capabilities and explicitly documented to be able to
become a superuser if they wished to.

Perhaps we would also have a "pg_notsuperuser_admin" which would be
GRANT'd all the capabilities, excluding the ones that could be used to
gain superuser access.

As has also been discussed recently, one of the big missing capabilities
for a "pg_notsuperuser_admin" is a 'create role' capability.  I realize
that's not exactly the same as GUCs but it's a big part of what's
missing to make all of this "run a service where the 'DBA' can do
everything except get out to the OS" stuff work out of the box.

Thanks,

Stephen




PG in container w/ pid namespace is init, process exits cause restart

2021-05-03 Thread Andres Freund
Hi,

A colleague debugged an issue where their postgres was occasionally
crash-restarting under load.

The cause turned out to be that a relatively complex archive_command was
used, which could in some rare circumstances have a bash subshell
pipeline not succeed.  It wasn't at all obvious why that'd cause a crash
though - the archive command handles the error.

The issue turns out to be that postgres was in a container, with pid
namespaces enabled. Because postgres was run directly in the container,
without a parent process inside, it thus becomes pid 1. Which mostly
works without a problem. Until, as the case here with the archive
command, a sub-sub process exits while it still has a child. Then that
child gets re-parented to postmaster (as init).
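
The mechanism is easy to reproduce outside postgres with a tiny standalone
program (just an illustration, not the actual archive_command; on most
systems the grandchild's new parent is pid 1, which in the container setup
described here is the postmaster):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int
    main(void)
    {
        pid_t   child = fork();

        if (child == 0)
        {
            if (fork() == 0)
            {
                /* grandchild: orphaned once the middle process exits */
                sleep(1);
                printf("grandchild ppid is now %d\n", (int) getppid());
                _exit(7);       /* shows up at pid 1 as an "unexpected" exit */
            }
            _exit(0);           /* middle process exits immediately */
        }
        waitpid(child, NULL, 0);
        sleep(2);               /* give the grandchild time to report */
        return 0;
    }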

Such a child is likely to have exited not just with 0 or 1, but
something else. As the pid won't match anything in reaper(), we'll go to
CleanupBackend(). Where any exit status but 0/1 will unconditionally
trigger a restart:

if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
{
HandleChildCrash(pid, exitstatus, _("server process"));
return;
}


This kind of thing is pretty hard to debug, because it's not easy to
even figure out what the "crashing" pid belonged to.

I wonder if we should work a bit harder to try to identify whether an
exiting process was a "server process" before identifying it as such?

And perhaps we ought to warn about postgres running as "init" unless we
make that robust?

Greetings,

Andres Freund




Re: [PATCH] Identify LWLocks in tracepoints

2021-05-03 Thread Peter Eisentraut

On 30.04.21 05:22, Craig Ringer wrote:

On Thu, 29 Apr 2021 at 15:31, Peter Eisentraut
 wrote:

So if you could produce a separate patch that adds the
_ENABLED guards targeting PG14 (and PG13), that would be helpful.


Here is a proposed patch for this.


LGTM.

Applies and builds fine on master and (with default fuzz) on
REL_13_STABLE. Works as expected.


committed




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Matthias van de Meent
On Mon, 3 May 2021 at 20:43, Peter Geoghegan  wrote:
>
> On Mon, May 3, 2021 at 10:57 AM Jeff Davis  wrote:
> > For the purposes of this discussion, what's making my life difficult is
> > that we don't have a good definition for TID, leaving me with two
> > options:
> >
> >   1. be overly conservative, accept MaxOffsetNumber=2048, wasting a
> > bunch of address space; or
>
> tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE,
> which is much lower than MaxOffsetNumber (it's almost 10x lower). I
> wonder what that means for your design.

One could relatively easily disable bitmap scans on the table AM by
not installing the relevant bitmap support functions on the registered
TableAM structure, and thus not touch that problem. Some indexes will
then never be accessed due to the bitmap scan requirement of their
IndexAM (gin, brin, bloom, to name a few), and as such won't make
sense to create on that table, but that's about it I think. We might
want to add some safeguards so that bitmapscan-only index AMs aren't used
on table AMs that don't support it, but I believe that's a nice-to-have
and not critical, on a similar level to the deduplication of constraint
indexes.
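
On the table AM side that would look roughly like this (hypothetical AM
name; only the relevant fields shown, a real TableAmRoutine needs many more
callbacks):

    static const TableAmRoutine columnar_am_methods = {
        .type = T_TableAmRoutine,

        /* ... all of the required callbacks ... */

        /* opt out of bitmap scans entirely */
        .scan_bitmap_next_block = NULL,
        .scan_bitmap_next_tuple = NULL,
    };

    Datum
    columnar_tableam_handler(PG_FUNCTION_ARGS)
    {
        PG_RETURN_POINTER(&columnar_am_methods);
    }

IIRC the planner (get_relation_info()) already refuses to consider bitmap
paths when scan_bitmap_next_block is NULL, so on the planning side this
should be enough.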

With regards,

Matthias van de Meent




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Robert Haas
On Mon, May 3, 2021 at 2:48 PM Tom Lane  wrote:
> I'm still of the opinion that slicing and dicing this at the per-GUC
> level is a huge waste of effort.  Just invent one role that lets
> grantees set any GUC, document it as being superuser-equivalent,
> and be done.

If you want to grant someone full superuser rights, you can do that
already. The trick is what to do when you want someone to be able to
administer the cluster in a meaningful way without giving them full
superuser rights.

I agree that in some cases it's fine to have predefined roles that are
known to permit easy escalation to superuser privileges, like
pg_execute_server_program. It doesn't provide any real security, but
like you said, it can help prevent mistakes. However, there is a real
use cases for a privileged user who cannot be permitted to escalate to
superuser or to the OS account, but still needs to be able to do some
administration of the cluster. The scenario Mark laid out in his
original post is very common. In fact, it may already be the dominant
model for PostgreSQL deployment, and if it isn't now, it will be in 5
years. Letting each individual company that's providing a hosted
PostgreSQL solution hack up its own solution to that problem, all of
which are subtly incompatible with each other and with upstream, is
not good for users or the project.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Some oversights in query_id calculation

2021-05-03 Thread Bruce Momjian
On Sun, May  2, 2021 at 12:27:37PM +0800, Julien Rouhaud wrote:
> Hi Aleksander,
> 
> On Wed, Apr 28, 2021 at 03:22:39PM +0300, Aleksander Alekseev wrote:
> > Hi Julien,
> > 
> > > You should see failures doing a check-world or simply a make -C
> > > contrib/pg_stat_statements check
> > 
> > Sorry, my bad. I was running make check-world, but did it with -j4 flag
> > which was a mistake.
> > 
> > The patch is OK.
> 
> Thanks for reviewing!

Patch applied, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Robert Haas
On Mon, May 3, 2021 at 2:41 PM Stephen Frost  wrote:
> In general, I agree that we should be looking at predefined roles as
> being similar to the Linux capabilities system- defining certain kinds
> of operations which the user who has that role is allowed to do, and
> then both in-core and extensions can make decisions based on what
> capabilities the user has been GRANT'd.

Cool.

> Hopefully that would limit the amount of cases where a given capability
> ends up being overly broad while at the same time allowing extensions to
> sensibly be able to use the defined capabilities for their own needs.

Yeah. I think it will be a little tricky to get right, as some of the
cases are a bit subjective, I think.

> As we do in other places, we should make it clear when a certain
> capability allows a user with that capability to gain superuser access
> as that may not always be clear to a user.

+1.

> One thing that seems missing from this discussion and is part of what
> paused my effort on the 'admin' role proposed towards the end of the
> last cycle is that we really need to consider how this all plays with
> ALTER SYSTEM and not just SUSET GUCs but also other (eg: POSTMASTER,
> SIGHUP) GUCs.  That is- imv we should have a sensible solution for
> more-or-less all GUCs and which would allow a non-superuser to be able
> to set POSTMASTER and SIGHUP GUCs (and perhaps others..) through
> ALTER SYSTEM.

I missed the earlier discussion on this topic, but I agree that this
is very important. I think that the discussion of capabilities might
help us get there. For instance, if I'm a service provider, and I give
user "bob" the pg_put_whatever_you_want_in_the_server_log role, and
GUCs are tagged so we know what GUCs that affects, then it seems
natural to me to allow Bob to set those GUCs via ALTER SYSTEM as well
as via ALTER USER or ALTER DATABASE. However, if I don't give him the
pg_frob_shell_commands role, he can't set archive_command.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Tom Lane
Stephen Frost  writes:
> One thing that seems missing from this discussion and is part of what
> paused my effort on the 'admin' role proposed towards the end of the
> last cycle is that we really need to consider how this all plays with
> ALTER SYSTEM and not just SUSET GUCs but also other (eg: POSTMASTER,
> SIGHUP) GUCs.

Yeah, I'd meant to bring that up too.  The ability to use ALTER
SYSTEM freely is probably a much bigger use-case than messing with
SUSET variables within one's own session.

I'm still of the opinion that slicing and dicing this at the per-GUC
level is a huge waste of effort.  Just invent one role that lets
grantees set any GUC, document it as being superuser-equivalent,
and be done.

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 10:57 AM Jeff Davis  wrote:
> For the purposes of this discussion, what's making my life difficult is
> that we don't have a good definition for TID, leaving me with two
> options:
>
>   1. be overly conservative, accept MaxOffsetNumber=2048, wasting a
> bunch of address space; or

tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE,
which is much lower than MaxOffsetNumber (it's almost 10x lower). I
wonder what that means for your design.

>   2. risk the mapping between TID and row number could break at any
> time

Though this clearly is the immediate problem for you, I think that the
real problem is that the table AM kind of tacitly assumes that there
is a universality to item pointer TIDs -- which is obviously not true.
It might be useful for you to know what assumptions index AMs can make
about how TIDs work in general, but I think that you really need an
index-AM level infrastructure that advertises the capabilities of each
index AM with respect to handling each possible variation (I suppose
you have heapam, 6 byte uint, and maybe 8 byte uint).

The easiest reasonable short term design for you is probably to find a
way to make 6 byte TIDs into 48-bit unsigned integers (perhaps only
conceptually), at least in contexts where the columnar table AM is
used. You'll still need the index AM for that. This at least makes
64-bit TID-like identifiers a relatively simple conceptual shift.
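
Something like this, conceptually (a sketch only; the NoCheck accessors are
from storage/itemptr.h, the function names are made up, and reserved values
like InvalidOffsetNumber still need handling):

    static inline uint64
    tid_to_uint48(ItemPointer tid)
    {
        return ((uint64) ItemPointerGetBlockNumberNoCheck(tid) << 16) |
               (uint64) ItemPointerGetOffsetNumberNoCheck(tid);
    }

    static inline void
    uint48_to_tid(uint64 val, ItemPointer tid)
    {
        ItemPointerSet(tid,
                       (BlockNumber) (val >> 16),
                       (OffsetNumber) (val & 0xFFFF));
    }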

-- 
Peter Geoghegan




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, May 3, 2021 at 12:25 PM Mark Dilger
>  wrote:
> > As things stand, all custom variables defined via the 
> > DefineCustom{Bool,Int,Real,String,Enum}Variable are placed in the 
> > CUSTOM_OPTIONS config_group.  We could add a role for controlling any SUSET 
> > CUSTOM_OPTIONS GUCs, or we could extend those functions to take a 
> > config_group option, or perhaps some of both.  I haven't thought too much 
> > yet about whether allowing extensions to place a custom GUC into one of the 
> > predefined groups would be problematic.  Any thoughts on that?
> 
> Well...
> 
> One idea would be to get rid of PGC_SUSET altogether and instead have
> a set of flags associated with each GUC, like PGF_SERVER_LOG,
> PGF_CORRUPT_DATA, PGF_CRASH_SERVER. Then you could associate those
> flags with particular predefined roles and grant them out to whoever
> you want.
> 
> So if a GUC is flagged PGF_SERVER_LOG|PGF_CRASH_SERVER, then the
> assumption is that it's security-sensitive because it both lets you
> alter the contents of the server log and also lets you crash the
> server. If you are granted both pg_server_log and pg_crash_server, you
> can set it, otherwise not.
> 
> This is just wild brainstorming, but my point is that I don't think
> doing it by options groups is particularly good, because it doesn't
> really have any relationship to why those things are marked SUSET in
> the first place. To take an example involving functions rather than
> GUCs, the pageinspect functions are super-user only because you can
> crash the server by inspecting malformed data that you supply as an
> arbitrary literal, but AFAIK the functions in pgstattuple have no
> similar hazard, and are just super-only because we don't really know
> who the superuser wants to authorize, and maybe it's not everybody. So
> those cases are really different, even though both are extensions. I
> think the same likely holds true for GUCs.

In general, I agree that we should be looking at predefined roles as
being similar to the Linux capabilities system- defining certain kinds
of operations which the user who has that role is allowed to do, and
then both in-core and extensions can make decisions based on what
capabilities the user has been GRANT'd.

Hopefully that would limit the amount of cases where a given capability
ends up being overly broad while at the same time allowing extensions to
sensibly be able to use the defined capabilities for their own needs.

As we do in other places, we should make it clear when a certain
capability allows a user with that capability to gain superuser access
as that may not always be clear to a user.

One thing that seems missing from this discussion and is part of what
paused my effort on the 'admin' role proposed towards the end of the
last cycle is that we really need to consider how this all plays with
ALTER SYSTEM and not just SUSET GUCs but also other (eg: POSTMASTER,
SIGHUP) GUCs.  That is- imv we should have a sensible solution for
more-or-less all GUCs and which would allow a non-superuser to be able
to set POSTMASTER and SIGHUP GUCs (and perhaps others..) through
ALTER SYSTEM.  

Thanks,

Stephen




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Jeff Davis
On Mon, 2021-05-03 at 10:38 -0700, Peter Geoghegan wrote:
> I don't think it's much good to just do that. You probably need a
> full
> 64-bits for something like a column store. But that's all you need.

I would definitely like that for citus columnar, and it would
definitely make it easier to manage the address space, but I won't
demand it today. 48 bits is a workable tuple address space for many
purposes, especially when you factor in logical partitioning.

I will be dealing with gaps though, so wasting 5 bits of address space
(2^16 / MaxOffsetNumber = 32) to bring it down to 43 bits is not great.

Regards,
Jeff Davis






Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Jeff Davis
On Mon, 2021-05-03 at 13:22 -0400, Robert Haas wrote:
> to look and work like heap TIDs; that is, there had better be a block
> number portion and an item number portion,

Right (at least for now).

> and the item number had
> better be smaller than MaxOffsetNumber,

That's not clear to me at all, and is the whole reason I began this
thread.

  a. You say "smaller than MaxOffsetNumber", but that's a little weird.
If an offset can't be MaxOffsetNumber, it's not really the maximum, is
it?
  b. If you actually meant "less than or equal to MaxOffsetNumber",
that will fail with the GIN posting list issue raised in my first
email. Do you agree that's a bug?
  c. Why can't we go all the way up to MovedPartitionsOffsetNumber - 1?
Right now, MaxOffsetNumber is poorly named, because it actually
represents a number slightly higher than the maximum number of
items that can fit on a page. That essentially wastes 5 bits of address
space for no obvious reason.

> and if you want bitmap scans
> to run reasonably quickly, the block number had also better
> correspond
> to physical locality to some degree.

Right (at least for now).

Regards,
Jeff Davis







Re: Incorrect snapshots while promoting hot standby node when 2PC is used

2021-05-03 Thread Andres Freund
Hi,

On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
> I'm investigating somewhat resemblant case.
> We have an OLTP sharded installation where shards are almost always under 
> rebalancing. Data movement is implemented with 2pc.
> Switchover happens quite often due to datacenter drills. The installation is 
> running on PostgreSQL 12.6.

If you still have the data it would be useful if you could check if the
LSNs of the corrupted pages are LSNs from shortly after standby
promotion/switchover?


> Can this case be related to the problem that you described?

It seems possible, but it's hard to know without a lot more information.


> Or, perhaps, it looks more like a hardware problem? Data_checksums are
> on, but few years ago we observed ssd firmware that was loosing
> updates, but passing checksums. I'm sure that we would benefit from
> having separate relation fork for checksums or LSNs.

Right - checksums are "page local". They can only detect if a page is
corrupted, not if e.g. an older version of the page (with correct
checksum) has been restored. While there are schemes to have stronger
error detection properties, they do come with substantial overhead (at
least the ones I can think of right now).


Greetings,

Andres Freund




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 10:22 AM Matthias van de Meent
 wrote:
> For IoT, as far as I know, one of the constraints is that there exists
> some unique constraint on the table, which also defines the ordering.
> Assuming that that is the case, we can use <unique key> + <inserting transaction id> to identify tuple versions.

Perhaps that's true in theory, but the resulting design seems likely
to be useless in the end. In any case I believe that systems that
generally use a heap but give you the choice of using an IoT (I'm
really thinking of Oracle) tend to not have many users that actually
avail of IoTs. On modern flash storage the trade-off made by an IoT or
clustered index design seems like the wrong one on average. You're
saving about 1 I/O on average with a PK lookup, which just isn't that
much of an upside compared to the many downsides.

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Jeff Davis
On Mon, 2021-05-03 at 09:59 -0700, Peter Geoghegan wrote:
> You don't accept any of that, though. Fair enough. I predict that
> avoiding making a hard choice will make Jeff's work here a lot
> harder,
> though.

For the purposes of this discussion, what's making my life difficult is
that we don't have a good definition for TID, leaving me with two
options:

  1. be overly conservative, accept MaxOffsetNumber=2048, wasting a
bunch of address space; or
  2. risk that the mapping between TID and row number could break at any
time

And compounding that, it seems that there's a bug in GIN that doesn't
honor MaxOffsetNumber, so actually neither of the rules above works
either. Instead, I need to use 2047 as the max offset number, which has
no real basis in the postgres design, but I'd be stuck with it for a
long time.

What I'm looking for is:
  * A declaration of what the actual maximum valid offset number is,
and that it will be stable enough to use for table AMs for now. (This
maximum valid offset number may or may not be called MaxOffsetNumber,
and may or may not be tied to the maximum number of items that fit on a
page.)
  * A confirmation that this GIN behavior is a bug that should be
fixed, now that there are table AMs in existence that need it fixed.

Even if we fix this in v15, we still need some guidance for what table
AMs should do in earlier versions.

If we change the way tuple IDs are represented or the table AM in v15
or beyond, that may require a REINDEX for indexes on some table AMs. As
long as we have some robust way to check that a REINDEX is necessary,
that's fine with me.

Regards,
Jeff Davis






Re: function for testing that causes the backend to terminate

2021-05-03 Thread Dave Cramer
Joe,

Thanks,

This works and I don't have to install anything!

Dave Cramer


On Thu, 29 Apr 2021 at 16:16, Joe Conway  wrote:

> On 4/29/21 6:56 AM, Dave Cramer wrote:
> > For testing unusual situations I'd like to be able to cause a backend to
> > terminate due to something like a segfault. Do we currently have this in
> > testing ?
>
> If you can run SQL as a superuser from that backend, try:
>
> COPY (SELECT pg_backend_pid())
>   TO PROGRAM 'xargs kill -SIGSEGV';
>
> HTH,
>
> Joe
>
> --
> Crunchy Data - http://crunchydata.com
> PostgreSQL Support for Secure Enterprises
> Consulting, Training, & Open Source Development
>


Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Chapman Flack
On 05/03/21 13:23, Robert Haas wrote:
> On Mon, May 3, 2021 at 11:45 AM Chapman Flack  wrote:
>> I guess I was thinking, but forgot to convey to the keyboard, that the
>> existence of a non-empty extra_check_hooks chain on a SUSET GUC (which
>> could only have been attached from a shared preload library) would
>> implicitly change SUSET to mean settable whenever accepted by the hook(s).
> 
> Sure, but the hook still needs a way to know which users are entitled
> to set the GUC.

I was contemplating a version 0 with only that minimal support in core
for allowing a shared preload library to set such hooks (and allowing
include-with-a-role in config files), assuming service providers already
do sophisticated building of stuff to construct the environments they
provide, and a C shared object with hooks that enforce their designed
constraints wouldn't be an onerous burden on top of that.

Such providers could then be the laboratories of democracy building
various forms of such things, and if one of those ends up having a
reasonably general configuration mechanism and gets offered as a
contrib module later or for inclusion in core, well, that's version 0.5.

Regards,
-Chap




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 10:22 AM Robert Haas  wrote:
> I don't really think so, or at least I don't see a reason why it
> should. As things stand today, I don't think it's possible for a table
> AM author to make any other choice than to assume that their TIDs have
> to look and work like heap TIDs; that is, there had better be a block
> number portion and an item number portion, and the item number had
> better be smaller than MaxOffsetNumber, and if you want bitmap scans
> to run reasonably quickly, the block number had also better correspond
> to physical locality to some degree. It's not clear to me how exactly
> someone would go about fixing all of that, but I think it would be
> great if they did. Even if that person wanted to assume for purposes
> of their own patch that fixed-width, integer-like TIDs are the only
> thing we care about, that would be fine with me. Getting to a point
> where the available 48 bits can be used in whatever way the table AM
> author wants is clearly better than what we have now.

I don't think it's much good to just do that. You probably need a full
64-bits for something like a column store. But that's all you need.

> Now I'm personally of the opinion that we shouldn't be content to stop
> there, but so what? I'm not insisting that Jeff or anyone else has to
> work on this problem, or that they have to fix more of it rather than
> less. I hope that nobody's going to try to back us into a corner by
> making design decisions that deliberately complicate the possibility
> of future improvements in that area, and that's about it. I don't
> really understand why you think that's unreasonable, or even
> problematic. I can't see that any way in which the assumption that we
> will NEVER want to further generalize the TID concept simplifies
> anything anyone wants to do today.

It creates ambiguity of the kind that deters related improvements. I
for one am not comfortable with (say) working on generalizing TID to
the extent required to facilitate Jeff's work if that obligates me to
make some legalistic and wholly insincere statement about future
improvements to the definition of TID still being quite possible (to
facilitate indirect indexes, or whatever). The truth is that I cannot
possibly know if facilitating Jeff's work in the short term blocks off
other things in the long term -- because I don't actually have a clue
how these other things could ever really be implemented sensible in
any case.

-- 
Peter Geoghegan




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Robert Haas
On Mon, May 3, 2021 at 12:25 PM Mark Dilger
 wrote:
> As things stand, all custom variables defined via the 
> DefineCustom{Bool,Int,Real,String,Enum}Variable are placed in the 
> CUSTOM_OPTIONS config_group.  We could add a role for controlling any SUSET 
> CUSTOM_OPTIONS GUCs, or we could extend those functions to take a 
> config_group option, or perhaps some of both.  I haven't thought too much yet 
> about whether allowing extensions to place a custom GUC into one of the 
> predefined groups would be problematic.  Any thoughts on that?

Well...

One idea would be to get rid of PGC_SUSET altogether and instead have
a set of flags associated with each GUC, like PGF_SERVER_LOG,
PGF_CORRUPT_DATA, PGF_CRASH_SERVER. Then you could associate those
flags with particular predefined roles and grant them out to whoever
you want.

So if a GUC is flagged PGF_SERVER_LOG|PGF_CRASH_SERVER, then the
assumption is that it's security-sensitive because it both lets you
alter the contents of the server log and also lets you crash the
server. If you are granted both pg_server_log and pg_crash_server, you
can set it, otherwise not.
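
To make that concrete, something along these lines (none of these flags or
roles exist today; has_privs_of_role() is the only real function used in
the sketch):

    #define PGF_SERVER_LOG      (1 << 0)    /* can affect what reaches the log */
    #define PGF_CORRUPT_DATA    (1 << 1)    /* can corrupt data */
    #define PGF_CRASH_SERVER    (1 << 2)    /* can crash the server */

    typedef struct GucCapability
    {
        int     flag;           /* one of the PGF_* bits */
        Oid     roleid;         /* predefined role granting that capability */
    } GucCapability;

    static bool
    role_can_set_guc(Oid roleid, int guc_flags,
                     const GucCapability *caps, int ncaps)
    {
        for (int i = 0; i < ncaps; i++)
        {
            if ((guc_flags & caps[i].flag) &&
                !has_privs_of_role(roleid, caps[i].roleid))
                return false;   /* lacks one of the required capabilities */
        }
        return true;
    }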

This is just wild brainstorming, but my point is that I don't think
doing it by options groups is particularly good, because it doesn't
really have any relationship to why those things are marked SUSET in
the first place. To take an example involving functions rather than
GUCs, the pageinspect functions are super-user only because you can
crash the server by inspecting malformed data that you supply as an
arbitrarily literal, but AFAIK the functions in pgstattuple have no
similar hazard, and are just super-only because we don't really know
who the superuser wants to authorize, and maybe it's not everybody. So
those cases are really different, even though both are extensions. I
think the same likely holds true for GUCs.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Robert Haas
On Mon, May 3, 2021 at 11:45 AM Chapman Flack  wrote:
> I guess I was thinking, but forgot to convey to the keyboard, that the
> existence of a non-empty extra_check_hooks chain on a SUSET GUC (which
> could only have been attached from a shared preload library) would
> implicitly change SUSET to mean settable whenever accepted by the hook(s).

Sure, but the hook still needs a way to know which users are entitled
to set the GUC.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Matthias van de Meent
On Mon, 3 May 2021 at 19:00, Peter Geoghegan  wrote:
>
> On Mon, May 3, 2021 at 9:45 AM Robert Haas  wrote:
> > But if you're saying those identifiers have to be fixed-width and 48
> > (or even 64) bits, I disagree that we wish to have such a requirement
> > in perpetuity.
>
> Once you require that TID-like identifiers must point to particular
> versions (as opposed to particular logical rows), you also virtually
> require that the identifiers must always be integer-like (though not
> necessarily block-based and not necessarily 6 bytes). You've
> practically ensured that clustered index tables (and indirect indexes)
> will never be possible by accepting this.

For IoT, as far as I know, one of the constraints is that there exists
some unique constraint on the table, which also defines the ordering.
Assuming that that is the case, we can use <unique key> + <inserting transaction id> to identify tuple versions.

With regards,

Matthias van de Meent.




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Robert Haas
On Mon, May 3, 2021 at 1:00 PM Peter Geoghegan  wrote:
> You don't accept any of that, though. Fair enough. I predict that
> avoiding making a hard choice will make Jeff's work here a lot harder,
> though.

I don't really think so, or at least I don't see a reason why it
should. As things stand today, I don't think it's possible for a table
AM author to make any other choice than to assume that their TIDs have
to look and work like heap TIDs; that is, there had better be a block
number portion and an item number portion, and the item number had
better be smaller than MaxOffsetNumber, and if you want bitmap scans
to run reasonably quickly, the block number had also better correspond
to physical locality to some degree. It's not clear to me how exactly
someone would go about fixing all of that, but I think it would be
great if they did. Even if that person wanted to assume for purposes
of their own patch that fixed-width, integer-like TIDs are the only
thing we care about, that would be fine with me. Getting to a point
where the available 48 bits can be used in whatever way the table AM
author wants is clearly better than what we have now.

Now I'm personally of the opinion that we shouldn't be content to stop
there, but so what? I'm not insisting that Jeff or anyone else has to
work on this problem, or that they have to fix more of it rather than
less. I hope that nobody's going to try to back us into a corner by
making design decisions that deliberately complicate the possibility
of future improvements in that area, and that's about it. I don't
really understand why you think that's unreasonable, or even
problematic. I can't see that any way in which the assumption that we
will NEVER want to further generalize the TID concept simplifies
anything anyone wants to do today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 9:45 AM Robert Haas  wrote:
> But if you're saying those identifiers have to be fixed-width and 48
> (or even 64) bits, I disagree that we wish to have such a requirement
> in perpetuity.

Once you require that TID-like identifiers must point to particular
versions (as opposed to particular logical rows), you also virtually
require that the identifiers must always be integer-like (though not
necessarily block-based and not necessarily 6 bytes). You've
practically ensured that clustered index tables (and indirect indexes)
will never be possible by accepting this. Those designs are the only
real reason to have truly variable-length TID-like identifiers IMV (as
opposed to 2 or perhaps even 3 standard TID widths).

You don't accept any of that, though. Fair enough. I predict that
avoiding making a hard choice will make Jeff's work here a lot harder,
though.

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Robert Haas
On Mon, May 3, 2021 at 11:26 AM Peter Geoghegan  wrote:
> It just has to be able to accept the restriction that
> indexes must have a unique TID-like identifier for each version (not
> quite a version actually -- whatever the equivalent of a HOT chain
> is). This is a restriction that Jeff had pretty much planned on
> working within before starting this thread (I know this because we
> spoke about it privately).

Well, I think what I'm saying is that I'm not on board with such a restriction.

If you're just saying that it has to be possible to identify rows
somehow, I am in full agreement, and I think the universe is on board
as well.

But if you're saying those identifiers have to be fixed-width and 48
(or even 64) bits, I disagree that we wish to have such a requirement
in perpetuity.

That'd be like going around to automobile manufacturers in 1925 and
asking them to agree that all future cars ever manufactured must have
a clutch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_amcheck contrib application

2021-05-03 Thread Robert Haas
On Fri, Apr 30, 2021 at 5:07 PM Robert Haas  wrote:
> On Fri, Apr 30, 2021 at 4:26 PM Mark Dilger
>  wrote:
> > After further reflection, no other verbiage seems any better.  I'd say go 
> > ahead and commit it this way.
>
> OK. I'll plan to do that on Monday, barring objections.

Done now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Mark Dilger



> On May 3, 2021, at 8:22 AM, Robert Haas  wrote:
> 
> One problem with having a separate predefined role for every PGC_SUSET
> GUC is that it's no help for extensions. Both auto_explain and
> pg_stat_statements have such GUCs, and there may be out-of-core
> extensions that do as well. We should try to come up with a system
> that doesn't leave them out in the cold.

As things stand, all custom variables defined via the 
DefineCustom{Bool,Int,Real,String,Enum}Variable are placed in the 
CUSTOM_OPTIONS config_group.  We could add a role for controlling any SUSET 
CUSTOM_OPTIONS GUCs, or we could extend those functions to take a config_group 
option, or perhaps some of both.  I haven't thought too much yet about whether 
allowing extensions to place a custom GUC into one of the predefined groups 
would be problematic.  Any thoughts on that?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 8:03 AM Robert Haas  wrote:
> It's reasonable to wonder. I think it depends on whether the problem
> is bloat or just general slowness. To the extent that the problem is
> bloat, bottom-index deletion will help a lot, but it's not going to
> help with slowness because, as you say, we still have to dirty the
> pages. And I am pretty confident that slowness is a very significant
> part of the problem here.

It's all going to depend on workload of course -- we'll need to wait
and see what users still complain about with Postgres 14 to really
have some idea. You only freshly dirty those leaf pages that weren't
already dirty, and the costs will be much more linear, so it's a
complicated question.

Here is a more modest statement that might be more convincing: The
*relative* importance of making something like HOT more robust to
things like long-running xacts has increased now that we have
bottom-up index deletion. We could improve things here by adding
something like zheap, which allows a HOT chain to mostly live in UNDO,
and therefore pretty much become arbitrarily long. This seems
plausible because users will accept that UPDATEs that modify one or
more indexed columns kinda suck, as long as there is never any truly
pathological performance. Whereas users will not easily accept that
HOT (or something like it) doesn't quite work well enough to make
relation sizes stable when they avoid updating indexed columns.

I don't think that even the absence of UPDATEs that logically modify
indexes and the absence of long running transactions (that hold up
cleanup) is sufficient to make HOT work well enough to keep table
sizes stable over time. Minor inefficiencies (e.g. LP_DEAD line
pointer bloat) will tend to aggregate over time, leading to heap
fragmentation.

-- 
Peter Geoghegan




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Chapman Flack
On 05/03/21 11:22, Robert Haas wrote:
>> The GUC system would have to expose a way for the shared object to
>> chain extra_check_hooks off existing GUCs. An extra_check_hook can check
>> both the value and the role of the caller.
> 
> I think there are two parts to this problem. First, the SP needs to be
> able to delegate to some users but not others the ability to set
> superuser GUCs. Second, the SP needs to be able to control which GUCs
> the privileged users get to set and perhaps to what values. A hook of
> the type you propose here seems like it might work reasonably well for
> that second part, but it's not totally obvious to me how it helps with
> the first part.

I guess I was thinking, but forgot to convey to the keyboard, that the
existence of a non-empty extra_check_hooks chain on a SUSET GUC (which
could only have been attached from a shared preload library) would
implicitly change SUSET to mean settable whenever accepted by the hook(s).

Regards,
-Chap




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Peter Geoghegan
On Mon, May 3, 2021 at 7:41 AM Robert Haas  wrote:
> So here. The complexity of getting a table AM that does anything
> non-trivial working is formidable, and I don't expect it to happen
> right away. Picking one that is essentially block-based and can use
> 48-bit TIDs is very likely the right initial target because that's the
> closest we have now, and there's no sense attacking the hardest
> variant of the problem first.

It doesn't have to be block-based -- that's not what Jeff is
proposing. It just has to be able to accept the restriction that
indexes must have a unique TID-like identifier for each version (not
quite a version actually -- whatever the equivalent of a HOT chain
is). This is a restriction that Jeff had pretty much planned on
working within before starting this thread (I know this because we
spoke about it privately).

It's quite possible to rule out an index-organized table design
without ruling out a column store with logical TID-like identifiers,
that aren't block-based. It's fair to wonder if not tightening up the
rules for TID-like identifiers is actually helping table AM authors in
practice. I think it's actually making things harder.

-- 
Peter Geoghegan




Re: Granting control of SUSET gucs to non-superusers

2021-05-03 Thread Robert Haas
On Sat, May 1, 2021 at 12:37 PM Chapman Flack  wrote:
> Maybe version 0 is where the provider just builds a shared object
> to go in shared_preload_libraries. The provider has probably already
> done a bunch of other stuff more challenging than that.
>
> The GUC system would have to expose a way for the shared object to
> chain extra_check_hooks off existing GUCs. An extra_check_hook can check
> both the value and the role of the caller.

I think there are two parts to this problem. First, the SP needs to be
able to delegate to some users but not others the ability to set
superuser GUCs. Second, the SP needs to be able to control which GUCs
the privileged users get to set and perhaps to what values. A hook of
the type you propose here seems like it might work reasonably well for
that second part, but it's not totally obvious to me how it helps with
the first part.

Instead of going to the extreme of one predefined role per GUC, maybe
we could see if the PGC_SUSET GUCs could be divided into buckets based
on the reason they are so marked? For example, log_parser_stats,
log_planner_stats, log_executor_stats, log_statement_stats,
log_btree_build_stats, trace_locks, trace_userlocks, trace_lwlocks,
log_min_duration_statement, and a bunch of others are probably all
SUSET just on the theory that only the superuser should have the right
to control what ends up in the log. But we could make a predefined
role that represents the right to control what ends up in the log, and
then all of those GUCs could be tied to that role. Is that too
coarse-grained? It might be.

One problem with having a separate predefined role for every PGC_SUSET
GUC is that it's no help for extensions. Both auto_explain and
pg_stat_statements have such GUCs, and there may be out-of-core
extensions that do as well. We should try to come up with a system
that doesn't leave them out in the cold.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Robert Haas
On Fri, Apr 30, 2021 at 6:19 PM Peter Geoghegan  wrote:
> A remaining problem is that we must generate a new round of index
> tuples for each and every index when only one indexed column is
> logically modified by an UPDATE statement. I think that this is much
> less of a problem now due to bottom-up index deletion. Sure, it sucks
> that we still have to dirty the page at all. But it's nevertheless
> true that it all but eliminates version-driven page splits, which are
> where almost all of the remaining downside is. It's very reasonable to
> now wonder if this particular all-indexes problem is worth solving at
> all in light of that. (Modern hardware characteristics also make a
> comprehensive fix less valuable in practice.)

It's reasonable to wonder. I think it depends on whether the problem
is bloat or just general slowness. To the extent that the problem is
bloat, bottom-index deletion will help a lot, but it's not going to
help with slowness because, as you say, we still have to dirty the
pages. And I am pretty confident that slowness is a very significant
part of the problem here. It's pretty common for people migrating from
another database system to have, for example, a table with 10 indexes
and then repeatedly update a column that is covered by only one of
those indexes. Now, with bottom-up index deletion, this should cause a
lot less bloat, and that's good. But you still have to update all 10
indexes in the foreground, and that's bad, because the alternative is
to find just the one affected index and update it twice -- once to
insert the new tuple, and a second time to delete-mark the old tuple.
10 is a lot more than 2, and that's even ignoring the cost of deferred
cleanup on the other 9 indexes. So I don't really expect this to get
us out of the woods. Somebody whose workload runs five times slower on
a pristine data load is quite likely to give up on using PostgreSQL
before bloat even enters the picture.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: strange error reporting

2021-05-03 Thread Tom Lane
Robert Haas  writes:
> On Mon, May 3, 2021 at 6:08 AM Peter Eisentraut
>  wrote:
>> Throwing the socket address in there seems a bit distracting and
>> misleading, and it also pushes off the actual information very far to
>> the end.  (Also, in some cases the socket path is very long, making the
>> actual information even harder to find.)  By the time you get to this
>> error, you have already connected, so mentioning the server address
>> seems secondary at best.

> It feels a little counterintuitive to me too but I am nevertheless
> inclined to believe that it's an improvement. When multi-host
> connection strings are used, the server address may not be clear. In
> fact, even when they're not, it may not be clear to a new user that
> socket communication is used, and it may not be clear where the socket
> is located.

Yeah.  The specific problem I'm concerned about solving here is
"I wasn't connecting to the server I thought I was", which could be
a contributing factor in almost any connection-time failure.  The
multi-host-connection-string feature made that issue noticeably worse,
but surely we've all seen trouble reports that boiled down to that
even before that feature came in.

As you say, we could perhaps redesign the messages to provide this
info in another order.  But it'd be difficult, and I think it might
come out even more confusing in cases where libpq tried several
servers on the way to finally failing.  The old code's error
reporting for such cases completely sucked, whereas now you get
a reasonably complete trace of the attempts.  As a quick example,
for a case of bad hostname followed by wrong port:

$ psql -d "host=foo1,sss2 port=5432,5342"
psql: error: could not translate host name "foo1" to address: Name or service not known
connection to server at "sss2" (192.168.1.48), port 5342 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?

v13 renders this as

$ psql -d "host=foo1,sss2 port=5432,5342"
psql: error: could not translate host name "foo1" to address: Name or service not known
could not connect to server: Connection refused
Is the server running on host "sss2" (192.168.1.48) and accepting
TCP/IP connections on port 5342?

Now, of course the big problem there is the lack of consistency about
how the two errors are laid out; but I'd argue that putting the
server identity info first is better than putting it later.

Also, if you experiment with other cases such as some of the servers
complaining about wrong user name, the old behavior is even harder
to follow about which server said what.

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-05-03 Thread Robert Haas
On Fri, Apr 30, 2021 at 5:22 PM Peter Geoghegan  wrote:
> I strongly suspect that index-organized tables (or indirect indexes,
> or anything else that assumes that TID-like identifiers map directly
> to logical rows as opposed to physical versions) are going to break
> too many assumptions to ever be tractable. Assuming I have that right,
> it would advance the discussion if we could all agree on that being a
> non-goal for the tableam interface in general.

I *emphatically* disagree with the idea of ruling such things out
categorically. This is just as naive as the TODO's statement that we
do not want "All backends running as threads in a single process".
Does anyone really believe that we don't want that any more? I
believed it 10 years ago, but not any more. It's costing us very
substantially not only in that it makes parallel query more
complicated and fragile, but more importantly in that we can't scale
up to connection counts that other databases can handle because we use
up too many operating system resources. Supporting threading in
PostgreSQL isn't a project that someone will pull off over a long
weekend and it's not something that has to be done tomorrow, but it's
pretty clearly the future.

So here. The complexity of getting a table AM that does anything
non-trivial working is formidable, and I don't expect it to happen
right away. Picking one that is essentially block-based and can use
48-bit TIDs is very likely the right initial target because that's the
closest we have now, and there's no sense attacking the hardest
variant of the problem first. However, as with the
threads-vs-processes example, I strongly suspect that having only one
table AM is leaving vast amounts of performance on the table. To say
that we're never going to pursue the parts of that space that require
a different kind of tuple identifier is to permanently write off tons
of ideas that have produced promising results in other systems. Let's
not do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Lowering the ever-growing heap->pd_lower

2021-05-03 Thread Matthias van de Meent
On Mon, 3 May 2021 at 16:26, John Naylor  wrote:
>
> On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan  wrote:
> > I would like to deal with this work within the scope of the project
> > we're discussing over on the "New IndexAM API controlling index vacuum
> > strategies" thread. The latest revision of that patch series includes
> > a modified version of your patch:
> >
> > https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopjmhfqj_g1raj4gwr3z...@mail.gmail.com
> >
> > Please take discussion around this project over to that other thread.
> > There are a variety of issues that can only really be discussed in
> > that context.
>
> Since that work has been committed as of 3c3b8a4b2689, I've marked this CF 
> entry as committed.

I disagree that this work has been fully committed. A derivative was
committed that would solve part of the problem, but it doesn't cover
all problem cases. I believe that I voiced such concern in the other
thread as well. As such, I am planning to update this patch sometime
before the next commitfest so that we can truncate the LP array
during HOT pruning as well, instead of only doing so in the second
VACUUM pass. This is especially relevant on pages where HOT is highly
effective but VACUUM can't keep up, so many unused (former HOT) line
pointers linger on the page.
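
For illustration, a minimal sketch of the truncation step itself, assuming the caller holds a cleanup lock on the heap buffer (this is roughly what the already-committed VACUUM-side code does; the HOT-pruning call site is the part that still needs to be written):

	/* page is the heap page being pruned */
	OffsetNumber nline = PageGetMaxOffsetNumber(page);

	/* shrink pd_lower while the trailing line pointers are LP_UNUSED */
	while (nline > 0 && !ItemIdIsUsed(PageGetItemId(page, nline)))
		nline--;

	((PageHeader) page)->pd_lower =
		SizeOfPageHeaderData + nline * sizeof(ItemIdData);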

With regards,

Matthias van de Meent




Re: Identify missing publications from publisher while create/alter subscription.

2021-05-03 Thread vignesh C
On Mon, May 3, 2021 at 11:11 AM Bharath Rupireddy
 wrote:
>
> On Sun, May 2, 2021 at 10:04 PM vignesh C  wrote:
> > > 5) Instead of adding a new file 021_validate_publications.pl for
> > > tests, spawning a new test database which would make the overall
> > > regression slower, can't we add with the existing database nodes in
> > > 0001_rep_changes.pl? I would suggest adding the tests in there even if
> > > the number of tests are many, I don't mind.
> >
> > 001_rep_changes.pl has the changes mainly for checking the replicated
> > data. I did not find an appropriate file in the current tap tests, I
> > preferred these tests to be in a separate file. Thoughts?
>
> If 001_rep_changes.pl is not the right place, how about adding them
> into 007_ddl.pl? That file seems to be only for DDL changes, and since
> the feature tests cases are for CREATE/ALTER SUBSCRIPTION, it's the
> right place. I strongly feel that we don't need a new file for these
> tests.
>

Modified.

> Comment on the tests:
> 1) Instead of "pub_doesnt_exist" name, how about "non_existent_pub" or
> just pub_non_existent" or some other?
> +   "CREATE SUBSCRIPTION testsub2 CONNECTION '$publisher_connstr'
> PUBLICATION pub_doesnt_exist WITH (VALIDATE_PUBLICATION = TRUE)"
> The error message with this name looks a bit odd to me.
> + /ERROR:  publication "pub_doesnt_exist" does not exist in
> the publisher/,

Modified.

Thanks for the comments; they are handled in the v7 patch
posted in my earlier mail.

Regards,
Vignesh




Re: Identify missing publications from publisher while create/alter subscription.

2021-05-03 Thread vignesh C
On Mon, May 3, 2021 at 1:46 PM Dilip Kumar  wrote:
>
> On Mon, May 3, 2021 at 10:48 AM Dilip Kumar  wrote:
> >
> > On Sun, May 2, 2021 at 10:04 PM vignesh C  wrote:
> > >
> > > Thanks for the comments.
> > > The Attached patch has the fixes for the same.
> >
> > I was reviewing the documentation part, I think in the below paragraph
> > we should include validate_publication as well?
> >
> >
> > connect (boolean)
> > 
> >  
> >   Specifies whether the CREATE SUBSCRIPTION
> >   should connect to the publisher at all.  Setting this to
> >   false will change default values of
> >   enabled, create_slot and
> >   copy_data to false.
> >  
> >

Modified.

> > I will review/test the other parts of the patch and let you know.
>
> I have reviewed it and it mostly looks good to me.  I have some minor
> suggestions though.
>
> 1.
> +/*
> + * Check the specified publication(s) is(are) present in the publisher.
> + */
>
> vs
>
> +
> +/*
> + * Connect to the publisher and check if the publications exist.
> + */
>
> I think the formatting of the comments is not uniform.  Some places
> we are using "publication(s) is(are)" whereas other places are just
> "publications".
>

Modified.

> 2. Add an error case for connect=false and VALIDATE_PUBLICATION = true

Added.

Thanks for the comments, attached v7 patch has the fixes for the same.
Thoughts?

Regards,
Vignesh
From 09ae1cac60320bde08a6c380d3637eecdbb3d6ff Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Wed, 7 Apr 2021 22:05:53 +0530
Subject: [PATCH v7] Identify missing publications from publisher while
 create/alter subscription.

Creating/altering subscription is successful when we specify a publication which
does not exist in the publisher. This patch checks if the specified publications
are present in the publisher and throws an error if any of the publication is
missing in the publisher.
---
 doc/src/sgml/ref/alter_subscription.sgml  |  13 ++
 doc/src/sgml/ref/create_subscription.sgml |  18 +-
 src/backend/commands/subscriptioncmds.c   | 234 +++---
 src/bin/psql/tab-complete.c   |   7 +-
 src/test/subscription/t/007_ddl.pl|  68 ++-
 5 files changed, 305 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac814f4..81e156437b 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -160,6 +160,19 @@ ALTER SUBSCRIPTION name RENAME TO <
  
 

+
+   
+validate_publication (boolean)
+
+ 
+  When true, the command verifies if all the specified publications
+  that are being subscribed to are present in the publisher and throws
+  an error if any of the publication doesn't exist. The default is
+  false.
+ 
+
+   
+
   
 

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812beee37..cad9285c16 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -207,8 +207,9 @@ CREATE SUBSCRIPTION subscription_nameCREATE SUBSCRIPTION
   should connect to the publisher at all.  Setting this to
   false will change default values of
-  enabled, create_slot and
-  copy_data to false.
+  enabled, create_slot,
+  copy_data and
+  validate_publication to false.
  
 
  
@@ -239,6 +240,19 @@ CREATE SUBSCRIPTION subscription_name
 

+
+   
+validate_publication (boolean)
+
+ 
+  When true, the command verifies if all the specified publications
+  that are being subscribed to are present in the publisher and throws
+  an error if any of the publication doesn't exist. The default is
+  false.
+ 
+
+   
+
   
 

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 517c8edd3b..3de4488e4d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,9 @@ parse_subscription_options(List *options,
 		   char **synchronous_commit,
 		   bool *refresh,
 		   bool *binary_given, bool *binary,
-		   bool *streaming_given, bool *streaming)
+		   bool *streaming_given, bool *streaming,
+		   bool *validate_publication_given,
+		   bool *validate_publication)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -111,6 +113,12 @@ parse_subscription_options(List *options,
 		*streaming = false;
 	}
 
+	if (validate_publication)
+	{
+		*validate_publication_given = false;
+		*validate_publication = false;
+	}
+
 	/* Parse options */
 	foreach(lc, options)
 	{
@@ -215,6 +223,17 @@ parse_subscription_options(List *options,
 			*streaming_given = tr

Re: Lowering the ever-growing heap->pd_lower

2021-05-03 Thread John Naylor
On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan  wrote:
> I would like to deal with this work within the scope of the project
> we're discussing over on the "New IndexAM API controlling index vacuum
> strategies" thread. The latest revision of that patch series includes
> a modified version of your patch:
>
>
https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopjmhfqj_g1raj4gwr3z...@mail.gmail.com
>
> Please take discussion around this project over to that other thread.
> There are a variety of issues that can only really be discussed in
> that context.

Since that work has been committed as of 3c3b8a4b2689, I've marked this CF
entry as committed.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: strange error reporting

2021-05-03 Thread Robert Haas
On Mon, May 3, 2021 at 6:08 AM Peter Eisentraut
 wrote:
> I find these new error messages to be more distracting than before in
> some cases.  For example:
>
> PG13:
>
> clusterdb: error: could not connect to database typo: FATAL:  database "typo" does not exist
>
> PG14:
>
> clusterdb: error: connection to server on socket "/tmp/.s.PGSQL.65432" failed: FATAL:  database "typo" does not exist
>
> Throwing the socket address in there seems a bit distracting and
> misleading, and it also pushes off the actual information very far to
> the end.  (Also, in some cases the socket path is very long, making the
> actual information even harder to find.)  By the time you get to this
> error, you have already connected, so mentioning the server address
> seems secondary at best.

It feels a little counterintuitive to me too but I am nevertheless
inclined to believe that it's an improvement. When multi-host
connection strings are used, the server address may not be clear. In
fact, even when they're not, it may not be clear to a new user that
socket communication is used, and it may not be clear where the socket
is located. New users may not even realize that there's a socket
involved; I certainly didn't know that for quite a while. It's a lot
harder for the database name to be unclear, because a particular
connection attempt will never try more than one, and because when
it's relevant to understanding why the connection failed, the server
will hopefully include it in the message string anyway, as here. So
the PG13 message is really kind of silly: it tells us the same thing
twice, which we must already know, instead of telling us something
that we might not know.

It might be more intuitive in some ways if the socket information were
demoted to the end of the message, but I think we'd lose more than we
gained. The standard way of reporting someone else's error is
basically "what i have to say about the problem: %s" and that's
exactly what we're doing here. We could find some way of gluing the
information about the socket onto the end of the server message, but
it seems unclear how to do that in a way that looks natural, and it
would depart from our usual practice. So even though I also find this
to be a bit distracting, I think we should just live with it, because
everything else seems worse.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2021-05-03 Thread Zhihong Yu
On Mon, May 3, 2021 at 5:25 AM Masahiko Sawada 
wrote:

> On Sun, May 2, 2021 at 1:23 AM Zhihong Yu  wrote:
> >
> >
> >
> > On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada 
> wrote:
> >>
> >> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu  wrote:
> >> >
> >> > Hi,
> >> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
> >>
> >> Thank you for reviewing the patch!
> >>
> >> >
> >> > With this commit, the foreign server modified within the transaction
> marked as 'modified'.
> >> >
> >> > transaction marked -> transaction is marked
> >>
> >> Will fix.
> >>
> >> >
> >> > +#define IsForeignTwophaseCommitRequested() \
> >> > +(foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
> >> >
> >> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the
> macro should be named: IsForeignTwophaseCommitRequired.
> >>
> >> But even if foreign_twophase_commit is
> >> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
> >> there is only one modified server, right? It seems the name
> >> IsForeignTwophaseCommitRequested is fine.
> >>
> >> >
> >> > +static bool
> >> > +checkForeignTwophaseCommitRequired(bool local_modified)
> >> >
> >> > +   if (!ServerSupportTwophaseCommit(fdw_part))
> >> > +   have_no_twophase = true;
> >> > ...
> >> > +   if (have_no_twophase)
> >> > +   ereport(ERROR,
> >> >
> >> > It seems the error case should be reported within the loop. This way,
> we don't need to iterate the other participant(s).
> >> > Accordingly, nserverswritten should be incremented for local server
> prior to the loop. The condition in the loop would become if
> (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
> >> > have_no_twophase is no longer needed.
> >>
> >> Hmm, I think if we process one non-2PC-capable server first and then
> >> another 2PC-capable server, we should raise an error but
> >> cannot detect that.
> >
> >
> > Then the check would stay as what you have in the patch:
> >
> >   if (!ServerSupportTwophaseCommit(fdw_part))
> >
> > When the non-2pc-capable server is encountered, we would report the
> error in place (following the ServerSupportTwophaseCommit check) and come
> out of the loop.
> > have_no_twophase can be dropped.
>
> But if we processed only one non-2pc-capable server, we would raise an
> error but should not in that case.
>
> On second thought, I think we can track how many servers are modified
> or not capable of 2PC during registration and unregistration. Then we
> can determine both whether 2PC is required and whether a non-2PC-capable
> server is involved without looking through all participants. Thoughts?
>

That is something worth trying.

Thanks


>
> Regards,
>
> --
> Masahiko Sawada
> EDB:  https://www.enterprisedb.com/
>


Re: Replication slot stats misgivings

2021-05-03 Thread Amit Kapila
On Mon, May 3, 2021 at 5:48 PM Masahiko Sawada  wrote:
>
> On Mon, May 3, 2021 at 2:27 PM Amit Kapila  wrote:
> >
> > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > >
> > > > > I am not sure if any of these alternatives are a good idea. What do
> > > > > you think? Do you have any other ideas for this?
> > > >
> > > > I've been considering some ideas but don't come up with a good one
> > > > yet. It’s just an idea and not tested but how about having
> > > > CreateDecodingContext() register before_shmem_exit() callback with the
> > > > decoding context to ensure that we send slot stats even on
> > > > interruption. And FreeDecodingContext() cancels the callback.
> > > >
> > >
> > > Is it a good idea to send stats while exiting and rely on the same? I
> > > think before_shmem_exit is mostly used for the cleanup purpose so not
> > > sure if we can rely on it for this purpose. I think we can't be sure
> > > that in all cases we will send all stats, so maybe Vignesh's patch is
> > > sufficient to cover the cases where we avoid losing it in cases where
> > > we would have sent a large amount of data.
> > >
> >
> > Sawada-San, any thoughts on this point?
>
> before_shmem_exit is mostly used for cleanup purposes, but how about
> on_shmem_exit()? pgstats relies on that to send stats on
> interruption. See pgstat_shutdown_hook().
>

Yeah, that is worth trying. Would you like to give it a try? I think
it still might not cover the cases where we error out in the backend
while decoding via APIs, because at that time we won't exit; maybe for
those cases we can consider Vignesh's patch.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Allow CustomScan nodes to signal projection support

2021-05-03 Thread Aleksander Alekseev
Hi Sven,

> The attached patch allows CustomScan nodes to signal whether they
> support projection.

I noticed that you didn't change custom-scan.sgml accordingly. The
updated patch is attached. Otherwise, it seems to be fine in terms of
compiling, passing tests etc.

> I named the flag CUSTOMPATH_SUPPORT_PROJECTION similar to the other
> custom node flags, but this would revert the current logic

This seems to be a typical Kobayashi Maru situation, i.e. any choice is
a bad one. I suggest keeping the patch as is and hoping that the
developers of existing extensions read the release notes.
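
If the flag keeps the name used in the patch, the extension-side change would presumably be a one-liner when the path is built (sketch only):

	CustomPath *cpath = makeNode(CustomPath);

	/* ... fill in the usual CustomPath fields ... */
	cpath->flags |= CUSTOMPATH_SUPPORT_BACKWARD_SCAN;	/* existing flag */
	cpath->flags |= CUSTOMPATH_SUPPORT_PROJECTION;		/* flag proposed here */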

-- 
Best regards,
Aleksander Alekseev


v2-0001-costumscan_projection_flag.patch
Description: Binary data


Re: Enhanced error message to include hint messages for redundant options error

2021-05-03 Thread vignesh C
On Mon, May 3, 2021 at 12:08 PM Bharath Rupireddy
 wrote:
>
> On Sun, May 2, 2021 at 8:44 PM vignesh C  wrote:
> >
> > On Sat, May 1, 2021 at 9:02 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Sat, May 1, 2021 at 7:25 PM vignesh C  wrote:
> > > > > > I'm not attaching above one line change as a patch, maybe Vignesh 
> > > > > > can
> > > > > > merge this into the main patch.
> > > >
> > > > Thanks for the comments. I have merged the change into the attached 
> > > > patch.
> > > > Thoughts?
> > >
> > > Thanks! v4 basically LGTM. Can we park this in the current commitfest
> > > if not done already?
> > >
> > > Upon looking at the number of places where we have the "option \"%s\"
> > > specified more than once" error, I, now strongly feel that we should
> > > use goto duplicate_error approach like in compute_common_attribute, so
> > > that we will have only one ereport(ERROR. We can change it in
> > > following files: copy.c, dbcommands.c, extension.c,
> > > compute_function_attributes, sequence.c, subscriptioncmds.c,
> > > typecmds.c, user.c, walsender.c, pgoutput.c. This will reduce the LOC
> > > greatly.
> > >
> > > Thoughts?
> >
> > I have made the changes for this, I have posted the same in the v5
> > patch posted in my earlier mail.
>
> Thanks! The v5 patch looks good to me. Let's see if all agree on the
> goto duplicate_error approach which could reduce the LOC by ~80.
>
> I don't see it in the current commitfest, can we park it there so that
> the patch will get tested on cfbot systems?

I have added an entry in commitfest:
https://commitfest.postgresql.org/33/3103/

Regards,
Vignesh




Toast compression method options

2021-05-03 Thread Dilip Kumar
We have already pushed the configurable lz4 toast compression code[1].
In the custom compression thread, we already had a patch to
support compression method options[2].  But the design of the
base patches was heavily modified before commit, and I never rebased
this patch onto the new design.  Now, I have rebased this patch so
that we don't lose track and we can work on this for v15.  This is
still a WIP patch.

For v15 I will work on improving the code and I will also work on
analyzing the usage of compression method options (compression
speed/ratio).
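
Purely as an illustration of the sort of thing this enables (hypothetical syntax, not necessarily what the WIP patch implements):

CREATE TABLE blobs (
    payload text COMPRESSION lz4 WITH (acceleration = 2)
);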

[1] 
https://www.postgresql.org/message-id/E1lNKw9-0008DT-1L%40gemulon.postgresql.org
[2] 
https://www.postgresql.org/message-id/CAFiTN-s7fno8pGwfK7jwSf7uNaVhPZ38C3LAcF%2B%3DWHu7jNvy7g%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


WIP-0001-Toast-compression-method-options.patch
Description: Binary data


Re: Logical Replication - behavior of TRUNCATE ... CASCADE

2021-05-03 Thread Bharath Rupireddy
On Mon, May 3, 2021 at 1:02 PM Dilip Kumar  wrote:
>
> I think you are comparing the user-exposed behavior with the internal
> code comments.  The meaning of the comments is that it should not
> truncate any table on the subscriber using cascade, because there might be
> some subscriber-specific relations that depend upon the primary table
> and those should not get truncated as a side-effect of the cascade.
>
> For example, you can slightly change your example as below
> > On subscriber:
> > CREATE TABLE tbl_pk(id int primary key);
> > CREATE TABLE tbl_fk_sub(fkey int references tbl_pk(id));   -> this table 
> > doesn't refer to tbl_pk on the publisher
>
> So now, as part of truncating tbl_pk, tbl_fk_sub would not get
> truncated, and that is what the comment is trying to say.

Thanks. I was under the impression that the subscriber table cannot have
references to other subscriber-local tables and that they should also
have the same column constraints as the publisher table columns.
But I was wrong. I tried the use case [1] where the subscriber table
tbl_pk, that was subscribed to the changes from the publisher, is
being referenced by another subscriber-local table tbl_fk. In this
case, the comment and the code that sends only RESTRICT behaviour
ignoring the upstream CASCADE option make sense.

Having said that, wouldn't it be good to provide a subscription-level
(CREATE/ALTER) option, say "cascade" (similar to other options
such as binary, synchronous_commit, stream), defaulting to false, which
when set to true makes us pass the upstream CASCADE option to
ExecuteTruncateGuts in apply_handle_truncate? It would be useful for
truncating all the dependent tables on the subscriber. Users will have
to use it with caution though.
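
Purely as an illustration of the proposal, the syntax might look like this (hypothetical, no such option exists today):

CREATE SUBSCRIPTION testsub
    CONNECTION 'host=localhost dbname=postgres user=bharath port=5432'
    PUBLICATION testpub
    WITH (cascade = true);

ALTER SUBSCRIPTION testsub SET (cascade = false);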

Note that the comment already says this:
/*
 * Even if we used CASCADE on the upstream primary we explicitly default
 * to replaying changes without further cascading. This might be later
 * changeable with a user specified option.
 */

Thoughts?

[1]
On publisher:
DROP TABLE tbl_pk CASCADE;
CREATE TABLE tbl_pk(id int primary key);
INSERT INTO tbl_pk (SELECT x FROM generate_series(1,10) x);
DROP PUBLICATION testpub;
CREATE PUBLICATION testpub FOR TABLE tbl_pk;

On subscriber:
DROP TABLE tbl_pk CASCADE;
CREATE TABLE tbl_pk(id int primary key);
DROP TABLE tbl_fk;
CREATE TABLE tbl_fk(id1 int references tbl_pk(id));
DROP SUBSCRIPTION testsub;
CREATE SUBSCRIPTION testsub CONNECTION 'host=localhost dbname=postgres
user=bharath port=5432' PUBLICATION testpub;
INSERT INTO tbl_fk (SELECT x FROM generate_series(1,10) x);

On publisher:
TRUNCATE tbl_pk CASCADE;
SELECT count(id) FROM tbl_pk; -- 0

On subscriber we get error, because the RESTRICT option is passed to
ExecuteTruncateGuts in logical apply worker and the table tbl_pk is
referenced by tbl_fk, so tbl_pk is not truncated.
SELECT count(id) FROM tbl_pk; -- non-zero

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2021-05-03 Thread Masahiko Sawada
On Sun, May 2, 2021 at 1:23 AM Zhihong Yu  wrote:
>
>
>
> On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada  wrote:
>>
>> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu  wrote:
>> >
>> > Hi,
>> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
>>
>> Thank you for reviewing the patch!
>>
>> >
>> > With this commit, the foreign server modified within the transaction 
>> > marked as 'modified'.
>> >
>> > transaction marked -> transaction is marked
>>
>> Will fix.
>>
>> >
>> > +#define IsForeignTwophaseCommitRequested() \
>> > +(foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>> >
>> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the 
>> > macro should be named: IsForeignTwophaseCommitRequired.
>>
>> But even if foreign_twophase_commit is
>> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
>> there is only one modified server, right? It seems the name
>> IsForeignTwophaseCommitRequested is fine.
>>
>> >
>> > +static bool
>> > +checkForeignTwophaseCommitRequired(bool local_modified)
>> >
>> > +   if (!ServerSupportTwophaseCommit(fdw_part))
>> > +   have_no_twophase = true;
>> > ...
>> > +   if (have_no_twophase)
>> > +   ereport(ERROR,
>> >
>> > It seems the error case should be reported within the loop. This way, we 
>> > don't need to iterate the other participant(s).
>> > Accordingly, nserverswritten should be incremented for local server prior 
>> > to the loop. The condition in the loop would become if 
>> > (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
>> > have_no_twophase is no longer needed.
>>
>> Hmm, I think if we process one non-2PC-capable server first and then
>> another 2PC-capable server, we should raise an error but
>> cannot detect that.
>
>
> Then the check would stay as what you have in the patch:
>
>   if (!ServerSupportTwophaseCommit(fdw_part))
>
> When the non-2pc-capable server is encountered, we would report the error in 
> place (following the ServerSupportTwophaseCommit check) and come out of the 
> loop.
> have_no_twophase can be dropped.

But if we processed only one non-2pc-capable server, we would raise an
error but should not in that case.

On second thought, I think we can track how many servers are modified
or not capable of 2PC during registration and unregistration. Then we
can determine both whether 2PC is required and whether a non-2PC-capable
server is involved without looking through all participants. Thoughts?
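
A rough sketch of that bookkeeping, with invented names, just to illustrate the idea:

	/* maintained from the participant (un)registration paths */
	static int	nparticipants_modified;	/* foreign servers written in this xact */
	static int	nparticipants_no_2pc;	/* written servers lacking 2PC support */

	/* the commit-time check then needs no walk over the participant list */
	if (nparticipants_modified + (local_modified ? 1 : 0) > 1 &&
		nparticipants_no_2pc > 0)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("cannot process a distributed transaction involving a foreign server that does not support two-phase commit")));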

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Replication slot stats misgivings

2021-05-03 Thread Masahiko Sawada
On Mon, May 3, 2021 at 2:29 PM Amit Kapila  wrote:
>
> On Fri, Apr 30, 2021 at 1:47 PM Amit Kapila  wrote:
> >
> > LGTM. I have slightly edited the comments in the attached. I'll push
> > this early next week unless there are more comments.
> >
>
> Pushed.

Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Replication slot stats misgivings

2021-05-03 Thread Masahiko Sawada
On Mon, May 3, 2021 at 2:27 PM Amit Kapila  wrote:
>
> On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila  wrote:
> >
> > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > > I am not sure if any of these alternatives are a good idea. What do
> > > > you think? Do you have any other ideas for this?
> > >
> > > I've been considering some ideas but don't come up with a good one
> > > yet. It’s just an idea and not tested but how about having
> > > CreateDecodingContext() register before_shmem_exit() callback with the
> > > decoding context to ensure that we send slot stats even on
> > > interruption. And FreeDecodingContext() cancels the callback.
> > >
> >
> > Is it a good idea to send stats while exiting and rely on the same? I
> > think before_shmem_exit is mostly used for the cleanup purpose so not
> > sure if we can rely on it for this purpose. I think we can't be sure
> > that in all cases we will send all stats, so maybe Vignesh's patch is
> > sufficient to cover the cases where we avoid losing it in cases where
> > we would have sent a large amount of data.
> >
>
> Sawada-San, any thoughts on this point?

before_shmem_exit is mostly used for cleanup purposes, but how about
on_shmem_exit()? pgstats relies on that to send stats on
interruption. See pgstat_shutdown_hook().

That being said, I agree Vignesh's patch would cover most cases. If we
don't find any better solution, I think we can go with Vignesh's
patch.

> Apart from this, I think you
> have suggested somewhere in this thread to slightly update the
> description of stream_bytes. I would like to update the description of
> stream_bytes and total_bytes as below:
>
> stream_bytes
> Amount of transaction data decoded for streaming in-progress
> transactions to the decoding output plugin while decoding changes from
> WAL for this slot. This and other streaming counters for this slot can
> be used to tune logical_decoding_work_mem.
>
> total_bytes
> Amount of transaction data decoded for sending transactions to the
> decoding output plugin while decoding changes from WAL for this slot.
> Note that this includes data that is streamed and/or spilled.
>
> This update considers two points:
> a. we don't send this data across the network because plugin might
> decide to filter this data, ex. based on publications.
> b. not all of the decoded changes are sent to plugin, consider
> REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
> REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT, etc.

Looks good to me.
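
As a quick illustration, these are the counters exposed in the pg_stat_replication_slots view, e.g.:

SELECT slot_name, total_txns, total_bytes,
       stream_txns, stream_bytes, spill_bytes
FROM pg_stat_replication_slots;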

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: strange error reporting

2021-05-03 Thread Peter Eisentraut

On 21.01.21 02:33, Tom Lane wrote:

I'd be inclined to spell it "connection to server at ... failed",
but that sort of wording is surely also possible.



"connection to server" rather than "connection to database" works for
me; in fact, I think I like it slightly better.


If I don't hear any other opinions, I'll change these messages to

"connection to server at socket \"%s\" failed:"
"connection to server at \"%s\" (%s), port %s failed:"

(or maybe "server on socket"?  "at" sounds right for the IP address
case, but it feels a little off in the socket pathname case.)


I was just trying some stuff with PG14, which led me to this thread.

I find these new error messages to be more distracting than before in 
some cases.  For example:


PG13:

clusterdb: error: could not connect to database typo: FATAL:  database "typo" does not exist


PG14:

clusterdb: error: connection to server on socket "/tmp/.s.PGSQL.65432" failed: FATAL:  database "typo" does not exist


Throwing the socket address in there seems a bit distracting and 
misleading, and it also pushes off the actual information very far to 
the end.  (Also, in some cases the socket path is very long, making the 
actual information even harder to find.)  By the time you get to this 
error, you have already connected, so mentioning the server address 
seems secondary at best.





Re: how to correctly cast json value to text?

2021-05-03 Thread Pavel Stehule
po 3. 5. 2021 v 11:26 odesílatel Marko Tiikkaja  napsal:

> On Mon, May 3, 2021 at 12:24 PM Pavel Stehule 
> wrote:
>
>> Is it possible to do this with built-in functionality?
>>
>> I miss a cast function from a json scalar string value to text.
>>
>
> #>>'{}'
>

It is working. Thank you. But this syntax is a little bit scary.  Maybe we
can introduce some functions for this case. Before PG 14 this
functionality was not necessary, but now it would be nice to have it.

DO $$
DECLARE v jsonb;
BEGIN
  -- the value must be valid JSON
  v['a'] = '"Ahoj"';
  RAISE NOTICE '%', v['a'] #>> '{}';
END;
$$;
NOTICE:  Ahoj
DO

Any ideas about the name of this function?

CREATE OR REPLACE FUNCTION jsonscalar_to_text(jsonb)
RETURNS text AS $$
SELECT $1 #>> '{}'
$$ LANGUAGE sql;
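
With a helper like that in place, the earlier example reduces to:

SELECT jsonscalar_to_text('"Ahoj"'::jsonb);
-- returns: Ahoj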



>
> .m
>


Re: how to correctly cast json value to text?

2021-05-03 Thread Marko Tiikkaja
On Mon, May 3, 2021 at 12:24 PM Pavel Stehule 
wrote:

> Is it possible to do this with built-in functionality?
>
> I miss a cast function from a json scalar string value to text.
>

#>>'{}'


.m


Re: how to correctly cast json value to text?

2021-05-03 Thread Pavel Stehule
Hi

po 3. 5. 2021 v 11:15 odesílatel Pavel Stehule 
napsal:

> Hi
>
> I am testing a new subscripting interface for jsonb, and I found one issue.
>
> DO $$
> DECLARE v jsonb;
> BEGIN
>   v['a'] = '"Ahoj"';
>   RAISE NOTICE '%', v['a'];
> END;
> $$;
> NOTICE:  "Ahoj"
> DO
>
> When I use this interface for reading, the jsonb type is returned. What is
> the correct way to cast from a jsonb text value to text? I would not want
> double quotes inside the result. Casting to text doesn't help. For the
> operator API we can use the "->>" symbol, but we have nothing similar for
> the subscript API.
>

now I need function like

CREATE OR REPLACE FUNCTION public.value_to_text(jsonb)
 RETURNS text
 LANGUAGE plpgsql
 IMMUTABLE
AS $function$
DECLARE x jsonb;
BEGIN
  x['x'] = $1;
  RETURN x->>'x';
END;
$function$

DO $$
DECLARE v jsonb;
BEGIN
  -- the value must be valid JSON
  v['a'] = '"Ahoj"';
  RAISE NOTICE '%', value_to_text(v['a']);
END;
$$;
NOTICE:  Ahoj
DO

Is it possible to do this with built-in functionality?

I miss a cast function from a json scalar string value to text.

Regards

Pavel


> Regards
>
> Pavel
>
>
>


how to correctly cast json value to text?

2021-05-03 Thread Pavel Stehule
Hi

I am testing a new subscripting interface for jsonb, and I found one issue.

DO $$
DECLARE v jsonb;
BEGIN
  v['a'] = '"Ahoj"';
  RAISE NOTICE '%', v['a'];
END;
$$;
NOTICE:  "Ahoj"
DO

When I use this interface for reading, the jsonb type is returned. What is
the correct way to cast from a jsonb text value to text? I would not want
double quotes inside the result. Casting to text doesn't help. For the
operator API we can use the "->>" symbol, but we have nothing similar for
the subscript API.

Regards

Pavel


Re: Identify missing publications from publisher while create/alter subscription.

2021-05-03 Thread Dilip Kumar
On Mon, May 3, 2021 at 10:48 AM Dilip Kumar  wrote:
>
> On Sun, May 2, 2021 at 10:04 PM vignesh C  wrote:
> >
> > Thanks for the comments.
> > The Attached patch has the fixes for the same.
>
> I was reviewing the documentation part, I think in the below paragraph
> we should include validate_publication as well?
>
>
> connect (boolean)
> 
>  
>   Specifies whether the CREATE SUBSCRIPTION
>   should connect to the publisher at all.  Setting this to
>   false will change default values of
>   enabled, create_slot and
>   copy_data to false.
>  
>
> I will review/test the other parts of the patch and let you know.

I have reviewed it and it mostly looks good to me.  I have some minor
suggestions though.

1.
+/*
+ * Check the specified publication(s) is(are) present in the publisher.
+ */

vs

+
+/*
+ * Connect to the publisher and check if the publications exist.
+ */

I think the formatting of the comments is not uniform.  Some places
we are using "publication(s) is(are)" whereas other places are just
"publications".

2. Add an error case for connect=false and VALIDATE_PUBLICATION = true



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Enhanced error message to include hint messages for redundant options error

2021-05-03 Thread Dilip Kumar
On Mon, May 3, 2021 at 12:08 PM Bharath Rupireddy
 wrote:
>
> On Sun, May 2, 2021 at 8:44 PM vignesh C  wrote:
> >
> > On Sat, May 1, 2021 at 9:02 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Sat, May 1, 2021 at 7:25 PM vignesh C  wrote:
> > > > > > I'm not attaching above one line change as a patch, maybe Vignesh 
> > > > > > can
> > > > > > merge this into the main patch.
> > > >
> > > > Thanks for the comments. I have merged the change into the attached 
> > > > patch.
> > > > Thoughts?
> > >
> > > Thanks! v4 basically LGTM. Can we park this in the current commitfest
> > > if not done already?
> > >
> > > Upon looking at the number of places where we have the "option \"%s\"
> > > specified more than once" error, I, now strongly feel that we should
> > > use goto duplicate_error approach like in compute_common_attribute, so
> > > that we will have only one ereport(ERROR. We can change it in
> > > following files: copy.c, dbcommands.c, extension.c,
> > > compute_function_attributes, sequence.c, subscriptioncmds.c,
> > > typecmds.c, user.c, walsender.c, pgoutput.c. This will reduce the LOC
> > > greatly.
> > >
> > > Thoughts?
> >
> > I have made the changes for this, I have posted the same in the v5
> > patch posted in my earlier mail.
>
> Thanks! The v5 patch looks good to me. Let's see if all agree on the
> goto duplicate_error approach which could reduce the LOC by ~80.

I think the "goto duplicate_error" approach looks good, it avoids
duplicating the same error code multiple times.
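
For readers who have not looked at compute_common_attribute, the shape being proposed is roughly the following, shown here for one option of parse_subscription_options (sketch only, not the exact patch text):

	DefElem    *defel;
	ListCell   *lc;

	foreach(lc, options)
	{
		defel = (DefElem *) lfirst(lc);

		if (strcmp(defel->defname, "copy_data") == 0)
		{
			if (copy_data_given)
				goto duplicate_error;
			copy_data_given = true;
			*copy_data = defGetBoolean(defel);
		}
		/* ... all other options follow the same pattern ... */
	}
	return;

duplicate_error:
	ereport(ERROR,
			(errcode(ERRCODE_SYNTAX_ERROR),
			 errmsg("option \"%s\" specified more than once", defel->defname)));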

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Logical Replication - behavior of TRUNCATE ... CASCADE

2021-05-03 Thread Dilip Kumar
On Mon, May 3, 2021 at 12:37 PM Bharath Rupireddy
 wrote:
>
> On Mon, May 3, 2021 at 11:59 AM Dilip Kumar  wrote:
> >
> > On Mon, May 3, 2021 at 10:42 AM Bharath Rupireddy
> >  wrote:
> > >
> > > Hi,
> > >
> > > In apply_handle_truncate, the following comment before 
> > > ExecuteTruncateGuts says that it defaults to RESTRICT even if the CASCADE 
> > > option has been specified in publisher's TRUNCATE command.
> > > /*
> > >  * Even if we used CASCADE on the upstream primary we explicitly 
> > > default
> > >  * to replaying changes without further cascading. This might be later
> > >  * changeable with a user specified option.
> > >  */
> > > I tried the following use case to see if that's actually true:
> > > 1) Created two tables tbl_pk (primary key), tbl_fk(references tbl_pk 
> > > primary key via foreign key) on both publisher and subscriber.
> > > 2) In general, TRUNCATE tbl_pk; or TRUNCATE tbl_pk RESTRICT; would fail 
> > > because tbl_fk is dependent on tbl_pk.
> > > 3) TRUNCATE tbl_pk, tbl_fk; would work because the dependent table is 
> > > specified in the command.
> > > 4) TRUNCATE tbl_pk CASCADE; would work because of the CASCADE option and 
> > > both tbl_pk and tbl_fk are truncated. When this command is run on the 
> > > publisher, the CASCADE option is sent to the subscriber, see 
> > > DecodeTruncate. But the apply worker ignores it and passes DROP_RESTRICT 
> > > to ExecuteTruncateGuts. Therefore, the expectation(per the comment) is 
> > > that on the subscriber, the behavior should be equivalent to TRUNCATE 
> > > tbl_pk;, so an error is expected. But we are also receiving the tbl_fk in 
> > > the remote rels along with tbl_pk, so the behavior is equivalent to (3) 
> > > and both tbl_pk and tbl_fk are truncated.
> > >
> > > Does the comment still hold true? Does ignoring the CASCADE option make 
> > > sense in apply_handle_truncate, as we are receiving all the dependent 
> > > relations in the remote rels from the publisher? Am I missing something?
> > >
> > > The commit id of the feature "Logical replication support for TRUNCATE" 
> > > is 039eb6e92f, and adding relevant people in cc.
> >
> > Assume this case
> > publisher: tbl_pk -> tbl_fk_pub
> > subscriber: tbl_pk-> tbl_fk_sub
> >
> > Now, in this case, this comment is true right because we are not
> > supposed to truncate tbl_fk_sub on the subscriber side and this should
> > error out.
>
> Here's what I tried, let me know if I'm wrong:
>
> On publisher:
> CREATE TABLE tbl_pk(id int primary key);
> CREATE TABLE tbl_fk(fkey int references tbl_pk(id));
> INSERT INTO tbl_pk (SELECT x FROM generate_series(1,10) x);
> INSERT INTO tbl_fk (SELECT x % 10 + 1 FROM generate_series(5,25) x);
> DROP PUBLICATION testpub;
> CREATE PUBLICATION testpub FOR TABLE tbl_pk, tbl_fk;
>
> On subscriber:
> CREATE TABLE tbl_pk(id int primary key);
> CREATE TABLE tbl_fk(fkey int references tbl_pk(id));
> DROP SUBSCRIPTION testsub;
> CREATE SUBSCRIPTION testsub CONNECTION 'host=localhost dbname=postgres
> user=bharath port=5432' PUBLICATION testpub;
>
> On both publisher and subscriber to ensure that the initial rows were
> replicated:
> SELECT count(id) FROM tbl_pk; -- non zero
> SELECT count(fkey) FROM tbl_fk; -- non zero
>
> On publisher:
> TRUNCATE tbl_pk CASCADE;
> SELECT count(id) FROM tbl_pk; -- 0
> SELECT count(fkey) FROM tbl_fk; -- 0
>
> On subscriber also we get to see 0 rows:
> SELECT count(id) FROM tbl_pk; -- 0
> SELECT count(fkey) FROM tbl_fk; -- 0
>
> But the comment says that tbl_fk shouldn't be truncated as it doesn't
> pass the cascade option to ExecuteTruncateGuts even though it was
> received from the publisher. This behaviour is not in accordance with
> the comment, right?

I think you are comparing the user-exposed behavior with the internal
code comments.  The meaning of the comments is that it should not
truncate any table on the subscriber using cascade, because there might be
some subscriber-specific relations that depend upon the primary table
and those should not get truncated as a side-effect of the cascade.

For example, you can slightly change your example as below
> On subscriber:
> CREATE TABLE tbl_pk(id int primary key);
> CREATE TABLE tbl_fk_sub(fkey int references tbl_pk(id));   -> this table 
> doesn't refer to tbl_pk on the publisher

So now, as part of truncating tbl_pk, tbl_fk_sub would not get
truncated, and that is what the comment is trying to say.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Logical Replication - behavior of TRUNCATE ... CASCADE

2021-05-03 Thread Bharath Rupireddy
On Mon, May 3, 2021 at 11:59 AM Dilip Kumar  wrote:
>
> On Mon, May 3, 2021 at 10:42 AM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > In apply_handle_truncate, the following comment before ExecuteTruncateGuts 
> > says that it defaults to RESTRICT even if the CASCADE option has been 
> > specified in publisher's TRUNCATE command.
> > /*
> >  * Even if we used CASCADE on the upstream primary we explicitly default
> >  * to replaying changes without further cascading. This might be later
> >  * changeable with a user specified option.
> >  */
> > I tried the following use case to see if that's actually true:
> > 1) Created two tables tbl_pk (primary key), tbl_fk(references tbl_pk 
> > primary key via foreign key) on both publisher and subscriber.
> > 2) In general, TRUNCATE tbl_pk; or TRUNCATE tbl_pk RESTRICT; would fail 
> > because tbl_fk is dependent on tbl_pk.
> > 3) TRUNCATE tbl_pk, tbl_fk; would work because the dependent table is 
> > specified in the command.
> > 4) TRUNCATE tbl_pk CASCADE; would work because of the CASCADE option and 
> > both tbl_pk and tbl_fk are truncated. When this command is run on the 
> > publisher, the CASCADE option is sent to the subscriber, see 
> > DecodeTruncate. But the apply worker ignores it and passes DROP_RESTRICT to 
> > ExecuteTruncateGuts. Therefore, the expectation(per the comment) is that on 
> > the subscriber, the behavior should be equivalent to TRUNCATE tbl_pk;, so 
> > an error is expected. But we are also receiving the tbl_fk in the remote 
> > rels along with tbl_pk, so the behavior is equivalent to (3) and both 
> > tbl_pk and tbl_fk are truncated.
> >
> > Does the comment still hold true? Does ignoring the CASCADE option make 
> > sense in apply_handle_truncate, as we are receiving all the dependent 
> > relations in the remote rels from the publisher? Am I missing something?
> >
> > The commit id of the feature "Logical replication support for TRUNCATE" is 
> > 039eb6e92f, and adding relevant people in cc.
>
> Assume this case
> publisher: tbl_pk -> tbl_fk_pub
> subscriber: tbl_pk-> tbl_fk_sub
>
> Now, in this case, this comment is true right because we are not
> supposed to truncate tbl_fk_sub on the subscriber side and this should
> error out.

Here's what I tried, let me know if I'm wrong:

On publisher:
CREATE TABLE tbl_pk(id int primary key);
CREATE TABLE tbl_fk(fkey int references tbl_pk(id));
INSERT INTO tbl_pk (SELECT x FROM generate_series(1,10) x);
INSERT INTO tbl_fk (SELECT x % 10 + 1 FROM generate_series(5,25) x);
DROP PUBLICATION testpub;
CREATE PUBLICATION testpub FOR TABLE tbl_pk, tbl_fk;

On subscriber:
CREATE TABLE tbl_pk(id int primary key);
CREATE TABLE tbl_fk(fkey int references tbl_pk(id));
DROP SUBSCRIPTION testsub;
CREATE SUBSCRIPTION testsub CONNECTION 'host=localhost dbname=postgres
user=bharath port=5432' PUBLICATION testpub;

On both publisher and subscriber to ensure that the initial rows were
replicated:
SELECT count(id) FROM tbl_pk; -- non zero
SELECT count(fkey) FROM tbl_fk; -- non zero

On publisher:
TRUNCATE tbl_pk CASCADE;
SELECT count(id) FROM tbl_pk; -- 0
SELECT count(fkey) FROM tbl_fk; -- 0

On subscriber also we get to see 0 rows:
SELECT count(id) FROM tbl_pk; -- 0
SELECT count(fkey) FROM tbl_fk; -- 0

But the comment says that tbl_fk shouldn't be truncated as it doesn't
pass the cascade option to ExecuteTruncateGuts even though it was
received from the publisher. This behaviour is not in accordance with
the comment, right?
The reason this is so: the publisher sends both the tbl_pk and tbl_fk
rels to the subscriber, and TRUNCATE tbl_pk, tbl_fk; is allowed
(see the code in heap_truncate_check_FKs) even if the RESTRICT option is
specified.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com