Re: File truncation within PostgresNode::issues_sql_like() wrong on Windows

2021-04-14 Thread Michael Paquier
On Wed, Apr 14, 2021 at 09:26:19PM -0400, Andrew Dunstan wrote:
> Well, let me try it on fairywren tomorrow. Since we manage this on the
> buildfarm without any use at all of Win32API::File it might not be
> necessary in TAP code either, particularly if we're not truncating the file.

Thanks.  If that proves to not be necessary, +1 to remove this code.
I have been playing with this stuff, and the attached patch seems to
work properly on Windows.  On top of that, I have also tested the
non-Win32 path on an MSVC box to see that it was working, but my
environment does not usually get noisy about such compatibility
issues.
--
Michael
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e26b2b3f30..e209ea7163 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1917,21 +1917,7 @@ sub connect_ok
 		@log_unlike = @{ $params{log_unlike} };
 	}
 
-	if (@log_like or @log_unlike)
-	{
-		# Don't let previous log entries match for this connection.
-		# On Windows, the truncation would not work, so rotate the log
-		# file before restarting the server afresh.
-		if ($TestLib::windows_os)
-		{
-			$self->rotate_logfile;
-			$self->restart;
-		}
-		else
-		{
-			truncate $self->logfile, 0;
-		}
-	}
+	my $log_location = -s $self->logfile;
 
 	# Never prompt for a password, any callers of this routine should
 	# have set up things properly, and this should not block.
@@ -1950,7 +1936,8 @@ sub connect_ok
 	}
 	if (@log_like or @log_unlike)
 	{
-		my $log_contents = TestLib::slurp_file($self->logfile);
+		my $log_contents = TestLib::slurp_file($self->logfile,
+		   $log_location);
 
 		while (my $regex = shift @log_like)
 		{
@@ -2001,21 +1988,7 @@ sub connect_fails
 		@log_unlike = @{ $params{log_unlike} };
 	}
 
-	if (@log_like or @log_unlike)
-	{
-		# Don't let previous log entries match for this connection.
-		# On Windows, the truncation would not work, so rotate the log
-		# file before restarting the server afresh.
-		if ($TestLib::windows_os)
-		{
-			$self->rotate_logfile;
-			$self->restart;
-		}
-		else
-		{
-			truncate $self->logfile, 0;
-		}
-	}
+	my $log_location = -s $self->logfile;
 
 	# Never prompt for a password, any callers of this routine should
 	# have set up things properly, and this should not block.
@@ -2034,7 +2007,8 @@ sub connect_fails
 
 	if (@log_like or @log_unlike)
 	{
-		my $log_contents = TestLib::slurp_file($self->logfile);
+		my $log_contents = TestLib::slurp_file($self->logfile,
+		   $log_location);
 
 		while (my $regex = shift @log_like)
 		{
@@ -2194,9 +2168,6 @@ sub command_checks_all
 Run a command on the node, then verify that $expected_sql appears in the
 server log file.
 
-Reads the whole log file so be careful when working with large log outputs.
-The log file is truncated prior to running the command, however.
-
 =cut
 
 sub issues_sql_like
@@ -2207,10 +2178,11 @@ sub issues_sql_like
 
 	local %ENV = $self->_get_env();
 
-	truncate $self->logfile, 0;
+	my $log_location = -s $self->logfile;
+
 	my $result = TestLib::run_log($cmd);
 	ok($result, "@$cmd exit code 0");
-	my $log = TestLib::slurp_file($self->logfile);
+	my $log = TestLib::slurp_file($self->logfile, $log_location);
 	like($log, $expected_sql, "$test_name: SQL found in server log");
 	return;
 }
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 1baf6bd001..017d0f08e0 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -124,7 +124,7 @@ BEGIN
 	if ($windows_os)
 	{
 		require Win32API::File;
-		Win32API::File->import(qw(createFile OsFHandleOpen CloseHandle));
+		Win32API::File->import(qw(createFile OsFHandleOpen CloseHandle setFilePointer));
 	}
 
 	# Specifies whether to use Unix sockets for test setups.  On
@@ -430,21 +430,27 @@ sub slurp_dir
 
 =pod
 
-=item slurp_file(filename)
+=item slurp_file(filename [, $offset])
 
-Return the full contents of the specified file.
+Return the full contents of the specified file, beginning from an
+offset position if specified.
 
 =cut
 
 sub slurp_file
 {
-	my ($filename) = @_;
+	my ($filename, $offset) = @_;
 	local $/;
 	my $contents;
 	if ($Config{osname} ne 'MSWin32')
 	{
 		open(my $in, '<', $filename)
 		  or croak "could not read \"$filename\": $!";
+		if (defined($offset))
+		{
+			seek($in, $offset, 0)
+			  or croak "could not seek \"$filename\": $!";
+		}
 		$contents = <$in>;
 		close $in;
 	}
@@ -454,6 +460,11 @@ sub slurp_file
 		  or croak "could not open \"$filename\": $^E";
 		OsFHandleOpen(my $fh = IO::Handle->new(), $fHandle, 'r')
 		  or croak "could not read \"$filename\": $^E\n";
+		if (defined($offset))
+		{
+			setFilePointer($fh, $offset, qw(FILE_BEGIN))
+			  or croak "could not seek \"$filename\": $^E\n";
+		}
 		$contents = <$fh>;
 		CloseHandle($fHandle)
 		  or croak "could not close \"$filename\": $^E\n";




Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Peter Geoghegan
On Wed, Apr 14, 2021 at 8:38 PM Andres Freund  wrote:
> The reason I didn't do further reviews for things in this thread was
> that I was trying really hard to get the shared memory stats patch into
> a committable shape - there were just not enough hours in the day. I
> think it's to be expected that, during the final CF, there aren't a lot
> of resources for reviewing patches that are substantially new.  Why
> should these new patches have gotten priority over a much older patch
> set that also addresses significant operational issues?

We're all doing our best.

> It's very common for larger / busier databases to *substantially*
> increase autovacuum_freeze_max_age, so there won't be 1.6 billion XIDs
> of headroom, but a few hundred million. The cost of doing unnecessary
> anti-wraparound vacuums is just too great. And databases on the busier &
> larger side of things are precisely the ones that are more likely to hit
> wraparound issues (otherwise you're just not that likely to burn through
> that many xids).

I think that this was once true, but is now much less common, mostly
due to the freeze map stuff in 9.6. And due to a general recognition
that the *risk* of increasing it is just too great (a risk that we can
hope was diminished by the failsafe, incidentally). As an example of
this, Christophe Pettus had a Damascene conversion when it came to
increasing autovacuum_freeze_max_age aggressively, which he explains
here:

https://thebuild.com/blog/2019/02/08/do-not-change-autovacuum-age-settings/

In short, he went from regularly advising clients to increase
autovacuum_freeze_max_age to specifically advising them to never
touch it.

Even if we assume that I'm 100% wrong about autovacuum_freeze_max_age,
it's still true that the vacuum_failsafe_age GUC is interpreted with
reference to autovacuum_freeze_max_age -- it will always be
interpreted as if it was set to 105% of whatever the current value of
autovacuum_freeze_max_age happens to be (so it's symmetric with the
freeze_table_age GUC and its 95% behavior). So it can never be
completely unreasonable in the sense of directly clashing with an
existing autovacuum_freeze_max_age setting from before the upgrade.
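
To make that concrete, the clamping amounts to something like this (a
rough sketch; the function is made up, though the two GUC variables are
real):

/*
 * Sketch only: the age at which the failsafe can trigger is never
 * allowed to be lower than 105% of autovacuum_freeze_max_age,
 * mirroring how vacuum_freeze_table_age is capped at 95% of it.
 */
static int
effective_failsafe_age(void)
{
    int         failsafe_age = vacuum_failsafe_age;

    failsafe_age = Max(failsafe_age,
                       (int) (autovacuum_freeze_max_age * 1.05));
    return failsafe_age;
}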

Of course this doesn't mean that there couldn't possibly be any
problems with the new mechanism clashing with
autovacuum_freeze_max_age in some unforeseen way. But, the worst that
can happen is that a user that is sophisticated enough to very
aggressively increase autovacuum_freeze_max_age upgrades to Postgres
14, and then finds that index vacuuming is sometimes skipped. Which
they'll see lots of annoying and scary messages about if they ever
look in the logs. I think that that's an acceptable price to pay to
protect the majority of less sophisticated users.

> And my concern isn't really that vacuum would have finished without a
> problem if cost limiting hadn't been disabled, but that having multiple
> autovacuum workers going all out will cause problems. Like the system
> slowing down so much that it's hard to fix the actual root cause of the
> wraparound - I've seen systems with a bunch of unthrottled autovacuums
> overwhelm the IO subsystem so much that simply opening a connection to
> fix the issue took 10+ minutes. Especially on systems with provisioned
> IO (i.e. just about all cloud storage) that's not too hard to hit.

I don't think that it's reasonable to expect an intervention like this
to perfectly eliminate all risk, while at the same time never
introducing any new theoretical risks. (Especially while also being
simple and obviously correct.)

> > If it's intrinsically impossible to advance relfrozenxid, then surely
> > all bets are off. But even in this scenario it's very unlikely that we
> > wouldn't at least do index vacuuming for those index tuples that are
> > dead and safe to delete according to the OldestXmin cutoff. You still
> > have 1.6 billion XIDs before the failsafe first kicks in, regardless
> > of the issue of the OldestXmin/FreezeLimit being excessively far in
> > the past.
>
> As I said above, I don't think the "1.6 billion XIDs" argument has
> merit, because it's so reasonable (and common) to set
> autovacuum_freeze_max_age to something much larger.

No merit? Really? Not even a teeny, tiny, microscopic little bit of
merit? You're sure?

As I said, we handle the case where autovacuum_freeze_max_age is set
to something larger than vacuum_failsafe_age in a straightforward and
pretty sensible way. I am curious, though: what
autovacuum_freeze_max_age setting is "much higher" than 1.6 billion,
but somehow also not extremely ill-advised and dangerous? What number
is that, precisely? Apparently this is common, but I must confess that
it's the first I've heard about it.

-- 
Peter Geoghegan




Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Andres Freund
Hi,

On 2021-04-14 19:53:29 -0700, Peter Geoghegan wrote:
> > Or at least
> > tests for it should be added (pg_resetwal + autovacuum_naptime=1s or
> > such should make that doable, or even just running a small test with
> > lower thresholds).
>
> You know what else doesn't have test coverage? Any kind of aggressive
> VACUUM. There is a problem with our culture around testing. I would
> like to address that in the scope of this project, but you know how it
> is. Can I take it that I'll have your support with adding those tests?

Sure!


> > I think there are good arguments for having logic for an "emergency
> > vacuum" mode (and also some good ones against). I'm not convinced that
> > the current set of things that are [not] skipped in failsafe mode is the
> > "obviously right set of things"™ but am convinced that there wasn't
> > enough consensus building on what that set of things is. This all also
> > would be different if it were the start of the development window,
> > rather than the end.
>
> I all but begged you to review the patches. Same with Robert. While
> the earlier patches (where almost all of the complexity is) did get
> review from both you and Robert (which I was grateful to receive), for
> whatever reason neither of you looked at the later patches in detail.

Based on a quick scan of the thread, the first version of a patch that
kind of resembles what got committed around the topic at hand seems to
be 
https://postgr.es/m/CAH2-Wzm7Y%3D_g3FjVHv7a85AfUbuSYdggDnEqN1hodVeOctL%2BOw%40mail.gmail.com
posted 2021-03-15. That's well into the last CF.

The reason I didn't do further reviews for things in this thread was
that I was trying really hard to get the shared memory stats patch into
a committable shape - there were just not enough hours in the day. I
think it's to be expected that, during the final CF, there aren't a lot
of resources for reviewing patches that are substantially new.  Why
should these new patches have gotten priority over a much older patch
set that also addresses significant operational issues?


> > I think there's also a clear danger in having "cliffs" where the
> > behaviour changes abruptly once a certain threshold is reached. It's
> > not unlikely for systems to fall over entirely when
> >
> > a) autovacuum cost limiting is disabled. E.g. reaching your disk
> >iops/throughput quota and barely being able to log into postgres
> >anymore to kill the stuck connection causing the wraparound issue.
>
> Let me get this straight: You're concerned that hurrying up vacuuming
> when we have 500 million XIDs left to burn will overwhelm the system,
> which would presumably have finished in time otherwise?
> Even though it would have to do way more work in absolute terms in the
> absence of the failsafe? And even though the 1.6 billion XID age that
> we got to before the failsafe kicked in was clearly not enough? You'd
> want to "play it safe", and stick with the original plan at that
> point?

It's very common for larger / busier databases to *substantially*
increase autovacuum_freeze_max_age, so there won't be 1.6 billion XIDs
of headroom, but a few hundred million. The cost of doing unnecessary
anti-wraparound vacuums is just too great. And databases on the busier &
larger side of things are precisely the ones that are more likely to hit
wraparound issues (otherwise you're just not that likely to burn through
that many xids).

And my concern isn't really that vacuum would have finished without a
problem if cost limiting hadn't been disabled, but that having multiple
autovacuum workers going all out will cause problems. Like the system
slowing down so much that it's hard to fix the actual root cause of the
wraparound - I've seen systems with a bunch of unthrottled autovacuums
overwhelm the IO subsystem so much that simply opening a connection to
fix the issue took 10+ minutes. Especially on systems with provisioned
IO (i.e. just about all cloud storage) that's not too hard to hit.


> > b) No index cleanup happens anymore. E.g. a workload with a lot of
> >bitmap index scans (which do not support killtuples) could end up
> >a lot worse because index pointers to dead tuples aren't being
> >cleaned up. In cases where an old transaction or leftover replication
> >slot is causing the problem (together a significant percentage of
> >wraparound situations) this situation will persist across repeated
> >(explicit or automatic) vacuums for a table, because relfrozenxid
> >won't actually be advanced. And this in turn might actually end up
> >slowing resolution of the wraparound issue more than doing all the
> >index scans.
>
> If it's intrinsically impossible to advance relfrozenxid, then surely
> all bets are off. But even in this scenario it's very unlikely that we
> wouldn't at least do index vacuuming for those index tuples that are
> dead and safe to delete according to the OldestXmin cutoff. You still
> have 1.6 billion XIDs before the 

Re: logical replication empty transactions

2021-04-14 Thread Ajin Cherian
On Thu, Sep 17, 2020 at 3:29 PM Michael Paquier  wrote:

> On Wed, Jul 29, 2020 at 08:08:06PM +0530, Rahila Syed wrote:
> > The make check passes.
>
> Since then, the patch is failing to apply, waiting on author and the
> thread has died 6 weeks or so ago, so I am marking it as RwF in the
> CF.
>
>
I've rebased the patch and made changes so that the patch supports
"streaming in-progress transactions" and handling of logical decoding
messages (transactional and non-transactional).
I see that this patch not only makes sure that empty transactions are not
sent, but also calls OutputPluginUpdateProgress when an empty transaction
is skipped; as a result, the confirmed_flush_lsn keeps moving. I also see
no hangs when a synchronous standby is configured.
Do let me know your thoughts on this patch.
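
For anyone skimming the thread, the general technique is roughly the
following (a hypothetical sketch, not the attached patch; PGOutputTxnData
and maybe_send_begin are made-up names, while the pgoutput callbacks and
the OutputPlugin*/logicalrep_write_* routines are the existing APIs):

/*
 * Hypothetical sketch: remember per transaction whether BEGIN has been
 * sent, send it lazily before the first change, and skip COMMIT for
 * transactions that sent nothing while still updating the plugin's
 * progress so that confirmed_flush_lsn keeps advancing.
 */
typedef struct PGOutputTxnData
{
    bool        sent_begin;     /* BEGIN already emitted for this txn? */
} PGOutputTxnData;

static void
pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    PGOutputTxnData *txndata;

    /* Defer sending BEGIN until we know the transaction is not empty. */
    txndata = MemoryContextAllocZero(ctx->context, sizeof(PGOutputTxnData));
    txn->output_plugin_private = txndata;
}

static void
maybe_send_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;

    if (txndata->sent_begin)
        return;
    OutputPluginPrepareWrite(ctx, false);
    logicalrep_write_begin(ctx->out, txn);
    OutputPluginWrite(ctx, false);
    txndata->sent_begin = true;
}

static void
pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                    XLogRecPtr commit_lsn)
{
    PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;

    if (!txndata->sent_begin)
    {
        /* Empty transaction: skip COMMIT, but keep reporting progress. */
        OutputPluginUpdateProgress(ctx);
        return;
    }
    OutputPluginPrepareWrite(ctx, true);
    logicalrep_write_commit(ctx->out, txn, commit_lsn);
    OutputPluginWrite(ctx, true);
}

pgoutput_change() and the other per-change callbacks would call
maybe_send_begin() before writing anything, so a BEGIN only goes out for
transactions that actually produce output.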

regards,
Ajin Cherian
Fujitsu Australia


v2-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Peter Geoghegan
On Wed, Apr 14, 2021 at 6:53 PM Andres Freund  wrote:
> > To a large degree the failsafe is something that is written in the
> > hope that it will never be needed. This is unlike most other things,
> > and has its own unique risks.
>
> Among them that the code is not covered by tests and is unlikely to be
> meaningfully exercised within the beta timeframe due to the timeframes
> for hitting it (hard to actually hit below a 1/2 day running extreme
> workloads, weeks for more realistic ones). Which means that this code
> has to be extra vigorously reviewed, not the opposite.

There is test coverage for the optimization to bypass index vacuuming
with very few dead tuples. Plus we can expect it to kick in very
frequently. That's not as good as testing this other mechanism
directly, which I agree ought to happen too. But the difference
between those two cases is pretty much entirely around when and how
they kick in. I have normalized the idea that index vacuuming is
optional in principle, so in an important sense it is tested all the
time.

> Or at least
> tests for it should be added (pg_resetwal + autovacuum_naptime=1s or
> such should make that doable, or even just running a small test with
> lower thresholds).

You know what else doesn't have test coverage? Any kind of aggressive
VACUUM. There is a problem with our culture around testing. I would
like to address that in the scope of this project, but you know how it
is. Can I take it that I'll have your support with adding those tests?

> This line of argumentation scares me. Unexplained arguments about
> running in conditions that we otherwise don't run in, during
> exceptional circumstances. This code has a history of super subtle
> interactions, with quite a few data loss causing bugs due to us not
> forseeing some combination of circumstances.

I'll say it again: I was wrong to not make that clearer prior to
committing the fixup. I regret that error, which probably had a lot to
do with being fatigued.

> I think there are good arguments for having logic for an "emergency
> vacuum" mode (and also some good ones against). I'm not convinced that
> the current set of things that are [not] skipped in failsafe mode is the
> "obviously right set of things"™ but am convinced that there wasn't
> enough consensus building on what that set of things is. This all also
> would be different if it were the start of the development window,
> rather than the end.

I all but begged you to review the patches. Same with Robert. While
the earlier patches (where almost all of the complexity is) did get
review from both you and Robert (which I was grateful to receive), for
whatever reason neither of you looked at the later patches in detail.
(Robert said that the failsafe ought to cover single-pass/no-indexes
VACUUM at one point, which did influence the design of the failsafe,
but for the most part his input on the later stuff was minimal and
expressed in general terms.)

Of course, nothing stops us from improving the mechanism in the
future. Though I maintain that the fundamental approach of finishing
as quickly as possible is basically sound (short of fixing the problem
directly, for example by obviating the need for freezing).

> In my experience the big problem with vacuums in a wraparound situation
> isn't actually things like truncation or even the index scans (although
> they certainly can cause bad problems), but that VACUUM modifies
> (prune/vacuum and WAL log or just setting hint bits) a crapton of pages
> that don't actually need to be modified just to be able to get out of
> the wraparound situation.

Things like UUID indexes are very popular, and are likely to have an
outsized impact on dirtying pages (which I agree is the real problem).
Plus some people just have a ridiculous amount of indexes (e.g., the
Discourse table that they pointed out as a particularly good target
for deduplication had a total of 13 indexes). There is an excellent
chance that stuff like that is involved in installations that actually
have huge problems. The visibility map works pretty well these days --
but not for indexes.

> And that the overhead of writing out all those
> dirty pages + WAL logging causes the VACUUM to take unacceptably
> long. E.g. because your storage is cloud storage with a few ms of
> latency, and the ringbuffer + wal_buffer sizes cause so many synchronous
> writes that you end up with < 10MB/s of data being processed.

This is a false dichotomy. There probably is an argument for making
the failsafe not do pruning that isn't strictly necessary (or
something like that) in a future release. I don't see what particular
significance that has for the failsafe mechanism now. The sooner we
can advance relfrozenxid when it's dangerously far in the past, the
better. It's true that the mechanism doesn't exploit every possible
opportunity to do so. But it mostly manages to do that.

> I think there's also a clear danger in having "cliffs" where the
> behaviour changes 

Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-14 Thread James Coleman
On Wed, Apr 14, 2021 at 5:42 PM James Coleman  wrote:
>
> On Mon, Apr 12, 2021 at 8:37 AM Tomas Vondra
>  wrote:
> >
> > On 4/12/21 2:24 PM, Luc Vlaming wrote:
> > > Hi,
> > >
> > > When trying to run on master (but afaik also PG-13) TPC-DS queries 94,
> > > 95 and 96 on a SF10 I get the error "could not find pathkey item to sort".
> > > When I disable enable_gathermerge the problem goes away and then the
> > > plan for query 94 looks like below. I tried figuring out what the
> > > problem is but to be honest I would need some pointers as the code that
> > > tries to match equivalence members in prepare_sort_from_pathkeys is
> > > something I'm really not familiar with.
> > >
> >
> > Could be related to incremental sort, which allowed some gather merge
> > paths that were impossible before. We had a couple issues related to
> > that fixed in November, IIRC.
> >
> > > To reproduce you can either ingest and test using the toolkit I used too
> > > (see https://github.com/swarm64/s64da-benchmark-toolkit/), or
> > > alternatively just use the schema (see
> > > https://github.com/swarm64/s64da-benchmark-toolkit/tree/master/benchmarks/tpcds/schemas/psql_native)
> > >
> >
> > Thanks, I'll see if I can reproduce that with your schema.
> >
> >
> > regards
> >
> > --
> > Tomas Vondra
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
>
> The query in question is:
>
> select  count(*)
> from store_sales
> ,household_demographics
> ,time_dim, store
> where ss_sold_time_sk = time_dim.t_time_sk
> and ss_hdemo_sk = household_demographics.hd_demo_sk
> and ss_store_sk = s_store_sk
> and time_dim.t_hour = 15
> and time_dim.t_minute >= 30
> and household_demographics.hd_dep_count = 7
> and store.s_store_name = 'ese'
> order by count(*)
> limit 100;
>
> From debugging output it looks like this is the plan being chosen
> (cheapest total path):
> Gather(store_sales household_demographics time_dim) rows=60626
> cost=3145.73..699910.15
> HashJoin(store_sales household_demographics time_dim)
> rows=25261 cost=2145.73..692847.55
>   clauses: store_sales.ss_hdemo_sk =
> household_demographics.hd_demo_sk
> HashJoin(store_sales time_dim) rows=252609
> cost=1989.73..692028.08
>   clauses: store_sales.ss_sold_time_sk =
> time_dim.t_time_sk
> SeqScan(store_sales) rows=11998564
> cost=0.00..658540.64
> SeqScan(time_dim) rows=1070
> cost=0.00..1976.35
> SeqScan(household_demographics) rows=720
> cost=0.00..147.00
>
> prepare_sort_from_pathkeys fails to find a pathkey because
> tlist_member_ignore_relabel returns null -- which seemed weird because
> the sortexpr is an Aggref (in a single member equivalence class) and
> the tlist contains a single member that's also an Aggref. It turns out
> that the only difference between the two Aggrefs is that the tlist
> entry has "aggsplit = AGGSPLIT_INITIAL_SERIAL" while the sortexpr has
> aggsplit = AGGSPLIT_SIMPLE.
>
> That's as far as I've gotten so far, but I figured I'd get that info
> out to see if it means anything obvious to anyone else.

This really goes back to [1] where we fixed a similar issue by making
find_em_expr_usable_for_sorting_rel parallel the rules in
prepare_sort_from_pathkeys.

Most of those conditions got copied, and the case we were trying to
handle is the fact that prepare_sort_from_pathkeys can generate a
target list entry under those conditions if one doesn't exist. However
there's a further restriction there I don't remember looking at: it
uses pull_var_clause and tlist_member_ignore_relabel to ensure that
all of the vars that feed into the sort expression are found in the
target list. As I understand it, that means it will build a target list
entry for something like "md5(column)" if "column" is already in the
target list (that was one of our test cases for the previous fix).

But there's an additional detail here: the call to pull_var_clause
requests aggregates, window functions, and placeholders be treated as
vars. That means for our Aggref case it would require that the two
Aggrefs be fully equal, so the differing aggsplit member would cause a
target list entry not to be built, hence our error here.
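
For reference, that restriction in prepare_sort_from_pathkeys amounts to
roughly the following (a paraphrase from memory rather than an exact
quote of createplan.c; the helper name here is made up):

/*
 * Paraphrase of the check: every Var-like node in the computed sort
 * expression -- with aggregates, window functions and placeholders
 * treated as Vars -- must already be present in the child tlist,
 * otherwise no new target list entry is built for the expression.
 */
static bool
sort_expr_inputs_available(Node *sortexpr, List *tlist)
{
    List       *exprvars;
    ListCell   *lc;
    bool        ok = true;

    exprvars = pull_var_clause(sortexpr,
                               PVC_INCLUDE_AGGREGATES |
                               PVC_INCLUDE_WINDOWFUNCS |
                               PVC_INCLUDE_PLACEHOLDERS);
    foreach(lc, exprvars)
    {
        if (!tlist_member_ignore_relabel((Expr *) lfirst(lc), tlist))
        {
            ok = false;         /* e.g. an Aggref with a different aggsplit */
            break;
        }
    }
    list_free(exprvars);
    return ok;
}

Because whole Aggrefs are compared, an Aggref that differs only in
aggsplit fails the tlist_member_ignore_relabel() test, which matches the
failure described above.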

I've attached a quick and dirty patch that encodes that final rule
from prepare_sort_from_pathkeys into
find_em_expr_usable_for_sorting_rel. I can't help but think that
there's a cleaner way to do this with less code duplication, but
hindering that is that prepare_sort_from_pathkeys is working with a
TargetList while find_em_expr_usable_for_sorting_rel is working with a
list of expressions.

James

1: 
https://www.postgresql.org/message-id/CAAaqYe9C3f6A_tZCRfr9Dm7hPpgGwpp4i-K_%3DNS9GWXuNiFANg%40mail.gmail.com



Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Andres Freund
Hi,

On 2021-04-14 14:55:36 -0700, Peter Geoghegan wrote:
> On Wed, Apr 14, 2021 at 12:33 PM Andres Freund  wrote:
> > I'm getting a bit bothered by the speed at which you're pushing fairly
> > substantial behavioural changes for vacuum. In this case without even a
> > warning that you're about to do so.
> 
> To a large degree the failsafe is something that is written in the
> hope that it will never be needed. This is unlike most other things,
> and has its own unique risks.

Among them that the code is not covered by tests and is unlikely to be
meaningfully exercised within the beta timeframe due to the timeframes
for hitting it (hard to actually hit below a 1/2 day running extreme
workloads, weeks for more realistic ones). Which means that this code
has to be extra vigorously reviewed, not the opposite.  Or at least
tests for it should be added (pg_resetwal + autovacuum_naptime=1s or
such should make that doable, or even just running a small test with
lower thresholds).


> I just went one step further than that in this recent commit. I didn't
> point these details out before now because (to me) this is beside the
> point. Which is that the failsafe is just that -- a failsafe. Anything
> that adds unnecessary unpredictable delay in reaching the point of
> advancing relfrozenxid should be avoided. (Besides, the design of
> should_attempt_truncation() and lazy_truncate_heap() is very far from
> guaranteeing that truncation will take place at the best of times.)

This line of argumentation scares me. Unexplained arguments about
running in conditions that we otherwise don't run in, during
exceptional circumstances. This code has a history of super subtle
interactions, with quite a few data loss causing bugs due to us not
forseeing some combination of circumstances.


I think there are good arguments for having logic for an "emergency
vacuum" mode (and also some good ones against). I'm not convinced that
the current set of things that are [not] skipped in failsafe mode is the
"obviously right set of things"™ but am convinced that there wasn't
enough consensus building on what that set of things is. This all also
would be different if it were the start of the development window,
rather than the end.


In my experience the big problem with vacuums in a wraparound situation
isn't actually things like truncation or even the index scans (although
they certainly can cause bad problems), but that VACUUM modifies
(prune/vacuum and WAL log or just setting hint bits) a crapton of pages
that don't actually need to be modified just to be able to get out of
the wraparound situation. And that the overhead of writing out all those
dirty pages + WAL logging causes the VACUUM to take unacceptably
long. E.g. because your storage is cloud storage with a few ms of
latency, and the ringbuffer + wal_buffer sizes cause so many synchronous
writes that you end up with < 10MB/s of data being processed.



I think there's also a clear danger in having "cliffs" where the
behaviour changes abruptly once a certain threshold is reached. It's
not unlikely for systems to fall over entirely when

a) autovacuum cost limiting is disabled. E.g. reaching your disk
   iops/throughput quota and barely being able to log into postgres
   anymore to kill the stuck connection causing the wraparound issue.

b) No index cleanup happens anymore. E.g. a workload with a lot of
   bitmap index scans (which do not support killtuples) could end up
   a lot worse because index pointers to dead tuples aren't being
   cleaned up. In cases where an old transaction or leftover replication
   slot is causing the problem (together a significant percentage of
   wraparound situations) this situation will persist across repeated
   (explicit or automatic) vacuums for a table, because relfrozenxid
   won't actually be advanced. And this in turn might actually end up
   slowing resolution of the wraparound issue more than doing all the
   index scans.

Because this is a hard cliff rather than something that phases in, it's
not really possible for a user to see this slowly getting worse and
address the issue. Especially for a), this could be addressed by not
turning off cost limiting all at once, but instead having it decrease
the closer you get to some limit.
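
To illustrate the idea of phasing the behaviour in, something along
these lines (purely illustrative, all names hypothetical, not code from
any patch in this thread):

/*
 * Illustration only: scale the vacuum cost delay down linearly as the
 * table's relfrozenxid age moves from a soft limit towards the hard
 * (failsafe) limit, instead of dropping it to zero at one threshold.
 */
static double
scaled_cost_delay(double base_delay, uint64 table_age,
                  uint64 soft_limit, uint64 hard_limit)
{
    double      frac;

    if (table_age <= soft_limit)
        return base_delay;      /* normal throttling */
    if (table_age >= hard_limit)
        return 0.0;             /* fully unthrottled */

    frac = (double) (table_age - soft_limit) /
        (double) (hard_limit - soft_limit);
    return base_delay * (1.0 - frac);
}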


Greetings,

Andres Freund




Re: psql - add SHOW_ALL_RESULTS option

2021-04-14 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Apr 12, 2021 at 03:33:01PM -0400, Alvaro Herrera wrote:
>> I, for one, would prefer to see the feature repaired in this cycle.

> If it is possible to get that fixed, I would not mind waiting a bit
> more but it would be nice to see some actual proposals.  There are
> already three identified bugs in psql introduced by this commit,
> including the query cancellation.

> That's a lot IMO, so my vote would be to discard this feature for now
> and revisit it properly in the 15 dev cycle, so that resources can be
> redirected to more urgent issues (13 open items as of the moment of
> writing this email).

I don't wish to tell people which open issues they ought to work on
... but this patch seems like it could be quite a large can of worms,
and I'm not detecting very much urgency about getting it fixed.
If it's not to be reverted then some significant effort needs to be
put into it soon.

regards, tom lane




Replacing pg_depend PIN entries with a fixed range check

2021-04-14 Thread Tom Lane
In [1] Andres and I speculated about whether we really need all
those PIN entries in pg_depend.  Here is a draft patch that gets
rid of them.

It turns out to be no big problem to replace the PIN entries
with an OID range check, because there's a well-defined point
in initdb where it wants to pin (almost) all existing objects,
and then no objects created after that are pinned.  In the earlier
thread I'd imagined having initdb record the OID counter at that
point in pg_control, and then we could look at the recorded counter
value to make is-it-pinned decisions.  However, that idea has a
fatal problem: what shall pg_resetwal fill into that field when
it has to gin up a pg_control file from scratch?  There's no
good way to reconstruct the value.

Hence, what this patch does is to establish a manually-managed cutoff
point akin to FirstBootstrapObjectId, and make initdb push the OID
counter up to that once it's made the small number of pinned objects
it's responsible for.  With the value I used here, a couple hundred
OIDs are wasted, but there seems to be a reasonable amount of headroom
still beyond that.  On my build, the OID counter at the end of initdb
is 15485 (with a reasonable number of glibc and ICU locales loaded).
So we still have about 900 free OIDs there; and there are 500 or so
free just below FirstBootstrapObjectId, too.  So this approach does
hasten the day when we're going to run out of free OIDs below 16384,
but not by all that much.

There are a couple of objects, namely template1 and the public
schema, that are in the catalog .dat files but are not supposed
to be pinned.  The existing code accomplishes that by excluding them
(in two different ways :-() while filling pg_depend.  This patch
just hard-wires exceptions for them in IsPinnedObject(), which seems
to me not much uglier than what we had before.  The existing code
also handles pinning of the standard tablespaces in an idiosyncratic
way; I just dropped that and made them be treated as pinned.
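
For readers without the patch handy, the test boils down to something
like this (a sketch of the idea rather than the patch's exact code;
FirstUnpinnedObjectId stands for the manually-managed cutoff, and the
two object OID constants stand for template1 and the public schema):

bool
IsPinnedObject(Oid classId, Oid objectId)
{
    /*
     * Objects with OIDs at or above the cutoff were created after initdb
     * finished making pinned objects, so they are never pinned.
     */
    if (objectId >= FirstUnpinnedObjectId)
        return false;

    /* Hand-made exceptions: template1 and the public schema. */
    if (classId == DatabaseRelationId && objectId == Template1DbOid)
        return false;
    if (classId == NamespaceRelationId && objectId == PG_PUBLIC_NAMESPACE)
        return false;

    return true;
}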

One interesting point about doing things this way is that
IsPinnedObject() will give correct answers throughout initdb, whereas
before the backend couldn't tell what was supposed to be pinned until
after initdb loaded pg_depend.  This means we don't need the hacky
truncation of pg_depend and pg_shdepend that initdb used to do,
because now the backend will correctly not make entries relating to
objects it now knows are pinned.  Aside from saving a few cycles,
this is more correct.  For example, if some object that initdb made
after bootstrap but before truncating pg_depend had a dependency on
the public schema, the existing coding would lose track of that fact.
(There's no live issue of that sort, I hasten to say, and really it
would be a bug to set things up that way because then you couldn't
drop the public schema.  But the existing coding would make things
worse by not detecting the mistake.)

Anyway, as to concrete results:

* pg_depend's total relation size, in a freshly made database,
drops from 1269760 bytes to 368640 bytes.

* There seems to be a small but noticeable reduction in the time
to run check-world.  I compared runtimes on a not-particularly-modern
machine with spinning-rust storage, using -j4 parallelism:

HEAD
real    5m4.248s
user    2m59.390s
sys     1m21.473s

+ patch
real    5m2.924s
user    2m36.196s
sys     1m19.724s

These top-line numbers don't look too impressive, but the CPU-time
reduction seems quite significant.  Probably on a different hardware
platform that would translate more directly to runtime savings.

I didn't try to reproduce the original performance bottleneck
that was complained of in [1], but that might be fun to check.

Anyway, I'll stick this in the next CF so we don't lose track
of it.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/947172.1617684433%40sss.pgh.pa.us#6a3d250a9c4a994cb3a26c87384fc823

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 2656786d1e..83a391eed1 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3249,8 +3249,7 @@ SCRAM-SHA-256$iteration count:
(references pg_class.oid)
   
   
-   The OID of the system catalog the dependent object is in,
-   or zero for a DEPENDENCY_PIN entry
+   The OID of the system catalog the dependent object is in
   
  
 
@@ -3260,8 +3259,7 @@ SCRAM-SHA-256$iteration count:
(references any OID column)
   
   
-   The OID of the specific dependent object,
-   or zero for a DEPENDENCY_PIN entry
+   The OID of the specific dependent object
   
  
 
@@ -3464,19 +3462,6 @@ SCRAM-SHA-256$iteration count:
   
  
 
-
-
- DEPENDENCY_PIN (p)
- 
-  
-   There is no dependent object; this type of entry is a signal
-   that the system itself depends on the referenced object, and so
-   that object must never be deleted.  Entries of this type are
-   created 

Re: Proposal: Save user's original authenticated identity for logging

2021-04-14 Thread Michael Paquier
On Tue, Apr 13, 2021 at 03:47:21PM +, Jacob Champion wrote:
> Looks like the farm has gone green, after some test fixups. Thanks for
> all the reviews!

You may want to follow this thread as well, as its topic is related to
what has been discussed here: there is an impact on a different code
path in the TAP tests, and not only on the connection
tests:
https://www.postgresql.org/message-id/yhajnhcmai3++...@paquier.xyz
--
Michael




Re: File truncation within PostgresNode::issues_sql_like() wrong on Windows

2021-04-14 Thread Andrew Dunstan


On 4/14/21 8:10 PM, Michael Paquier wrote:
> On Wed, Apr 14, 2021 at 05:10:41PM -0400, Andrew Dunstan wrote:
>> That seems rather heavy-handed. The buildfarm's approach is a bit
>> different. Essentially it seeks to the previous position of the log file
>> before reading contents. Here is its equivalent of slurp_file:
>>
>> use Fcntl qw(:seek);
>> sub file_lines
>> {
>>     my $filename = shift;
>>     my $filepos  = shift;
>>     my $handle;
>>     open($handle, '<', $filename) || croak "opening $filename: $!";
>>     seek($handle, $filepos, SEEK_SET) if $filepos;
>>     my @lines = <$handle>;
>>     close $handle;
>>     return @lines;
>> }
> It is a bit surprising that you can safely open a file handle with
> Perl like that without using Win32API::File; I would have assumed
> that this would conflict with the backend redirecting its output to
> stderr, the same way a truncation does on Windows.
>
>> A client wanting what's done in PostgresNode would do something like:
>>
>> my $logpos  = -s $logfile;
>> do_some_stuff();
>> my @lines = file_lines($logfile, $logpos);
>>
>> This has the benefit of working the same on all platforms, and no
>> truncation, rotation, or restart is required.
> Jacob has suggested something like that a couple of days ago, but all
> this code was not centralized yet in a single place.
>
> For this code, the cleanest approach would be to extend slurp_file()
> with an extra argument to seek the file before fetching its contents
> based on a location given by the caller?  Looking at the docs of
> Win32API::File, we'd need to use SetFilePointer() instead of seek().



Well, let me try it on fairywren tomorrow. Since we manage this on the
buildfarm without any use at all of Win32API::File it might not be
necessary in TAP code either, particularly if we're not truncating the file.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: psql - add SHOW_ALL_RESULTS option

2021-04-14 Thread Michael Paquier
On Mon, Apr 12, 2021 at 03:33:01PM -0400, Alvaro Herrera wrote:
> Please note that there's no "for now" about it -- if the patch is
> reverted, the only way to get it back is to wait for PG15.  That's
> undesirable.  A better approach is to collect all those bugs and get
> them fixed.  There's plenty of time to do that.
> 
> I, for one, would prefer to see the feature repaired in this cycle.

If it is possible to get that fixed, I would not mind waiting a bit
more but it would be nice to see some actual proposals.  There are
already three identified bugs in psql introduced by this commit,
including the query cancellation.

That's a lot IMO, so my vote would be to discard this feature for now
and revisit it properly in the 15 dev cycle, so that resources can be
redirected to more urgent issues (13 open items as of the moment of
writing this email).
--
Michael




Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-14 Thread James Coleman
On Wed, Apr 14, 2021 at 8:21 PM Robert Haas  wrote:
>
> On Wed, Apr 14, 2021 at 5:43 PM James Coleman  wrote:
> > The query in question is:
> > select  count(*)
> > from store_sales
> > ,household_demographics
> > ,time_dim, store
> > where ss_sold_time_sk = time_dim.t_time_sk
> > and ss_hdemo_sk = household_demographics.hd_demo_sk
> > and ss_store_sk = s_store_sk
> > and time_dim.t_hour = 15
> > and time_dim.t_minute >= 30
> > and household_demographics.hd_dep_count = 7
> > and store.s_store_name = 'ese'
> > order by count(*)
> > limit 100;
> >
> > From debugging output it looks like this is the plan being chosen
> > (cheapest total path):
> > Gather(store_sales household_demographics time_dim) rows=60626
> > cost=3145.73..699910.15
> > HashJoin(store_sales household_demographics time_dim)
> > rows=25261 cost=2145.73..692847.55
> >   clauses: store_sales.ss_hdemo_sk =
> > household_demographics.hd_demo_sk
> > HashJoin(store_sales time_dim) rows=252609
> > cost=1989.73..692028.08
> >   clauses: store_sales.ss_sold_time_sk =
> > time_dim.t_time_sk
> > SeqScan(store_sales) rows=11998564
> > cost=0.00..658540.64
> > SeqScan(time_dim) rows=1070
> > cost=0.00..1976.35
> > SeqScan(household_demographics) rows=720
> > cost=0.00..147.00
>
> This doesn't really make sense to me given the stack trace in the OP.
> That seems to go Limit -> Sort -> Agg -> NestLoop -> NestLoop ->
> NestLoop -> GatherMerge -> Sort. If the plan were as you have it here,
> there would be no Sort and no Gather Merge, so where would we be getting
> a failure related to pathkeys?

Also, I just realized why this didn't make sense -- I'm not sure what
the above path is. It got logged as part of the debug options I have
configured, but it must be either incomplete (perhaps from a lower
level of path generation) or not the final path selected.

My apologies for the confusion.

James




Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-14 Thread James Coleman
On Wed, Apr 14, 2021 at 8:21 PM Robert Haas  wrote:
>
> On Wed, Apr 14, 2021 at 5:43 PM James Coleman  wrote:
> > The query in question is:
> > select  count(*)
> > from store_sales
> > ,household_demographics
> > ,time_dim, store
> > where ss_sold_time_sk = time_dim.t_time_sk
> > and ss_hdemo_sk = household_demographics.hd_demo_sk
> > and ss_store_sk = s_store_sk
> > and time_dim.t_hour = 15
> > and time_dim.t_minute >= 30
> > and household_demographics.hd_dep_count = 7
> > and store.s_store_name = 'ese'
> > order by count(*)
> > limit 100;
> >
> > From debugging output it looks like this is the plan being chosen
> > (cheapest total path):
> > Gather(store_sales household_demographics time_dim) rows=60626
> > cost=3145.73..699910.15
> > HashJoin(store_sales household_demographics time_dim)
> > rows=25261 cost=2145.73..692847.55
> >   clauses: store_sales.ss_hdemo_sk =
> > household_demographics.hd_demo_sk
> > HashJoin(store_sales time_dim) rows=252609
> > cost=1989.73..692028.08
> >   clauses: store_sales.ss_sold_time_sk =
> > time_dim.t_time_sk
> > SeqScan(store_sales) rows=11998564
> > cost=0.00..658540.64
> > SeqScan(time_dim) rows=1070
> > cost=0.00..1976.35
> > SeqScan(household_demographics) rows=720
> > cost=0.00..147.00
>
> This doesn't really make sense to me given the stack trace in the OP.
> That seems to go Limit -> Sort -> Agg -> NestLoop -> NestLoop ->
> NestLoop -> GatherMerge -> Sort. If the plan were as you have it here,
> there would be no Sort and no Gather Merge, so where would we be getting
> a failure related to pathkeys?

Ah, yeah, I'm not sure where the original stacktrace came from, but
here's the stack for the query I reproduced it with (perhaps it does
so on different queries or there was some other GUC change in the
reported plan):

#0  errfinish (filename=filename@entry=0x56416eefa845 "createplan.c",
lineno=lineno@entry=6186,
funcname=funcname@entry=0x56416eefb660 <__func__.24872>
"prepare_sort_from_pathkeys") at elog.c:514
#1  0x56416eb6ed52 in prepare_sort_from_pathkeys
(lefttree=0x564170552658, pathkeys=0x5641704f2640, relids=0x0,
reqColIdx=reqColIdx@entry=0x0,
adjust_tlist_in_place=adjust_tlist_in_place@entry=false,
p_numsortkeys=p_numsortkeys@entry=0x7fff1252817c,
p_sortColIdx=0x7fff12528170,
p_sortOperators=0x7fff12528168, p_collations=0x7fff12528160,
p_nullsFirst=0x7fff12528158) at createplan.c:6186
#2  0x56416eb6ee69 in make_sort_from_pathkeys (lefttree=, pathkeys=, relids=) at
createplan.c:6313
#3  0x56416eb71fc7 in create_sort_plan
(root=root@entry=0x564170511a70,
best_path=best_path@entry=0x56417054f650, flags=flags@entry=1)
at createplan.c:2118
#4  0x56416eb6f638 in create_plan_recurse
(root=root@entry=0x564170511a70, best_path=0x56417054f650,
flags=flags@entry=1) at createplan.c:489
#5  0x56416eb72e06 in create_gather_merge_plan
(root=root@entry=0x564170511a70,
best_path=best_path@entry=0x56417054f6e8) at createplan.c:1885
#6  0x56416eb6f723 in create_plan_recurse
(root=root@entry=0x564170511a70, best_path=0x56417054f6e8,
flags=flags@entry=4) at createplan.c:541
#7  0x56416eb726fb in create_agg_plan
(root=root@entry=0x564170511a70,
best_path=best_path@entry=0x56417054f8c8) at createplan.c:2238
#8  0x56416eb6f67b in create_plan_recurse
(root=root@entry=0x564170511a70, best_path=0x56417054f8c8,
flags=flags@entry=3) at createplan.c:509
#9  0x56416eb71f8e in create_sort_plan
(root=root@entry=0x564170511a70,
best_path=best_path@entry=0x56417054f560, flags=flags@entry=1)
at createplan.c:2109
#10 0x56416eb6f638 in create_plan_recurse
(root=root@entry=0x564170511a70, best_path=0x56417054f560,
flags=flags@entry=1) at createplan.c:489
#11 0x56416eb72c83 in create_limit_plan
(root=root@entry=0x564170511a70,
best_path=best_path@entry=0x56417054ffa0, flags=flags@entry=1)
at createplan.c:2784
#12 0x56416eb6f713 in create_plan_recurse
(root=root@entry=0x564170511a70, best_path=0x56417054ffa0,
flags=flags@entry=1) at createplan.c:536
#13 0x56416eb6f79d in create_plan (root=root@entry=0x564170511a70,
best_path=) at createplan.c:349
#14 0x56416eb7fe93 in standard_planner (parse=0x564170437268,
query_string=, cursorOptions=2048,
boundParams=)
at planner.c:407

> I think if we can get the correct plan the thing to look at would be
> the tlists at the relevant levels of the plan.

Does the information in [1] help at all? The tlist does have an
Aggref, as expected, but its aggsplit value doesn't match the
pathkey's Aggref's aggsplit value.

James

1: 
https://www.postgresql.org/message-id/CAAaqYe_NU4hO9COoJdcXWqjtH%3DdGMknYdsSdJjZ%3DJOHPTea-Nw%40mail.gmail.com




Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Andres Freund
Hi,

On 2021-04-14 20:08:10 -0400, Robert Haas wrote:
> On Wed, Apr 14, 2021 at 5:55 PM Peter Geoghegan  wrote:
> > On Wed, Apr 14, 2021 at 12:33 PM Andres Freund  wrote:
> > > I'm getting a bit bothered by the speed at which you're pushing fairly
> > > substantial behavioural changes for vacuum. In this case without even
> > > a warning that you're about to do so.
> >
> > To a large degree the failsafe is something that is written in the
> > hope that it will never be needed. This is unlike most other things,
> > and has its own unique risks.
> >
> > I think that the proper thing to do is to accept a certain amount of
> > risk in this area. The previous status quo was *appalling*, and so it
> > seems very unlikely that the failsafe hasn't mostly eliminated a lot
> > of risk for users. That factor is not everything, but it should count
> > for a lot. The only way that we're going to have total confidence in
> > anything like this is through the experience of it mostly working over
> > several releases.
> 
> I think this is largely missing the point Andres was making, which is
> that you made a significant behavior change after feature freeze
> without any real opportunity for discussion. More generally, you've
> changed a bunch of other stuff relatively quickly based on input from
> a relatively limited number of people. Now, it's fair to say that it's
> often hard to get input on things, and sometimes you have to just take
> your best shot and hope you're right. But in this particular case, you
> didn't even try to get broader participation or buy-in. That's not
> good.

Yep, that was what I was trying to get at.

- Andres




Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 8:20 PM James Coleman  wrote:
> > Hmm, could be. Although, the stack trace at issue doesn't seem to show
> > a call to create_incrementalsort_plan().
>
> The changes to gather merge path generation made it possible to use
> those paths in more cases for both incremental sort and regular sort,
> so by "incremental sort" I read Tomas as saying "the patches that
> brought in incremental sort" not specifically "incremental sort
> itself".

I agree. That's why I said "hmm, could be" even though the plan
doesn't involve one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 5:43 PM James Coleman  wrote:
> The query in question is:
> select  count(*)
> from store_sales
> ,household_demographics
> ,time_dim, store
> where ss_sold_time_sk = time_dim.t_time_sk
> and ss_hdemo_sk = household_demographics.hd_demo_sk
> and ss_store_sk = s_store_sk
> and time_dim.t_hour = 15
> and time_dim.t_minute >= 30
> and household_demographics.hd_dep_count = 7
> and store.s_store_name = 'ese'
> order by count(*)
> limit 100;
>
> From debugging output it looks like this is the plan being chosen
> (cheapest total path):
> Gather(store_sales household_demographics time_dim) rows=60626
> cost=3145.73..699910.15
> HashJoin(store_sales household_demographics time_dim)
> rows=25261 cost=2145.73..692847.55
>   clauses: store_sales.ss_hdemo_sk =
> household_demographics.hd_demo_sk
> HashJoin(store_sales time_dim) rows=252609
> cost=1989.73..692028.08
>   clauses: store_sales.ss_sold_time_sk =
> time_dim.t_time_sk
> SeqScan(store_sales) rows=11998564
> cost=0.00..658540.64
> SeqScan(time_dim) rows=1070
> cost=0.00..1976.35
> SeqScan(household_demographics) rows=720
> cost=0.00..147.00

This doesn't really make sense to me given the stack trace in the OP.
That seems to go Limit -> Sort -> Agg -> NestLoop -> NestLoop ->
NestLoop -> GatherMerge -> Sort. If the plan were as you have it here,
there would be no Sort and no Gather Merge, so where would we be getting
a failure related to pathkeys?

I think if we can get the correct plan the thing to look at would be
the tlists at the relevant levels of the plan.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Peter Geoghegan
On Wed, Apr 14, 2021 at 5:08 PM Robert Haas  wrote:
> I think this is largely missing the point Andres was making, which is
> that you made a significant behavior change after feature freeze
> without any real opportunity for discussion.

I don't believe that it was a significant behavior change, for the
reason I gave: the fact of the matter is that it's practically
impossible for us to truncate the heap anyway, provided we have
already decided to not vacuum (as opposed to prune) heap pages that
almost certainly have some LP_DEAD items in them. Note that later heap
pages are the most likely to still have some LP_DEAD items once the
failsafe triggers, which are precisely the ones that will affect
whether or not we can truncate the whole heap.

I accept that I could have done better with the messaging. I'll try to
avoid repeating that mistake in the future.

> More generally, you've
> changed a bunch of other stuff relatively quickly based on input from
> a relatively limited number of people. Now, it's fair to say that it's
> often hard to get input on things, and sometimes you have to just take
> your best shot and hope you're right.

I agree in general, and I agree that that's what I've done in this
instance. It goes without saying, but I'll say it anyway: I accept
full responsibility.

> But in this particular case, you
> didn't even try to get broader participation or buy-in. That's not
> good.

I will admit to being somewhat burned out by this project. That might
have been a factor.

-- 
Peter Geoghegan




Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-14 Thread James Coleman
On Wed, Apr 14, 2021 at 8:16 PM Robert Haas  wrote:
>
> On Mon, Apr 12, 2021 at 8:37 AM Tomas Vondra
>  wrote:
> > Could be related to incremental sort, which allowed some gather merge
> > paths that were impossible before. We had a couple issues related to
> > that fixed in November, IIRC.
>
> Hmm, could be. Although, the stack trace at issue doesn't seem to show
> a call to create_incrementalsort_plan().

The changes to gather merge path generation made it possible to use
those paths in more cases for both incremental sort and regular sort,
so by "incremental sort" I read Tomas as saying "the patches that
brought in incremental sort" not specifically "incremental sort
itself".

James




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Vik Fearing
On 4/15/21 12:18 AM, Mark Dilger wrote:
> 
> 
>> On Apr 14, 2021, at 2:47 PM, Vik Fearing  wrote:
>>
>> On 4/14/21 7:36 PM, Tom Lane wrote:
>>> Mark Dilger  writes:
> On Apr 13, 2021, at 3:26 PM, Tom Lane  wrote:
> However I think we may still need an assumption that earthdistance
> and cube are in the same schema --- any comments on that?
>>>
 This is probably not worth doing, and we are already past feature
 freeze, but adding syntax to look up the namespace of an extension might
 help.
>>>
>>> Yeah, that idea was discussed before (perhaps only in private
>>> security-team threads, though).  We didn't do anything about it because
>>> at the time there didn't seem to be pressing need, but in the context
>>> of SQL function bodies there's an obvious use-case.
>>>
 We could get something like this working just inside the CREATE EXTENSION 
 command if we expanded on the @extschema@ idea a bit.  At first I thought 
 this idea would suffer race conditions with concurrent modifications of 
 pg_extension or pg_namespace, but it looks like we already have a snapshot 
 when processing the script file, so:
>>>
 -CREATE DOMAIN earth AS cube
 +CREATE DOMAIN @@earthdistance@@::earth AS @@cube@@::cube
>>>
>>> Right, extending the @extschema@ mechanism is what was discussed,
>>> though I think I'd lean towards something like @extschema:cube@
>>> to denote the schema of a referenced extension "cube".
>>>
>>> I'm not sure this is useful enough to break feature freeze for,
>>> but I'm +1 for investigating it for v15.
>> Just like we have a pseudo "$user" schema, could we have a pseudo
>> "$extension" catalog?  That should avoid changing grammar rules too much.
>>
>> CREATE TABLE unaccented_words (
>>word "$extension".citext.citext,
>>CHECK (word = "$extension".unaccent.unaccent(word)
>> );
> 
> Having a single variable $extension might help in many cases, but I don't see 
> how to use it to handle the remaining cross-extension references, such as 
> earthdistance needing to reference cube.


Sorry, I hadn't realized that was a real example so I made up my own.

Basically my idea is to use the fully qualified catalog.schema.object
syntax where the catalog is a special "$extension" value (meaning we
would have to forbid that as an actual database name) and the schema is
the name of the extension whose schema we want.  The object is then just
the object.


CREATE DOMAIN earth AS "$extension".cube.cube
  CONSTRAINT not_point check("$extension".cube.cube_is_point(value))
  CONSTRAINT not_3d check("$extension".cube.cube_dim(value) <= 3)
  ...;


CREATE FUNCTION earth_box(earth, float8)
 RETURNS "$extension".cube.cube
 LANGUAGE sql
 IMMUTABLE PARALLEL SAFE STRICT
RETURN "$extension".cube.cube_enlarge($1, gc_to_sec($2), 3);


If I had my druthers, we would spell it pg_extension instead of
"$extension" because I hate double-quoting identifiers, but that's just
bikeshedding and has little to do with the concept itself.
-- 
Vik Fearing




Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-14 Thread Robert Haas
On Mon, Apr 12, 2021 at 8:37 AM Tomas Vondra
 wrote:
> Could be related to incremental sort, which allowed some gather merge
> paths that were impossible before. We had a couple issues related to
> that fixed in November, IIRC.

Hmm, could be. Although, the stack trace at issue doesn't seem to show
a call to create_incrementalsort_plan().

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: File truncation within PostgresNode::issues_sql_like() wrong on Windows

2021-04-14 Thread Michael Paquier
On Wed, Apr 14, 2021 at 05:10:41PM -0400, Andrew Dunstan wrote:
> That seems rather heavy-handed. The buildfarm's approach is a bit
> different. Essentially it seeks to the previous position of the log file
> before reading contents. Here is its equivalent of slurp_file:
> 
> use Fcntl qw(:seek);
> sub file_lines
> {
>     my $filename = shift;
>     my $filepos  = shift;
>     my $handle;
>     open($handle, '<', $filename) || croak "opening $filename: $!";
>     seek($handle, $filepos, SEEK_SET) if $filepos;
>     my @lines = <$handle>;
>     close $handle;
>     return @lines;
> }

It is a bit surprising that you can safely open a file handle with
Perl like that without using Win32API::File; I would have assumed that
this would conflict with the backend redirecting its output to stderr,
the same way a truncation does on Windows.

> A client wanting what's done in PostgresNode would do something like:
> 
> my $logpos  = -s $logfile;
> do_some_stuff();
> my @lines = file_lines($logfile, $logpos);
> 
> This has the benefit of working the same on all platforms, and no
> truncation, rotation, or restart is required.

Jacob suggested something like that a couple of days ago, but at the
time this code was not yet centralized in a single place.

For this code, would the cleanest approach be to extend slurp_file()
with an extra argument so that it seeks to a caller-supplied location
before fetching the contents?  Looking at the docs of
Win32API::File, we'd need to use SetFilePointer() instead of seek().
--
Michael




Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 5:55 PM Peter Geoghegan  wrote:
> On Wed, Apr 14, 2021 at 12:33 PM Andres Freund  wrote:
> > I'm getting a bit bothered by the speed at which you're pushing fairly
> > substantial behavioural changes for vacuum. In this case without even a warning
> > that you're about to do so.
>
> To a large degree the failsafe is something that is written in the
> hope that it will never be needed. This is unlike most other things,
> and has its own unique risks.
>
> I think that the proper thing to do is to accept a certain amount of
> risk in this area. The previous status quo was *appalling*, and so it
> seems very unlikely that the failsafe hasn't mostly eliminated a lot
> of risk for users. That factor is not everything, but it should count
> for a lot. The only way that we're going to have total confidence in
> anything like this is through the experience of it mostly working over
> several releases.

I think this is largely missing the point Andres was making, which is
that you made a significant behavior change after feature freeze
without any real opportunity for discussion. More generally, you've
changed a bunch of other stuff relatively quickly based on input from
a relatively limited number of people. Now, it's fair to say that it's
often hard to get input on things, and sometimes you have to just take
your best shot and hope you're right. But in this particular case, you
didn't even try to get broader participation or buy-in. That's not
good.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Can a child process detect postmaster death when in pg_usleep?

2021-04-14 Thread Thomas Munro
Hi Bharath,

On Thu, Apr 15, 2021 at 2:06 AM Bharath Rupireddy
 wrote:
> 1) Is it really harmful to use pg_usleep in a postmaster child process
> as it doesn't let the child process detect postmaster death?

Yeah, that's a bad idea.  Any long-term waiting (including short waits
in a loop) should ideally be done with the latch infrastructure.

One interesting and unusual case is recovery: it can run for a very
long time without reaching a waiting primitive of any kind (other than
LWLock, which doesn't count), because it can be busy applying records
for hours at a time.  In that case, we take special measures and
explicitly check if the postmaster is dead in the redo loop.  In
theory, you could do the same in any loop containing pg_usleep() (we
used to have several loops doing that, especially around replication
code), but it'd be better to use the existing wait-event-multiplexing
technology we have, and keep improving that.

Some people have argued that long running queries should *also* react
faster when the PM exits, a bit like recovery ... which leads to the
next point...

> 2) Can pg_usleep() always detect signals? I see the caution in the
> pg_usleep function definition in pgsleep.c, saying the signal handling
> is platform dependent. We have code blocks like below in the code. Do
> we actually process interrupts before going to sleep with pg_usleep()?
> while/for loop
> {
> ..
> ..
>  CHECK_FOR_INTERRUPTS();
>  pg_usleep();
> }
> and
> if (PostAuthDelay)
>pg_usleep();

CHECK_FOR_INTERRUPTS() currently has nothing to do with postmaster
death detection, so that'd be for dealing with interrupts, not for
that.  Also, there would be a race: a signal on its own isn't enough,
even on systems where we have one and where select() is guaranteed to
wake up, because the signal might arrive between CFI() and
pg_usleep(100 years).  latch.c knows how to avoid such problems.

There may be an argument that CFI() *should* be a potential
postmaster-death-exit point, instead of having WaitLatch() (or its
caller) handle it directly, but it's complicated.  At the time the
postmaster pipe system was invented we didn't have a signal for this,
so it wasn't even a candidate for treatment as an "interrupt".  On
systems that have postmaster death signals today (Linux + FreeBSD, but
I suspect we can extend this to every Unix we support, see CF #3066,
and a solution for Windows has been mentioned too), clearly the signal
handler could set a new interrupt flag PostmasterLost +
InterruptPending, and then CHECK_FOR_INTERRUPTS() could see it and
exit.  The argument against this is that exiting isn't always the
right thing!  In a couple of places, we do something special, such as
printing a special error message (examples: sync rep and the main FEBE
client read).  Look for WL_POSTMASTER_DEATH (as opposed to
WL_EXIT_ON_PM_DEATH).  So I guess you'd need to reverse those
decisions and standardise on "exit immediately, no message", or
invent a way to suppress that behaviour in code regions.

> 3) Is it intentional to use pg_usleep in some places in the code? If
> yes, what are they? At least, I see one place where it's intentional
> in the wait_pid function which is used while running the regression
> tests.

There are plenty of places that do a short sleep for various reasons,
more like a deliberate stall or backoff or auth thing, and it's
probably OK if they're shortish and not really a condition polling
loop with an obvious latch/CV-based replacement.  Note also that
LWLock waits are similar.

> 4) Are there any places where we need to replace pg_usleep with
> WaitLatch/equivalent of pg_sleep to detect the postmaster death
> properly?

We definitely have replaced a lot of sleeps with latch.c primitives
over the past few years, since we got WL_EXIT_ON_PM_DEATH and
condition variables.  There may be many more to improve...  You
mentioned autovacuum: yeah, Stephen fixed one of those with commit
4753ef37, but it's not great to still have the others in there...




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Mark Dilger



> On Apr 14, 2021, at 2:47 PM, Vik Fearing  wrote:
> 
> On 4/14/21 7:36 PM, Tom Lane wrote:
>> Mark Dilger  writes:
 On Apr 13, 2021, at 3:26 PM, Tom Lane  wrote:
 However I think we may still need an assumption that earthdistance
 and cube are in the same schema --- any comments on that?
>> 
>>> This is probably not worth doing, and we are already past feature
>>> freeze, but adding syntax to look up the namespace of an extension might
>>> help.
>> 
>> Yeah, that idea was discussed before (perhaps only in private
>> security-team threads, though).  We didn't do anything about it because
>> at the time there didn't seem to be pressing need, but in the context
>> of SQL function bodies there's an obvious use-case.
>> 
>>> We could get something like this working just inside the CREATE EXTENSION 
>>> command if we expanded on the @extschema@ idea a bit.  At first I thought 
>>> this idea would suffer race conditions with concurrent modifications of 
>>> pg_extension or pg_namespace, but it looks like we already have a snapshot 
>>> when processing the script file, so:
>> 
>>> -CREATE DOMAIN earth AS cube
>>> +CREATE DOMAIN @@earthdistance@@::earth AS @@cube@@::cube
>> 
>> Right, extending the @extschema@ mechanism is what was discussed,
>> though I think I'd lean towards something like @extschema:cube@
>> to denote the schema of a referenced extension "cube".
>> 
>> I'm not sure this is useful enough to break feature freeze for,
>> but I'm +1 for investigating it for v15.
> Just like we have a pseudo "$user" schema, could we have a pseudo
> "$extension" catalog?  That should avoid changing grammar rules too much.
> 
> CREATE TABLE unaccented_words (
>word "$extension".citext.citext,
>CHECK (word = "$extension".unaccent.unaccent(word))
> );

Having a single variable $extension might help in many cases, but I don't see 
how to use it to handle the remaining cross-extension references, such as 
earthdistance needing to reference cube.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Peter Geoghegan
On Wed, Apr 14, 2021 at 12:33 PM Andres Freund  wrote:
> I'm getting a bit bothered by the speed at which you're pushing fairly
> substantial behavioural changes for vacuum. In this case without even a warning
> that you're about to do so.

To a large degree the failsafe is something that is written in the
hope that it will never be needed. This is unlike most other things,
and has its own unique risks.

I think that the proper thing to do is to accept a certain amount of
risk in this area. The previous status quo was *appalling*, and so it
seems very unlikely that the failsafe hasn't mostly eliminated a lot
of risk for users. That factor is not everything, but it should count
for a lot. The only way that we're going to have total confidence in
anything like this is through the experience of it mostly working over
several releases.

> I don't think it's that blindingly obvious that skipping truncation is
> the right thing to do that it doesn't need review. Consider e.g. the
> case that you're close to wraparound because you ran out of space for
> the amount of WAL VACUUM produces, previously leading to autovacuums
> being aborted / the server restarted. The user might then stop regular
> activity and try to VACUUM. Skipping the truncation might now make it
> harder to actually vacuum all the tables without running out of space.

Note that the criterion for whether or not "hastup=false" applies to a page
is slightly different in lazy_scan_prune() -- I added a comment that
points this out directly (the fact that it works that way is not new,
and might have originally been a happy mistake). Unlike
count_nondeletable_pages(), which is used by heap truncation,
lazy_scan_prune() is concerned about whether or not it's *likely to be
possible* to truncate away the page by the time lazy_truncate_heap()
is reached (if it is reached at all). And so it's optimistic about
LP_DEAD items that it observes being removed by
lazy_vacuum_heap_page() before we get to lazy_truncate_heap(). It's
inherently race-prone anyway, so it might as well assume that LP_DEAD
items will eventually become LP_UNUSED items later on.

It follows that the chances of lazy_truncate_heap() failing to
truncate anything when the failsafe has already triggered are
exceptionally high -- all the LP_DEAD items are still there, and
cannot be safely removed during truncation (for the usual reasons). I
just went one step further than that in this recent commit. I didn't
point these details out before now because (to me) this is beside the
point. Which is that the failsafe is just that -- a failsafe. Anything
that adds unnecessary unpredictable delay in reaching the point of
advancing relfrozenxid should be avoided. (Besides, the design of
should_attempt_truncation() and lazy_truncate_heap() is very far from
guaranteeing that truncation will take place at the best of times.)

FWIW, my intention is to try to get as much feedback about the
failsafe as I possibly can -- it's hard to reason about exceptional
events. I'm also happy to further discuss the specifics with you now.

-- 
Peter Geoghegan




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Vik Fearing
On 4/14/21 7:36 PM, Tom Lane wrote:
> Mark Dilger  writes:
>>> On Apr 13, 2021, at 3:26 PM, Tom Lane  wrote:
>>> However I think we may still need an assumption that earthdistance
>>> and cube are in the same schema --- any comments on that?
> 
>> This is probably not worth doing, and we are already past feature
>> freeze, but adding syntax to look up the namespace of an extension might
>> help.
> 
> Yeah, that idea was discussed before (perhaps only in private
> security-team threads, though).  We didn't do anything about it because
> at the time there didn't seem to be pressing need, but in the context
> of SQL function bodies there's an obvious use-case.
> 
>> We could get something like this working just inside the CREATE EXTENSION 
>> command if we expanded on the @extschema@ idea a bit.  At first I thought 
>> this idea would suffer race conditions with concurrent modifications of 
>> pg_extension or pg_namespace, but it looks like we already have a snapshot 
>> when processing the script file, so:
> 
>> -CREATE DOMAIN earth AS cube
>> +CREATE DOMAIN @@earthdistance@@::earth AS @@cube@@::cube
> 
> Right, extending the @extschema@ mechanism is what was discussed,
> though I think I'd lean towards something like @extschema:cube@
> to denote the schema of a referenced extension "cube".
> 
> I'm not sure this is useful enough to break feature freeze for,
> but I'm +1 for investigating it for v15.
Just like we have a pseudo "$user" schema, could we have a pseudo
"$extension" catalog?  That should avoid changing grammar rules too much.

CREATE TABLE unaccented_words (
word "$extension".citext.citext,
CHECK (word = "$extension".unaccent.unaccent(word))
);

-- 
Vik Fearing




Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-14 Thread James Coleman
On Mon, Apr 12, 2021 at 8:37 AM Tomas Vondra
 wrote:
>
> On 4/12/21 2:24 PM, Luc Vlaming wrote:
> > Hi,
> >
> > When trying to run on master (but afaik also PG-13) TPC-DS queries 94,
> > 95 and 96 on a SF10 I get the error "could not find pathkey item to sort".
> > When I disable enable_gathermerge the problem goes away and then the
> > plan for query 94 looks like below. I tried figuring out what the
> > problem is but to be honest I would need some pointers as the code that
> > tries to matching equivalence members in prepare_sort_from_pathkeys is
> > something i'm really not familiar with.
> >
>
> Could be related to incremental sort, which allowed some gather merge
> paths that were impossible before. We had a couple issues related to
> that fixed in November, IIRC.
>
> > To reproduce you can either ingest and test using the toolkit I used too
> > (see https://github.com/swarm64/s64da-benchmark-toolkit/), or
> > alternatively just use the schema (see
> > https://github.com/swarm64/s64da-benchmark-toolkit/tree/master/benchmarks/tpcds/schemas/psql_native)
> >
>
> Thanks, I'll see if I can reproduce that with your schema.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

The query in question is:

select  count(*)
from store_sales
,household_demographics
,time_dim, store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 15
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 7
and store.s_store_name = 'ese'
order by count(*)
limit 100;

From debugging output it looks like this is the plan being chosen
(cheapest total path):
Gather(store_sales household_demographics time_dim) rows=60626 cost=3145.73..699910.15
  HashJoin(store_sales household_demographics time_dim) rows=25261 cost=2145.73..692847.55
    clauses: store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
    HashJoin(store_sales time_dim) rows=252609 cost=1989.73..692028.08
      clauses: store_sales.ss_sold_time_sk = time_dim.t_time_sk
      SeqScan(store_sales) rows=11998564 cost=0.00..658540.64
      SeqScan(time_dim) rows=1070 cost=0.00..1976.35
    SeqScan(household_demographics) rows=720 cost=0.00..147.00

prepare_sort_from_pathkeys fails to find a pathkey because
tlist_member_ignore_relabel returns null -- which seemed weird because
the sortexpr is an Aggref (in a single member equivalence class) and
the tlist contains a single member that's also an Aggref. It turns out
that the only difference between the two Aggrefs is that the tlist
entry has "aggsplit = AGGSPLIT_INITIAL_SERIAL" while the sortexpr has
aggsplit = AGGSPLIT_SIMPLE.

That's as far as I've gotten so far, but I figured I'd get that info
out to see if it means anything obvious to anyone else.

James




Re: File truncation within PostgresNode::issues_sql_like() wrong on Windows

2021-04-14 Thread Andrew Dunstan


On 4/14/21 4:13 AM, Michael Paquier wrote:
> Hi all,
>
> As fairywren has proved a couple of days ago, it is not really a good
> idea to rely on a file truncation to check for patterns in the logs of
> the backend:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2021-04-07%2013%3A29%3A28
>
> Visibly, a logic based on the log file truncation fails on Windows
> because of the concurrent access of the backend that outputs its logs
> there.  In PostgresNode.pm, connect_ok() and connect_fails() enforce
> a rotation of the log file before restarting the server on Windows to
> make sure that a given step does not find logs generated by a previous
> test, but that's not the case of issues_sql_like().  Looking at the
> existing tests using this routine (src/bin/scripts/), I have found one
> test in 090_reindexdb.pl that could lead to a false positive.  The
> test is marked in the patch attached, just for awareness.
>
> Would there be any objections to change this routine so as we avoid
> the file truncation on Windows?  The patch attached achieves that.
>
> Any thoughts?


That seems rather heavy-handed. The buildfarm's approach is a bit
different. Essentially it seeks to the previous position of the log file
before reading contents. Here is its equivalent of slurp_file:


use Fcntl qw(:seek);
sub file_lines
{
    my $filename = shift;
    my $filepos  = shift;
    my $handle;
    open($handle, '<', $filename) || croak "opening $filename: $!";
    seek($handle, $filepos, SEEK_SET) if $filepos;
    my @lines = <$handle>;
    close $handle;
    return @lines;
}



A client wanting what's done in PostgresNode would do something like:


my $logpos  = -s $logfile;
do_some_stuff();
my @lines = file_lines($logfile, $logpos);


This has the benefit of working the same on all platforms, and no
truncation, rotation, or restart is required.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Possible typo/unclear comment in joinpath.c

2021-04-14 Thread James Coleman
On Wed, Apr 14, 2021 at 1:27 PM Tom Lane  wrote:
>
> Justin Pryzby  writes:
> > On Wed, Apr 14, 2021 at 11:36:38AM -0400, James Coleman wrote:
> >> In joinpath.c three times we reference "extra_lateral_rels" (with
> >> underscores like it's a field), but as far as I can tell that's not a
> >> field anywhere in the source code, and looking at the code that
> >> follows it seems like it should be referencing "lateral_relids" (and
> >> the "extra" is really "extra [in relation to relids]").
>
> > It looks like a loose end from
>
> > commit edca44b1525b3d591263d032dc4fe500ea771e0e
> > Author: Tom Lane 
> > Date:   Mon Dec 7 18:56:14 2015 -0500
>
> > Simplify LATERAL-related calculations within add_paths_to_joinrel().
>
> Yeah :-(.  I'm usually pretty careful about grepping for comment
> references as well as code references to a field when I do something
> like that, but obviously I missed that step that time.
>
> Will fix, thanks James!
>
> regards, tom lane

Thanks for fixing, Tom!

James




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Tom Lane
Andrew Dunstan  writes:
> On 4/14/21 2:03 PM, Tom Lane wrote:
>> This may mean that squeezing these contrib changes into v14 is a lost
>> cause.  We certainly shouldn't try to do what I suggest above for
>> v14; but without it, these changes are just moving the security
>> issue to a different place rather than eradicating it completely.

> Is there anything else we should be doing along the eat-your-own-dogfood
> line that doesn't have these security implications?

We can still convert the initdb-created SQL functions to new style,
since there's no security threat during initdb.  I'll make a patch
for that soon.
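
For reference, a minimal illustration of the two styles in question
(both forms are valid in v14; the function names here are made up):

-- Old style: the body is stored as a string and parsed at call time,
-- so object references are resolved against the caller's search_path.
CREATE FUNCTION add_one(int) RETURNS int
    LANGUAGE sql AS 'SELECT $1 + 1';

-- New style (SQL-standard body): parsed at creation time, with
-- dependencies recorded, so later search_path changes cannot redirect
-- the references.
CREATE FUNCTION add_one_newstyle(int) RETURNS int
    LANGUAGE sql RETURN $1 + 1;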

regards, tom lane




Re: Possible typo/unclear comment in joinpath.c

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 2:32 PM Tom Lane  wrote:
> No, I take that back.  There were no references to extra_lateral_rels
> after that commit; these comments were added by 45be99f8c, about
> six weeks later.  The latter was a pretty large patch and had
> presumably been under development for quite some time, so the comments
> were probably accurate when written but didn't get updated.

Woops.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New IndexAM API controlling index vacuum strategies

2021-04-14 Thread Andres Freund
Hi,

On 2021-04-13 12:59:03 -0700, Peter Geoghegan wrote:
> I agree. Bypassing heap truncation is exactly the kind of thing that
> risks adding significant, unpredictable delay at a time when we need
> to advance relfrozenxid as quickly as possible.
> 
> I pushed a trivial commit that makes the failsafe bypass heap
> truncation as well just now.

I'm getting a bit bothered by the speed at which you're pushing fairly
substantial behavioural changes for vacuum. In this case without even a warning
that you're about to do so.

I don't think it's that blindingly obvious that skipping truncation is
the right thing to do that it doesn't need review. Consider e.g. the
case that you're close to wraparound because you ran out of space for
the amount of WAL VACUUM produces, previously leading to autovacuums
being aborted / the server restarted. The user might then stop regular
activity and try to VACUUM. Skipping the truncation might now make it
harder to actually vacuum all the tables without running out of space.

FWIW, I also don't like that substantial behaviour changes to how vacuum
works were discussed only in a thread titled "New IndexAM API
controlling index vacuum strategies".

Greetings,

Andres Freund




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Andrew Dunstan


On 4/14/21 2:03 PM, Tom Lane wrote:
> Robert Haas  writes:
>> On Wed, Apr 14, 2021 at 1:41 PM Tom Lane  wrote:
>>> Could we hack things so that extension scripts are only allowed to
>>> reference objects created (a) by the system, (b) earlier in the
>>> same script, or (c) owned by one of the declared prerequisite
>>> extensions?  Seems like that might provide a pretty bulletproof
>>> defense against trojan-horse objects, though I'm not sure how much
>>> of a pain it'd be to implement.
>> That doesn't seem like a crazy idea, but the previous idea of having
>> some magic syntax that means "the schema where extension FOO is" seems
>> like it might be easier to implement and more generally useful.
> I think that's definitely useful, but it's not a fix for the
> reference-capture problem unless you care to assume that the other
> extension's schema is free of trojan-horse objects.  So I'm thinking
> that we really ought to pursue both ideas.
>
> This may mean that squeezing these contrib changes into v14 is a lost
> cause.  We certainly shouldn't try to do what I suggest above for
> v14; but without it, these changes are just moving the security
> issue to a different place rather than eradicating it completely.
>
>   



Is there anything else we should be doing along the eat-your-own-dogfood
line that doesn't have these security implications?


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-14 Thread Bruce Momjian
On Tue, Apr 13, 2021 at 01:30:16PM -0400, Álvaro Herrera wrote:
> On 2021-Apr-12, Bruce Momjian wrote:
> 
> > OK, the attached patch renames pg_stat_activity.queryid to 'query_id'. I
> > have not changed any of the APIs which existed before this feature was
> > added, and are called "queryid" or "queryId" --- it is kind of a mess. 
> > I assume I should leave those unchanged.  It will also need a catversion
> > bump.
> 
> I think it is fine actually.  These names appear in structs Query and
> PlannedStmt, and every single member of those already uses camelCase
> naming.  Changing those to use "query_id" would look out of place.
> You did change the one in PgBackendStatus to st_query_id, which also
> matches the naming style in that struct, so that looks fine also.
> 
> So I'm -1 on Julien's first proposed change, and +1 on his second
> proposed change (the name of the first argument of
> pgstat_report_query_id should be query_id).

Thanks for your analysis.  Updated patch attached with the change
suggested above.
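
With the rename applied, the column would be referenced as query_id,
for example:

SELECT pid, query_id, left(query, 40) FROM pg_stat_activity;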

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



query_id.diff.gz
Description: application/gzip


Re: Possible typo/unclear comment in joinpath.c

2021-04-14 Thread Tom Lane
I wrote:
> Justin Pryzby  writes:
>> It looks like a loose end from 
>> commit edca44b1525b3d591263d032dc4fe500ea771e0e

> Yeah :-(.  I'm usually pretty careful about grepping for comment
> references as well as code references to a field when I do something
> like that, but obviously I missed that step that time.

No, I take that back.  There were no references to extra_lateral_rels
after that commit; these comments were added by 45be99f8c, about
six weeks later.  The latter was a pretty large patch and had
presumably been under development for quite some time, so the comments
were probably accurate when written but didn't get updated.

regards, tom lane




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Tom Lane
Robert Haas  writes:
> On Wed, Apr 14, 2021 at 1:41 PM Tom Lane  wrote:
>> Could we hack things so that extension scripts are only allowed to
>> reference objects created (a) by the system, (b) earlier in the
>> same script, or (c) owned by one of the declared prerequisite
>> extensions?  Seems like that might provide a pretty bulletproof
>> defense against trojan-horse objects, though I'm not sure how much
>> of a pain it'd be to implement.

> That doesn't seem like a crazy idea, but the previous idea of having
> some magic syntax that means "the schema where extension FOO is" seems
> like it might be easier to implement and more generally useful.

I think that's definitely useful, but it's not a fix for the
reference-capture problem unless you care to assume that the other
extension's schema is free of trojan-horse objects.  So I'm thinking
that we really ought to pursue both ideas.

This may mean that squeezing these contrib changes into v14 is a lost
cause.  We certainly shouldn't try to do what I suggest above for
v14; but without it, these changes are just moving the security
issue to a different place rather than eradicating it completely.

regards, tom lane




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 1:41 PM Tom Lane  wrote:
> Doesn't help that much, because you still have to reference objects
> already created by your own extension, so it's hard to see how the
> target schema won't need to be in the path.

Oh, woops.

> Could we hack things so that extension scripts are only allowed to
> reference objects created (a) by the system, (b) earlier in the
> same script, or (c) owned by one of the declared prerequisite
> extensions?  Seems like that might provide a pretty bulletproof
> defense against trojan-horse objects, though I'm not sure how much
> of a pain it'd be to implement.

That doesn't seem like a crazy idea, but the previous idea of having
some magic syntax that means "the schema where extension FOO is" seems
like it might be easier to implement and more generally useful. If we
taught the core system that %!!**&^%?(earthdistance) means "the schema
where the earthdistance is located" that syntax might get some use
even outside of extension creation scripts, which seems like it could
be a good thing, just because code that is used more widely is more
likely to have been debugged to the point where it actually works.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Options given both on cmd-line and in the config with different values

2021-04-14 Thread Tom Lane
Honza Horak  writes:
> I'm trying to understand what is happening in the following bug report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1935301

> The upgrade process makes it a bit more difficult, but it seems to boil 
> down to this problem -- even when pg_ctl gets clear guidance where to 
> find datadir using -D option on the command-line, it forgets this 
> guidance once finding data_directory option in the postgresql.conf.

> Is this the expected behavior actually?

The rule actually is that -D on the command line says where to find
the configuration file.  While -D is then also the default for where
to find the data directory, the config file can override that by
giving data_directory explicitly.

This is intended to support situations where the config file is kept
outside the data directory for management reasons.  If you are not
actively doing that, I'd recommend *not* setting data_directory
explicitly in the file.

While I've not studied the bug report carefully, it sounds like the
update process you're using involves copying the old config file
across verbatim.  You'd at minimum need to filter out data_directory
and related settings to make that safe.
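
For reference, a running server will report which configuration file
and data directory it actually adopted, which can help when debugging
this kind of mix-up:

SELECT name, setting
FROM pg_settings
WHERE name IN ('config_file', 'data_directory');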

regards, tom lane




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Tom Lane
Robert Haas  writes:
> On Wed, Apr 14, 2021 at 10:49 AM Tom Lane  wrote:
>> The situation of interest is where you are trying to install an extension
>> into a schema that also contains malicious objects.  We've managed to make
>> most of the commands you might use in an extension script secure against
>> that situation, and Noah wants to hold SQL-function creation to that same
>> standard.

> Oh, I was forgetting that the creation schema has to be first in your
> search path. :-(

> Does the idea of allowing the creation schema to be set separately
> have any legs? Because it seems like that would help here.

Doesn't help that much, because you still have to reference objects
already created by your own extension, so it's hard to see how the
target schema won't need to be in the path.

[ thinks for awhile ... ]

Could we hack things so that extension scripts are only allowed to
reference objects created (a) by the system, (b) earlier in the
same script, or (c) owned by one of the declared prerequisite
extensions?  Seems like that might provide a pretty bulletproof
defense against trojan-horse objects, though I'm not sure how much
of a pain it'd be to implement.

regards, tom lane




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Tom Lane
Mark Dilger  writes:
>> On Apr 13, 2021, at 3:26 PM, Tom Lane  wrote:
>> However I think we may still need an assumption that earthdistance
>> and cube are in the same schema --- any comments on that?

> This is probably not worth doing, and we are already past feature
> freeze, but adding syntax to look up the namespace of an extension might
> help.

Yeah, that idea was discussed before (perhaps only in private
security-team threads, though).  We didn't do anything about it because
at the time there didn't seem to be pressing need, but in the context
of SQL function bodies there's an obvious use-case.

> We could get something like this working just inside the CREATE EXTENSION 
> command if we expanded on the @extschema@ idea a bit.  At first I thought 
> this idea would suffer race conditions with concurrent modifications of 
> pg_extension or pg_namespace, but it looks like we already have a snapshot 
> when processing the script file, so:

> -CREATE DOMAIN earth AS cube
> +CREATE DOMAIN @@earthdistance@@::earth AS @@cube@@::cube

Right, extending the @extschema@ mechanism is what was discussed,
though I think I'd lean towards something like @extschema:cube@
to denote the schema of a referenced extension "cube".

I'm not sure this is useful enough to break feature freeze for,
but I'm +1 for investigating it for v15.
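
To make that concrete, the earthdistance script lines discussed
upthread might then look something like this (hypothetical syntax, not
implemented; the marker would be substituted while the script is
processed, just as @extschema@ is today):

-- hypothetical @extschema:cube@ marker, for illustration only
CREATE DOMAIN earth AS @extschema:cube@.cube
  CONSTRAINT not_point CHECK (@extschema:cube@.cube_is_point(VALUE))
  CONSTRAINT not_3d CHECK (@extschema:cube@.cube_dim(VALUE) <= 3);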

regards, tom lane




Re: Possible typo/unclear comment in joinpath.c

2021-04-14 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Apr 14, 2021 at 11:36:38AM -0400, James Coleman wrote:
>> In joinpath.c three times we reference "extra_lateral_rels" (with
>> underscores like it's a field), but as far as I can tell that's not a
>> field anywhere in the source code, and looking at the code that
>> follows it seems like it should be referencing "lateral_relids" (and
>> the "extra" is really "extra [in relation to relids]").

> It looks like a loose end from 

> commit edca44b1525b3d591263d032dc4fe500ea771e0e
> Author: Tom Lane 
> Date:   Mon Dec 7 18:56:14 2015 -0500

> Simplify LATERAL-related calculations within add_paths_to_joinrel().

Yeah :-(.  I'm usually pretty careful about grepping for comment
references as well as code references to a field when I do something
like that, but obviously I missed that step that time.

Will fix, thanks James!

regards, tom lane




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 10:49 AM Tom Lane  wrote:
> The situation of interest is where you are trying to install an extension
> into a schema that also contains malicious objects.  We've managed to make
> most of the commands you might use in an extension script secure against
> that situation, and Noah wants to hold SQL-function creation to that same
> standard.

Oh, I was forgetting that the creation schema has to be first in your
search path. :-(

Does the idea of allowing the creation schema to be set separately
have any legs? Because it seems like that would help here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Mark Dilger



> On Apr 13, 2021, at 3:26 PM, Tom Lane  wrote:
> 
> However I think we may still need an assumption that earthdistance
> and cube are in the same schema --- any comments on that?

This is probably not worth doing, and we are already past feature freeze, but 
adding syntax to look up the namespace of an extension might help.  The problem 
seems to be that we can't syntactically refer to the schema of an extension.  
We have to instead query pg_catalog.pg_extension joined against 
pg_catalog.pg_namespace and then interpolate the namespace name into strings 
that get executed, which is ugly.

This syntax is perhaps a non-starter, but conceptually something like:

-CREATE DOMAIN earth AS cube
+CREATE DOMAIN earthdistance::->earth AS cube::->cube

Then we'd perhaps extend RangeVar with an extensionname field and have either a 
schemaname or an extensionname be looked up in places where we currently lookup 
schemas, adding a catcache for extensions.  (Like I said, probably not worth 
doing.)


We could get something like this working just inside the CREATE EXTENSION 
command if we expanded on the @extschema@ idea a bit.  At first I thought this 
idea would suffer race conditions with concurrent modifications of pg_extension 
or pg_namespace, but it looks like we already have a snapshot when processing 
the script file, so:

-CREATE DOMAIN earth AS cube
+CREATE DOMAIN @@earthdistance@@::earth AS @@cube@@::cube

or such, with @@foo@@ being parsed out, looked up in pg_extension join 
pg_namespace, and substituted back in.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Bogus collation version recording in recordMultipleDependencies

2021-04-14 Thread Tom Lane
I noticed some broken-looking logic in recordMultipleDependencies
concerning how it records collation versions.  It was a bit harder
than I expected to demonstrate the bugs, but I eventually succeeded
with

u8=# create function foo(varchar) returns bool language sql return false;
CREATE FUNCTION
u8=# create collation mycoll from "en_US";
CREATE COLLATION
u8=# CREATE DOMAIN d4 AS character varying(3) COLLATE "aa_DJ"
CONSTRAINT yes_or_no_check CHECK (value = 'YES' collate mycoll or 
foo(value));
CREATE DOMAIN
u8=# select objid, pg_describe_object(classid,objid,objsubid) as obj, 
pg_describe_object(refclassid,refobjid,refobjsubid) as ref, deptype, 
refobjversion from pg_depend where objid = 'd4'::regtype;
 objid |   obj   |        ref        | deptype | refobjversion 
-------+---------+-------------------+---------+---------------
 37421 | type d4 | schema public     | n       | 
 37421 | type d4 | collation "aa_DJ" | n       | 
(2 rows)

u8=# select objid, pg_describe_object(classid,objid,objsubid) as obj, 
pg_describe_object(refclassid,refobjid,refobjsubid) as ref, deptype, 
refobjversion from pg_depend where refobjid = 'd4'::regtype;
 objid |            obj             |   ref   | deptype | refobjversion 
-------+----------------------------+---------+---------+---------------
 37420 | type d4[]                  | type d4 | i       | 
 37422 | constraint yes_or_no_check | type d4 | a       | 
(2 rows)

u8=# select objid, pg_describe_object(classid,objid,objsubid) as obj, 
pg_describe_object(refclassid,refobjid,refobjsubid) as ref, deptype, 
refobjversion from pg_depend where objid = 37422;
 objid |            obj             |               ref               | deptype | refobjversion 
-------+----------------------------+---------------------------------+---------+---------------
 37422 | constraint yes_or_no_check | type d4                         | a       | 
 37422 | constraint yes_or_no_check | collation mycoll                | n       | 2.28
 37422 | constraint yes_or_no_check | function foo(character varying) | n       | 2.28
 37422 | constraint yes_or_no_check | collation "default"             | n       | 
(4 rows)

(This is in a glibc-based build, with C as the database's default
collation.)

One question here is whether it's correct that the domain's dependency
on collation "aa_DJ" is unversioned.  Maybe that's intentional, but it
seems worth asking.

Anyway, there are two pretty obvious bugs in the dependencies for the
domain's CHECK constraint: the version for collation mycoll leaks
into the entry for function foo, and an entirely useless (because
unversioned) dependency is recorded on the default collation.

... well, it's almost entirely useless.  If we fix things to not do that
(as per patch 0001 below), the results of the create_index regression
test become unstable, because there's two queries that inquire into the
dependencies of indexes, and their results change depending on whether
the default collation has a version or not.  I'd be inclined to just
take out the portions of that test that depend on that question, but
maybe somebody will complain that there's a loss of useful coverage.
I don't agree, but maybe I'll be overruled.

If we do feel we need to stay bug-compatible with that behavior, then
the alternate 0002 patch just fixes the version-leakage-across-entries
problem, while still removing the unnecessary assumption that C, POSIX,
and DEFAULT are the only pinned collations.

(To be clear: 0002 passes check-world as-is, while 0001 is not
committable without some regression-test fiddling.)

Thoughts?

regards, tom lane

diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 362db7fe91..1217c01b8a 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -73,7 +73,6 @@ recordMultipleDependencies(const ObjectAddress *depender,
 max_slots,
 slot_init_count,
 slot_stored_count;
-	char	   *version = NULL;
 
 	if (nreferenced <= 0)
 		return;	/* nothing to do */
@@ -104,31 +103,22 @@ recordMultipleDependencies(const ObjectAddress *depender,
 	slot_init_count = 0;
 	for (i = 0; i < nreferenced; i++, referenced++)
 	{
-		bool		ignore_systempin = false;
+		char	   *version = NULL;
 
 		if (record_version)
 		{
 			/* For now we only know how to deal with collations. */
 			if (referenced->classId == CollationRelationId)
 			{
-/* C and POSIX don't need version tracking. */
+/* These are unversioned, so don't waste cycles on them. */
 if (referenced->objectId == C_COLLATION_OID ||
 	referenced->objectId == POSIX_COLLATION_OID)
 	continue;
 
 version = get_collation_version_for_oid(referenced->objectId,
 		false);
-
-/*
- * Default collation is pinned, so we need to force recording
- * the dependency to store the version.
- */
-if (referenced->objectId == DEFAULT_COLLATION_OID)
-	ignore_systempin = true;
 			}
 		}
-		else
-			

Re: pg_amcheck contrib application

2021-04-14 Thread Robert Haas
On Mon, Apr 12, 2021 at 11:06 PM Mark Dilger
 wrote:
> It now reports:
>
> # heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
> # toast value 16461 missing chunk 3 with expected size 1996
> # heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
> # toast value 16461 was expected to end at chunk 6 with total size 1, 
> but ended at chunk 5 with total size 8004
>
> It sounds like you weren't expecting the second of these reports.  I think it 
> is valuable, especially when there are multiple missing chunks and multiple 
> extraneous chunks, as it makes it easier for the user to reconcile the 
> missing chunks against the extraneous chunks.

I wasn't, but I'm not overwhelmingly opposed to it, either. I do think
I would be in favor of splitting this kind of thing up into two
messages:

# toast value 16459 unexpected chunks 1000 through 1004 each with
size 1996 followed by chunk 1005 with size 20

We'll have fewer message variants and I don't think any real
regression in usability if we say:

# toast value 16459 has unexpected chunks 1000 through 1004 each
with size 1996
# toast value 16459 has unexpected chunk 1005 with size 20

(Notice that I also inserted "has" so that the sentence has a verb. Or
we could use "contains".)

I committed 0001.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: jsonb subscripting assignment performance

2021-04-14 Thread Alexander Korotkov
On Wed, Apr 14, 2021 at 10:57 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > sure - there is big room for optimization. But this patch was big enough
> > without that optimization. And it was not clear whether it would be committed
> > or not (it waited in the commitfest application for 4 years). So I accepted
> > the implemented behaviour (without in-place update). Now, this patch is in
> > core, and anybody can work on other possible optimizations.
>
> Right, jsonb subscripting deals mostly with the syntax part and doesn't
> change internal jsonb behaviour. If I understand the original question
> correctly, "in-place" here means updating of e.g. just one particular
> key within a jsonb object, since jsonb_set looks like an overwrite of
> the whole jsonb. If so, then update will still cause the whole jsonb to
> be updated, there is no partial update functionality for the on-disk
> format. Although there is work going on to optimize this in case when
> jsonb is big enough to be put into a toast table (partial toast
> decompression thread, or bytea appendable toast).
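
For reference, a minimal example of the syntax in question; note that
either form rewrites the whole jsonb datum on disk (table and values
are made up):

CREATE TABLE docs (id int PRIMARY KEY, body jsonb);
INSERT INTO docs VALUES (1, '{"title": "old", "tags": ["a"]}');

-- subscripting assignment (new in v14) ...
UPDATE docs SET body['title'] = '"new"' WHERE id = 1;

-- ... has the same effect as jsonb_set(): the full value is rewritten.
UPDATE docs SET body = jsonb_set(body, '{title}', '"new"') WHERE id = 1;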

+1

--
Regards,
Alexander Korotkov




[PATCH] expand the units that pg_size_pretty supports on output

2021-04-14 Thread David Christensen
Hi folks,

Enclosed is a patch that expands the unit output for
pg_size_pretty(numeric) going up to Yottabytes; I reworked the existing
numeric output code to account for the larger number of units we're using
rather than just adding nesting levels.

There are also a few other places that could benefit from expanded units,
but this is a standalone starting point.
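
For illustration, released versions stop at TB, so very large inputs
come out as huge TB counts; with the patch the same input would be
reported in a larger unit (the second line is only a guess at the
patch's exact formatting):

SELECT pg_size_pretty(3 * 1024::numeric ^ 8);
-- today:           3298534883328 TB
-- with the patch:  something like '3 YB'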

Best,

David


0001-Expand-the-units-that-pg_size_pretty-numeric-knows-a.patch
Description: Binary data


Re: Possible typo/unclear comment in joinpath.c

2021-04-14 Thread Justin Pryzby
On Wed, Apr 14, 2021 at 11:36:38AM -0400, James Coleman wrote:
> In joinpath.c three times we reference "extra_lateral_rels" (with
> underscores like it's a field), but as far as I can tell that's not a
> field anywhere in the source code, and looking at the code that
> follows it seems like it should be referencing "lateral_relids" (and
> the "extra" is really "extra [in relation to relids]").

It looks like a loose end from 

commit edca44b1525b3d591263d032dc4fe500ea771e0e
Author: Tom Lane 
Date:   Mon Dec 7 18:56:14 2015 -0500

Simplify LATERAL-related calculations within add_paths_to_joinrel().

-- 
Justin




Options given both on cmd-line and in the config with different values

2021-04-14 Thread Honza Horak

Hello hackers,

I'm trying to understand what is happening in the following bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1935301

The upgrade process makes it a bit more difficult, but it seems to boil 
down to this problem -- even when pg_ctl gets clear guidance on where to 
find the datadir via the -D option on the command line, it forgets this 
guidance once it finds a data_directory option in postgresql.conf.


Is this the expected behavior actually? Or is the behavior in this case 
(i.e. when the same option is specified on the cmd-line and also in the 
datadir, with different values) defined at all?


(couldn't find it in the doc and even google does not return me anything 
useful)


Thanks for any tips,
Honza





Possible typo/unclear comment in joinpath.c

2021-04-14 Thread James Coleman
In joinpath.c three times we reference "extra_lateral_rels" (with
underscores like it's a field), but as far as I can tell that's not a
field anywhere in the source code, and looking at the code that
follows it seems like it should be referencing "lateral_relids" (and
the "extra" is really "extra [in relation to relids]").

Assuming that interpretation is correct, I've attached a patch to
change all three occurrences to "extra lateral_relids" to reduce
confusion.

Thanks,
James


v1-0001-Fix-extra_lateral_rels-typo.patch
Description: Binary data


Re: Converting contrib SQL functions to new style

2021-04-14 Thread Tom Lane
Robert Haas  writes:
> On Wed, Apr 14, 2021 at 8:58 AM Noah Misch  wrote:
>> Once CREATE EXTENSION is over, things are a great deal safer under this
>> proposal, as you say.  I suspect it makes CREATE EXTENSION more hazardous.
>> Today, typical SQL commands in extension creation scripts don't activate
>> inexact argument type matching.  You were careful to make each script clear
>> the search_path around commands deviating from that (commit 7eeb1d9).  I 
>> think
>> "CREATE FUNCTION plus1dot1(int) RETURNS numeric LANGUAGE SQL RETURN $1 + 
>> 1.1;"
>> in a trusted extension script would constitute a security vulnerability, 
>> since
>> it can lock in the wrong operator.

> I don't understand how that can happen, unless we've failed to secure
> the search_path. And, if we've failed to secure the search_path, I
> think we are in a lot of trouble no matter what else we do.

The situation of interest is where you are trying to install an extension
into a schema that also contains malicious objects.  We've managed to make
most of the commands you might use in an extension script secure against
that situation, and Noah wants to hold SQL-function creation to that same
standard.
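
To make that hazard concrete: if the target schema already contains
something like the following (names invented for illustration), an
exact match for int + numeric exists there and can capture an
expression such as "$1 + 1.1" when the script is parsed with that
schema on the search_path:

CREATE FUNCTION evil_plus(int, numeric) RETURNS numeric
    LANGUAGE sql AS 'SELECT 42::numeric';  -- attacker-controlled result
CREATE OPERATOR + (LEFTARG = int, RIGHTARG = numeric,
                   FUNCTION = evil_plus);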

My concern in this patch is rendering SQL functions safe against untrusted
search_path at *time of use*, which is really an independent security
concern.

If you're willing to assume there's nothing untrustworthy in your
search_path, then there's no issue and nothing to fix.  Unfortunately,
that seems like a rather head-in-the-sand standpoint.

regards, tom lane




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-14 Thread Robert Haas
On Tue, Apr 13, 2021 at 10:42 PM Craig Ringer
 wrote:
> I'd really love it if a committer could add an explanatory comment or
> two in the area though. I'm happy to draft a comment patch if anyone
> agrees my suggestion is sensible. The key things I needed to know when
> studying the code were:
>
> * A LWLock* is always part of a tranche, but locks within a given
> tranche are not necessarily co-located in memory, cross referenced or
> associated in any way.
> * A LWLock tranche does not track how many LWLocks are in the tranche
> or where they are in memory. It only groups up LWLocks into categories
> and maps the tranche ID to a name.
> * Not all LWLocks are part of the main LWLock array; others can be
> embedded in shmem structs elsewhere, including in DSM segments.
> * LWLocks in DSM segments may not have the same address between
> different backends, because the DSM base address can vary, so a
> LWLock* cannot be reliably compared between backends unless you know
> it's in the main LWLock array or in static shmem.
>
> Having that info in lwlock.c near the tranche management code or the
> tranche and main lwlock arrays would've been very handy.

I'm willing to review a comment patch along those lines.

> I'm actually inclined to revise the patch I sent in order to *remove*
> the LWLock name from the tracepoint argument. At least for the
> fast-path tracepoints on start-of-acquire and end-of-acquire. I think
> it's probably OK to report it in the lock wait tracepoints, which is
> where it's most important to have anyway. So the tracepoint will
> always report the LWLock* and tranche ID, but it won't report the
> tranche name for the fast-path. I'll add trace events for tranche ID
> registration, so trace tools can either remember the tranche ID->name
> mappings from there, or capture them from lock wait events and
> remember them.
>
> Reasonable? That way we retain the most important trace functionality,
> but we reduce the overheads.

Reducing the overheads is good, but I have no opinion on what's
important for people doing tracing, because I am not one of those
people.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-14 Thread Robert Haas
On Tue, Apr 13, 2021 at 4:46 PM Andres Freund  wrote:
> I still don't like the two bytes, fwiw ;). Especially because it's 4
> bytes due to padding right now.

I'm not surprised by that disclosure. But I think it's entirely worth
it. Making wait states visible in pg_stat_activity isn't the most
useful thing I've ever done to PostgreSQL, but it's far from the least
useful. If we can get the same benefit with less overhead, that's
great.

> I'd like to move the LWLock->waiters list to outside of the lwlock
> itself - at most TotalProcs LWLocks can be waited for, so we don't need
> millions of empty proclist_heads. That way we could also remove the
> proclist indirection - which shows up a fair bit in contended workloads.
>
> And if we had a separate "lwlocks being waited for" structure, we could
> also add more information to it if we wanted to...
>
> The difficulty of course is having space to indicate which of these
> "waiting for" lists are being used - there's not enough space in ->state
> right now to represent that. Two possibile approaches:
>
> - We could make it work if we restricted MAX_BACKENDS to be 2**14 - but
>   while I personally think that's a sane upper limit, I already had a
>   hard time getting consensus to lower the limit to 2^18-1.
>
> - Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as
>   well as an offset to one of MAX_BACKENDS "wait lists" into LWLock.

I'd rather not further reduce MAX_BACKENDS. I still think some day
we're going to want to make that bigger again. Maybe not for a while,
admittedly. But, do you need to fit this into "state"? If you just
replaced "waiters" with a 32-bit integer, you'd save 4 bytes and have
bits left over (and maybe restrict the tranche ID to 2^14 and squeeze
that in too, as you mention).

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2021-04-14 Thread Andrei Zubkov
Hello, Kuroda!

On Fri, 2021-04-09 at 00:23 +, kuroda.hay...@fujitsu.com wrote:
> I think you are right. 
> According to [1] we can bump up the version per one PG major version,
> and any features are not committed yet for 15.
> 
> [1]: https://www.postgresql.org/message-id/20201202040516.GA43757@nol

Version 2 of patch attached. 
pg_stat_statements version is now 1.10 and patch is based on 0f61727.

-- 
Andrei Zubkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From 321ce82f22894be39297515268c1f3bed74778e2 Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Wed, 14 Apr 2021 16:35:56 +0300
Subject: [PATCH] pg_stat_statements: Track statement entry timestamp

Added a first_seen column to the pg_stat_statements view. This field is
populated with the current timestamp when a new statement is added to
the pg_stat_statements hashtable. It gives clear information about the
statistics collection interval for each statement. It can also be used
by sampling solutions to detect cases where a statement was evicted and
re-entered between two samples.

Discussion: https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 contrib/pg_stat_statements/Makefile   |  3 +-
 .../expected/pg_stat_statements.out   | 29 +
 .../pg_stat_statements--1.9--1.10.sql | 59 +++
 .../pg_stat_statements/pg_stat_statements.c   | 31 +-
 .../pg_stat_statements.control|  2 +-
 .../sql/pg_stat_statements.sql|  9 +++
 doc/src/sgml/pgstatstatements.sgml|  9 +++
 7 files changed, 137 insertions(+), 5 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 3ec627b956..7d7b2f4808 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,8 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
+DATA = pg_stat_statements--1.4.sql \
+	pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 674ed270a8..f2268dfd87 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -941,4 +941,33 @@ SELECT query, toplevel, plans, calls FROM pg_stat_statements WHERE query LIKE '%
  $$ LANGUAGE plpgsql   |  |   | 
 (3 rows)
 
+--
+-- statement timestamps
+--
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+SELECT 1 AS "STMTTS1";
+ STMTTS1 
+-
+   1
+(1 row)
+
+SELECT now() AS ref_ts \gset
+SELECT 1,2 AS "STMTTS2";
+ ?column? | STMTTS2 
+--+-
+1 |   2
+(1 row)
+
+SELECT first_seen >= :'ref_ts', count(*) FROM pg_stat_statements WHERE query LIKE '%STMTTS%' GROUP BY first_seen >= :'ref_ts' ORDER BY first_seen >= :'ref_ts';
+ ?column? | count 
+--+---
+ f| 1
+ t| 1
+(2 rows)
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql b/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql
new file mode 100644
index 00..969a30fbdc
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql
@@ -0,0 +1,59 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.10'" to load this file. \quit
+
+/* We need to redefine a view and a function */
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(boolean);
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements(boolean);
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+OUT userid oid,
+OUT dbid oid,
+OUT toplevel bool,
+OUT queryid bigint,
+OUT query text,
+OUT plans int8,
+OUT total_plan_time float8,
+OUT min_plan_time float8,
+OUT max_plan_time float8,
+OUT mean_plan_time float8,
+OUT stddev_plan_time float8,
+OUT calls int8,
+OUT total_exec_time float8,
+OUT min_exec_time float8,
+OUT max_exec_time float8,
+OUT mean_exec_time float8,
+OUT stddev_exec_time float8,
+OUT rows int8,
+OUT 

Re: [BUG] Autovacuum not dynamically decreasing cost_limit and cost_delay

2021-04-14 Thread Mead, Scott


> On Mar 1, 2021, at 8:43 PM, Masahiko Sawada  wrote:
> 
> On Mon, Feb 8, 2021 at 11:49 PM Mead, Scott  wrote:
>> 
>> Hello,
>>   I recently looked at what it would take to make a running autovacuum 
>> pick-up a change to either cost_delay or cost_limit.  Users frequently will 
>> have a conservative value set, and then wish to change it when autovacuum 
>> initiates a freeze on a relation.  Most users end up finding out they are in 
>> ‘to prevent wraparound’ after it has happened, this means that if they want 
>> the vacuum to take advantage of more I/O, they need to stop and then restart 
>> the currently running vacuum (after reloading the GUCs).
>> 
>>  Initially, my goal was to determine feasibility for making this dynamic.  I 
>> added debug code to vacuum.c:vacuum_delay_point(void) and found that changes 
>> to cost_delay and cost_limit are already processed by a running vacuum.  
>> There was a bug preventing the cost_delay or cost_limit from being 
>> configured to allow higher throughput however.
>> 
>> I believe this is a bug because currently, autovacuum will dynamically 
>> detect and increase the cost_limit or cost_delay, but it can never decrease 
>> those values beyond their setting when the vacuum began.  The current 
>> behavior is for vacuum to limit the maximum throughput of currently running 
>> vacuum processes to the cost_limit that was set when the vacuum process 
>> began.
> 
> Thanks for your report.
> 
> I've not looked at the patch yet but I agree that the calculation for
> autovacuum cost delay seems not to work fine if vacuum-delay-related
> parameters (e.g., autovacuum_vacuum_cost_delay etc) are changed during
> vacuuming a table to speed up running autovacuums. Here is my
> analysis:


I appreciate your in-depth analysis and will comment in-line.  That said, I
still think it's important that the attached patch is applied.  As it is today,
a few lines of code prevent users from increasing the throughput of vacuums
that are already running without having to cancel them first.

The patch that I’ve provided allows users to decrease their vacuum_cost_delay 
and get an immediate boost in performance to their running vacuum jobs.


> 
> Suppose we have the following parameters and 3 autovacuum workers are
> running on different tables:
> 
> autovacuum_vacuum_cost_delay = 100
> autovacuum_vacuum_cost_limit = 100
> 
> Vacuum cost-based delay parameters for each workers are follows:
> 
> worker->wi_cost_limit_base = 100
> worker->wi_cost_limit = 33
> worker->wi_cost_delay = 100
> 
> Each running autovacuum has "wi_cost_limit = 33" because the total
> limit (100) is equally rationed. And another point is that the total
> wi_cost_limit (99 = 33*3) is less than autovacuum_vacuum_cost_limit,
> 100. Which are fine.
> 
> Here let's change autovacuum_vacuum_cost_delay/limit value to speed up
> running autovacuums.
> 
> Case 1 : increasing autovacuum_vacuum_cost_limit to 1000.
> 
> After reloading the configuration file, vacuum cost-based delay
> parameters for each worker become as follows:
> 
> worker->wi_cost_limit_base = 100
> worker->wi_cost_limit = 100
> worker->wi_cost_delay = 100
> 
> If we rationed autovacuum_vacuum_cost_limit, 1000, to 3 workers, it
> would be 333. But since we cap it by wi_cost_limit_base, the
> wi_cost_limit is 100. I think this is what Mead reported here.


Yes, this is exactly correct.  The cost_limit is capped at the cost_limit that 
was set during the start of a running vacuum.  My patch changes this cap to be 
the max allowed cost_limit (10,000).
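
For context, the clamp in question in autovac_balance_cost() looks roughly
like this (paraphrased, not an exact copy of autovacuum.c):

    /* worker's proportional share of the available cost budget */
    limit = (int) (cost_avail * worker->wi_cost_limit_base / cost_total);

    /* never let a worker exceed the limit it started with */
    worker->wi_cost_limit = Max(Min(limit, worker->wi_cost_limit_base), 1);

The patch effectively replaces wi_cost_limit_base in that Min() with the
absolute maximum allowed cost_limit (10,000), so that a higher limit loaded
from a configuration reload can take effect for an already-running vacuum.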



> 
> Case 2 : decreasing autovacuum_vacuum_cost_delay to 10.
> 
> After reloading the configuration file, vacuum cost-based delay
> parameters for each workers become as follows:
> 
> worker->wi_cost_limit_base = 100
> worker->wi_cost_limit = 100
> worker->wi_cost_delay = 100
> 
> Actually, the result is the same as case 1. But In this case, the
> total cost among the three workers is 300, which is greater than
> autovacuum_vacuum_cost_limit, 100. This behavior violates what the
> documentation explains in the description of
> autovacuum_vacuum_cost_limit:
> 
> ---
> Note that the value is distributed proportionally among the running
> autovacuum workers, if there is more than one, so that the sum of the
> limits for each worker does not exceed the value of this variable.
> ---
> 
> It seems to me that those problems come from the fact that we don't
> change both wi_cost_limit_base and wi_cost_delay during auto-vacuuming
> a table in spite of using autovacuum_vac_cost_limit/delay to calculate
> cost_avail. Such a wrong calculation happens until all running
> autovacuum workers finish the current vacuums. When a worker starts to
> process a new table, it resets both wi_cost_limit_base and
> wi_cost_delay.



ANALYZE counts LP_DEAD line pointers as n_dead_tup

2021-04-14 Thread Masahiko Sawada
Hi all,

If we create a table with vacuum_index_cleanup = off or execute VACUUM
with INDEX_CLEANUP = off, vacuum updates pg_stat_all_tables.n_dead_tup
to the number of HEAPTUPLE_RECENTLY_DEAD tuples. Whereas analyze
updates it to the sum of the number of HEAPTUPLE_DEAD/RECENTLY_DEAD
tuples and LP_DEAD line pointers. So if the table has many LP_DEAD
line pointers due to skipping index cleanup, autovacuum is triggered
every time after analyze/autoanalyze. This issue seems to happen also
on back branches, probably from 12 where INDEX_CLEANUP option was
introduced.

I think we can have heapam_scan_analyze_next_tuple() not count LP_DEAD
line pointers, as lazy_scan_prune() does. Attached is a patch for that.
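
To illustrate the idea, the relevant branch in
heapam_scan_analyze_next_tuple() would end up looking roughly like this
(a sketch only, not the exact patch):

    if (!ItemIdIsNormal(itemid))
    {
        /*
         * Unused and redirect line pointers are skipped as before; LP_DEAD
         * items are no longer counted into *deadrows either, matching what
         * lazy_scan_prune() reports.
         */
        continue;
    }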

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


not_count_lp_dead_by_analyze.patch
Description: Binary data


Can a child process detect postmaster death when in pg_usleep?

2021-04-14 Thread Bharath Rupireddy
Hi,

In my dev system(Ubuntu) when the postmaster is killed with SIGKILL,
SIGPWR is being sent to its child processes (backends/any other bg
process). If a child process is waiting with pg_usleep, it looks like
it is not detecting the postmaster's death and it doesn't exit
immediately but stays forever until it gets killed explicitly. For
this experiment, I did 2 things to simulate the scenario i.e. a
backend waiting in pg_usleep and killing the postmaster. 1) I wrote a
wait function that uses pg_usleep and called it in a backend. This
backend doesn't exit on postmaster death. 2)  I set PostAuthDelay to
100 seconds and started the postmaster. Then, the "autovacuum
launcher" process still stays around (as it has pg_usleep in its main
function), even after postmaster death.

Questions:
1) Is it really harmful to use pg_usleep in a postmaster child process
as it doesn't let the child process detect postmaster death?

2) Can pg_usleep() always detect signals? I see the caution in the
pg_usleep function definition in pgsleep.c, saying the signal handling
is platform dependent. We have code blocks like below in the code. Do
we actually process interrupts before going to sleep with pg_usleep()?
while/for loop
{
..
..
 CHECK_FOR_INTERRUPTS();
 pg_usleep();
}
and
if (PostAuthDelay)
   pg_usleep();

3) Is it intentional to use pg_usleep in some places in the code? If
yes, what are they? At least, I see one place where it's intentional
in the wait_pid function which is used while running the regression
tests.

4) Are there any places where we need to replace pg_usleep with
WaitLatch/equivalent of pg_sleep to detect the postmaster death
properly?
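
For reference, here is a minimal sketch of the latch-based wait I have in
mind as a replacement for a plain pg_usleep() (the 1-second timeout and the
wait event are only illustrative):

    /* assumes storage/latch.h, miscadmin.h and pgstat.h are included */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     1000L,
                     WAIT_EVENT_PG_SLEEP);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();

With WL_EXIT_ON_PM_DEATH the process exits as soon as the postmaster goes
away, instead of sleeping out the remaining timeout.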

Correct me if I'm missing something or if my observation/understanding
of the pg_usleep() is wrong.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 8:58 AM Noah Misch  wrote:
> Once CREATE EXTENSION is over, things are a great deal safer under this
> proposal, as you say.  I suspect it makes CREATE EXTENSION more hazardous.
> Today, typical SQL commands in extension creation scripts don't activate
> inexact argument type matching.  You were careful to make each script clear
> the search_path around commands deviating from that (commit 7eeb1d9).  I think
> "CREATE FUNCTION plus1dot1(int) RETURNS numeric LANGUAGE SQL RETURN $1 + 1.1;"
> in a trusted extension script would constitute a security vulnerability, since
> it can lock in the wrong operator.

I don't understand how that can happen, unless we've failed to secure
the search_path. And, if we've failed to secure the search_path, I
think we are in a lot of trouble no matter what else we do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: sepgsql logging

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 8:42 AM Dave Page  wrote:
> Attached is a patch to clean this up. It will log denials as such regardless 
> of whether or not either selinux or sepgsql is in permissive mode. When 
> either is in permissive mode, it'll add " permissive=1" to the end of the log 
> messages. e.g.

Looks superficially reasonable on first glance, but I think we should
try to get an opinion from someone who knows more about SELinux.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] PREPARE TRANSACTION unexpected behavior with TEMP TABLE

2021-04-14 Thread Robert Haas
On Wed, Apr 14, 2021 at 6:45 AM Himanshu Upadhyaya
 wrote:
> The purpose of this FIX is mainly to get consistent behavior with PREPARE
> TRANSACTION. In the case that I had mentioned previously, my expectation
> was that either both PREPARE TRANSACTION commands should fail or both
> should succeed, but here the second, identical "PREPARE TRANSACTION"
> succeeded while the first one failed with an error, which is kind of
> weird to me.

I agree that it's weird, but that doesn't mean that your patch is an
improvement, and I don't think it is. If we could make this thing more
consistent without incurring any negatives, I'd be in favor of that.
But the patch does have some negatives, which in my opinion are more
substantial than the problem you're trying to fix. Namely, possible
performance consequences, and undocumented and fragile assumptions
that, as it seems to me, may easily get broken in the future. I see
that you've repeatedly capitalized the word FIX in your reply, but
it's just not that simple. If this had really bad consequences like
corrupting data or crashing the server then it would be essential to
do something about it, but so far the worst consequence you've
indicated is that an obscure sequence of SQL commands that no real
user is likely to issue produces a slightly surprising result. That's
not great, but neither is it an emergency.

> I have also tried to reproduce the behavior.

Your test case isn't ideal for reproducing the problem that the
comment is worrying about. Normally, when we take a lock on a table,
we hold it until commit. But, that only applies when we run a query
that mentions the table, like a SELECT or an UPDATE. In your case, we
only end up opening the table to build a relcache entry for it, so
that we can look at the metadata. And, catalog scans used to build
syscache and relcache entries release locks immediately, rather than
waiting until the end of the transaction. So it might be that if we
failed to ever set XACT_FLAGS_ACCESSEDTEMPNAMESPACE in your test case,
everything would be fine.

That doesn't seem to be true in general though. I tried changing
"cannot PREPARE a transaction that has operated on temporary objects"
from an ERROR to a NOTICE and then ran 'make check'. It hung. I think
this test case is the same problem as the regression tests hit; in any
case, it also hangs:

rhaas=# begin;
BEGIN
rhaas=*# create temp table first ();
CREATE TABLE
rhaas=*# prepare transaction 'whatever';
NOTICE:  cannot PREPARE a transaction that has operated on temporary objects
PREPARE TRANSACTION
rhaas=# create temp table second ();
[ hangs ]

I haven't checked, but I think the problem here is that the first
transaction had to create this backend's pg_temp schema and the second
one can't see the results of the first one doing it so it wants to do
the same thing and that results in waiting for a lock the prepared
transaction already holds. I made a quick attempt to reproduce a hang
at backend exit time, but couldn't find a case where that happened.
That doesn't mean there isn't one, though. There's a very good chance
that the person who wrote that comment knew that a real problem
existed, and just didn't describe it well enough for you or I to
immediately know what it is. It is also possible that they were
completely wrong, or that things have changed since the comment was
written, but we can't assume that without considerably more research
and analysis than either of us has done so far.

I think one point to take away here is that question of whether a
temporary relation has been "accessed" is not all black and white. If
I ran a SELECT statement against a relation, I think it's clear that
I've accessed it. But, if I just used the name of that relation as a
type name in some other SQL command, did I really access it? The
current code's answer to that is that if we had to open and lock the
relation to get its metadata, then we accessed it, and if we already
had the details that we needed in cache, then we did not access it.
Now, again, I agree that looks a little weird from a user point of
view, but looking at the implementation, you can kind of see why it
ends up like that. From a certain point of view, it would be more
surprising if we never opened or locked the relation and yet ended up
deciding that we'd accessed it. Now maybe we should further explore
going the other direction and avoiding setting the flag at all in this
case, since I think we're neither retaining a lock nor touching any
relation buffers, but I think that needs more analysis. Even if we
decide that's safe, there's still the problem of finding a better
implementation that's not overly complex for what really feels like a
very minor issue.

--
Robert Haas
EDB: http://www.enterprisedb.com




RE: Support tab completion for upper character inputs in psql

2021-04-14 Thread tanghy.f...@fujitsu.com
On Thursday, April 8, 2021 4:14 PM, Peter Eisentraut 
 wrote

>Seeing the tests you provided, it's pretty obvious that the current 
>behavior is insufficient.  I think we could probably think of a few more 
>tests, for example exercising the "If case insensitive matching was 
>requested initially, adjust the case according to setting." case, or 
>something with quoted identifiers.

Thanks for your review and suggestions on my patch. 
I've added more tests in the latest patch, V5; the added tests helped me find
some bugs in my patch, which I have fixed.
Now the patch supports not only SET/SHOW [PARAMETER] but also UPDATE
["aTable"|ATABLE] and UPDATE atable SET ["aColumn"|ACOLUMN].

I would really appreciate further test suggestions for my patch, or if someone
could kindly test it and let me know about any bugs they find.

Differences from V4 are:
* fix some bugs related to quoted identifiers.
* add some tap tests.

Regards,
Tang


V5-0001-Support-tab-completion-with-a-query-result-for-upper.patch
Description:  V5-0001-Support-tab-completion-with-a-query-result-for-upper.patch


Re: [PATCH] Identify LWLocks in tracepoints

2021-04-14 Thread Peter Eisentraut

On 12.04.21 07:46, Craig Ringer wrote:

 > To use systemtap semaphores (the _ENABLED macros) you need to run dtrace
 > -g to generate a probes.o then link that into postgres.
 >
 > I don't think we do that. I'll double check soon.

We do that.  (It's -G.)


Huh. I could've sworn we didn't. My mistake, it's there in 
src/backend/Makefile .


In that case I'll amend the patch to use semaphore guards.


This whole thread is now obviously moved to consideration for PG15, but 
I did add an open item about this particular issue 
(https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items, search for 
"dtrace").  So if you could produce a separate patch that adds the 
_ENABLED guards targeting PG14 (and PG13), that would be helpful.
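
For clarity, the guard pattern in question looks like this, taking one of
the existing lwlock.c probes as an example (sketch only):

    if (TRACE_POSTGRESQL_LWLOCK_WAIT_START_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);

so that the probe's argument expressions are only evaluated when the probe
is actually enabled.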





Re: CTE push down

2021-04-14 Thread Ashutosh Bapat
On Tue, Apr 13, 2021 at 6:58 PM Alexander Pyhalov
 wrote:
>
> Hi.
>
> Currently PostgreSQL supports CTE push down for SELECT statements, but
> it is implemented as turning each CTE reference into subquery.
>
> When CTE is referenced multiple times, we have choice - to materialize
> CTE (and disable quals distribution to the CTE query) or inline it (and
> so run CTE query multiple times,
> which can be inefficient, for example, when CTE references foreign
> tables).
>
> I was looking if it is possible to collect quals referencing CTE,
> combine in OR qual and add them to CTE query.
>
> So far I consider the following changes.
>
> 1) Modify SS_process_ctes() to add a list of RestrictInfo* to
> PlannerInfo - one NULL RestrictInfo pointer per CTE (let's call this
> list cte_restrictinfos for now)/
> 2) In distribute_restrictinfo_to_rels(), when we get rel of RTE_CTE
> relkind and sure that can safely pushdown restrictinfo, preserve
> restrictinfo in cte_restrictinfos, converting multiple restrictions to
> "OR" RestrictInfos.
> 3) In the end of subquery_planner() (after inheritance_planner() or
> grouping_planner()) we can check if cte_restrictinfos contain some
> non-null RestrictInfo pointers and recreate plan for corresponding CTEs,
> distributing quals to relations inside CTE queries.
>
> For now I'm not sure how to handle vars mapping when we push
> restrictinfos to the level of cte root or when we push it down to the
> cte plan, but properly mapping vars seems seems to be doable.

I think a similar mapping happens when we push quals that reference a
named JOIN down to join rels. I didn't take a look at it, but I think
it happens before planning time. Some similar machinery might help
in this case.

I believe step 2 is needed to avoid materializing rows which will never
be selected. That would be a good improvement. However, care needs to
be taken with volatile quals. I think the quals on the CTE will be
evaluated twice: once when materializing the CTE result and a second
time when scanning the materialized result. Volatile quals may produce
different results when run multiple times.

>
> Is there something else I miss?
> Does somebody work on alternative solution or see issues in such
> approach?

IMO, a POC patch will help understand your idea.

-- 
Best Wishes,
Ashutosh Bapat




Re: Converting contrib SQL functions to new style

2021-04-14 Thread Noah Misch
On Tue, Apr 13, 2021 at 11:11:13PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Tue, Apr 13, 2021 at 06:26:34PM -0400, Tom Lane wrote:
> >> Attached are some draft patches to convert almost all of the
> >> contrib modules' SQL functions to use SQL-standard function bodies.
> >> The point of this is to remove the residual search_path security
> >> hazards that we couldn't fix in commits 7eeb1d986 et al.  Since
> >> a SQL-style function body is fully parsed at creation time,
> >> its object references are not subject to capture by the run-time
> >> search path.
> 
> > Are there any inexact matches in those function/operator calls?  Will that
> > matter more or less than it does today?
> 
> I can't claim to have looked closely for inexact matches.  It should
> matter less than today, since there's a hazard only during creation
> (with a somewhat-controlled search path) and not during use.  But
> that doesn't automatically eliminate the issue.

Once CREATE EXTENSION is over, things are a great deal safer under this
proposal, as you say.  I suspect it makes CREATE EXTENSION more hazardous.
Today, typical SQL commands in extension creation scripts don't activate
inexact argument type matching.  You were careful to make each script clear
the search_path around commands deviating from that (commit 7eeb1d9).  I think
"CREATE FUNCTION plus1dot1(int) RETURNS numeric LANGUAGE SQL RETURN $1 + 1.1;"
in a trusted extension script would constitute a security vulnerability, since
it can lock in the wrong operator.




Re: sepgsql logging

2021-04-14 Thread Dave Page
Hi

On Thu, Apr 1, 2021 at 3:30 PM Dave Page  wrote:

>
>
> On Thu, Apr 1, 2021 at 3:23 PM Tom Lane  wrote:
>
>> Andrew Dunstan  writes:
>> > On 4/1/21 8:32 AM, Dave Page wrote:
>> >> It seems to me that sepgsql should also log the denial, but flag that
>> >> permissive mode is on.
>>
>> > +1 for doing what selinux does if possible.
>>
>> +1.  If selinux itself is doing that, it's hard to see a reason why
>> we should not; and I concur that the info is useful.
>>
>
> Thanks both. I'll take a look at the code and see if I can whip up a patch
> (it'll be a week or so as I'm taking some time off for Easter).
>

Attached is a patch to clean this up. It will log denials as such
regardless of whether or not either selinux or sepgsql is in permissive
mode. When either is in permissive mode, it'll add " permissive=1" to the
end of the log messages. e.g.

Regular user in permissive mode, with a restricted table column:

2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_table
name="public.tb_users" permissive=1
2021-04-14 13:20:30.401 BST [23073] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column uid of table tb_users" permissive=1
2021-04-14 13:20:30.401 BST [23073] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column name of table tb_users" permissive=1
2021-04-14 13:20:30.401 BST [23073] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column mail of table tb_users" permissive=1
2021-04-14 13:20:30.401 BST [23073] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column address of table tb_users" permissive=1
2021-04-14 13:20:30.401 BST [23073] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column salt of table tb_users" permissive=1
2021-04-14 13:20:30.401 BST [23073] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: denied { select }
scontext=user_u:user_r:user_t:s0
tcontext=system_u:object_r:sepgsql_secret_table_t:s0 tclass=db_column
name="column phash of table tb_users" permissive=1
2021-04-14 13:20:30.401 BST [23073] STATEMENT:  SELECT * FROM tb_users;

The same user/table, but in enforcing mode:

2021-04-14 13:17:21.645 BST [22974] LOG:  SELinux: allowed { search }
scontext=user_u:user_r:user_t:s0
tcontext=system_u:object_r:sepgsql_schema_t:s0 tclass=db_schema
name="public" at character 15
2021-04-14 13:17:21.645 BST [22974] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:17:21.646 BST [22974] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_table
name="public.tb_users"
2021-04-14 13:17:21.646 BST [22974] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:17:21.646 BST [22974] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column uid of table tb_users"
2021-04-14 13:17:21.646 BST [22974] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:17:21.646 BST [22974] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column name of table tb_users"
2021-04-14 13:17:21.646 BST [22974] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:17:21.646 BST [22974] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column mail of table tb_users"
2021-04-14 13:17:21.646 BST [22974] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:17:21.646 BST [22974] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column address of table tb_users"
2021-04-14 13:17:21.646 BST [22974] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:17:21.646 BST [22974] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column salt of table tb_users"
2021-04-14 13:17:21.646 BST [22974] STATEMENT:  SELECT * FROM tb_users;
2021-04-14 13:17:21.646 BST [22974] LOG:  SELinux: denied { 

Re: Extensions not dumped when --schema is used

2021-04-14 Thread Noah Misch
On Wed, Apr 14, 2021 at 10:38:17AM +0900, Michael Paquier wrote:
> On Tue, Apr 13, 2021 at 08:00:34AM -0700, Noah Misch wrote:
> > On Tue, Apr 13, 2021 at 02:43:11PM +0900, Michael Paquier wrote:
> >>> - If extschema='public', "pg_dump -e plpgsql --schema=public" includes
> >>>   commands to dump the relation data.  This surprised me.  (The
> >>>   --schema=public argument causes selectDumpableNamespace() to set
> >>>   nsinfo->dobj.dump=DUMP_COMPONENT_ALL instead of DUMP_COMPONENT_ACL.)

> > This isn't a problem of selecting schemas for inclusion in the dump.  This 
> > is
> > a problem of associating extensions with their pg_extension_config_dump()
> > relations and omitting those extension-member relations when "-e" causes
> > omission of the extension.
> 
> At code level, the decision to dump the data of any extension's
> dumpable table is done in processExtensionTables().  I have to admit
> that your feeling here keeps the code simpler than what I have been
> thinking if we apply an extra filtering based on the list of
> extensions in this code path.  So I can see the value in your argument
> to not dump at all the data of an extension's dumpable table as long
> as its extension is not listed, and this, even if its schema is
> explicitly listed.
> 
> So I got down to make the behavior more consistent with the patch
> attached.  This passes your case.

Yes.

> It is worth noting that if a
> table is part of a schema created by an extension, but that the table
> is not dependent on the extension, we would still dump its data if
> using --schema with this table's schema while the extension is not
> part of the list from --extension.  In the attached, that's just the
> extra test with without_extension_implicit_schema.

That's consistent with v13, and it seems fine.  I've not used a non-test
extension that creates a schema.

> --- a/src/test/modules/test_pg_dump/t/001_base.pl
> +++ b/src/test/modules/test_pg_dump/t/001_base.pl
> @@ -208,6 +208,30 @@ my %pgdump_runs = (
>   'pg_dump', '--no-sync', 
> "--file=$tempdir/without_extension.sql",
>   '--extension=plpgsql', 'postgres',
>   ],
> + },
> +
> + # plgsql in the list of extensions blocks the dump of extension
> + # test_pg_dump.
> + without_extension_explicit_schema => {
> + dump_cmd => [
> + 'pg_dump',
> + '--no-sync',
> + "--file=$tempdir/without_extension_explicit_schema.sql",
> + '--extension=plpgsql',
> + '--schema=public',
> + 'postgres',
> + ],
> + },
> +
> + without_extension_implicit_schema => {
> + dump_cmd => [
> + 'pg_dump',
> + '--no-sync',
> + "--file=$tempdir/without_extension_implicit_schema.sql",
> + '--extension=plpgsql',
> + '--schema=regress_pg_dump_schema',
> + 'postgres',
> + ],
>   },);

The name "without_extension_explicit_schema" arose because that test differs
from the "without_extension" test by adding --schema=public.  The test named
"without_extension_implicit_schema" differs from "without_extension" by adding
--schema=regress_pg_dump_schema, so the word "implicit" feels not-descriptive
of the test.  I recommend picking a different name.  Other than that, the
change looks good.




Re: Replication slot stats misgivings

2021-04-14 Thread vignesh C
On Wed, Apr 14, 2021 at 12:09 PM Amit Kapila  wrote:
>
> On Tue, Apr 13, 2021 at 1:37 PM vignesh C  wrote:
> >
> > On Mon, Apr 12, 2021 at 7:03 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > The following test for the latest v8 patch seems to show different.
> > > total_bytes is 1808 whereas spill_bytes is 1320. Am I missing
> > > something?
> > >
> > > postgres(1:85969)=# select pg_create_logical_replication_slot('s',
> > > 'test_decoding');
> > >  pg_create_logical_replication_slot
> > > 
> > >  (s,0/1884468)
> > > (1 row)
> > >
> > > postgres(1:85969)=# create table a (i int);
> > > CREATE TABLE
> > > postgres(1:85969)=# insert into a select generate_series(1, 10);
> > > INSERT 0 10
> > > postgres(1:85969)=# set logical_decoding_work_mem to 64;
> > > SET
> > > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> > >  slot_name | total_txns | total_bytes | spill_txns | spill_count |
> > > spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > > ---++-++-+-+-+--+--+-
> > >  s |  0 |   0 |  0 |   0 |
> > >   0 |   0 |0 |0 |
> > > (1 row)
> > >
> > > postgres(1:85969)=# select count(*) from
> > > pg_logical_slot_peek_changes('s', NULL, NULL);
> > >  count
> > > 
> > >  14
> > > (1 row)
> > >
> > > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> > >  slot_name | total_txns | total_bytes | spill_txns | spill_count |
> > > spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > > ---++-++-+-+-+--+--+-
> > >  s |  2 |1808 |  1 | 202 |
> > > 1320 |   0 |0 |0 |
> > > (1 row)
> > >
> >
> > Thanks for identifying this issue, while spilling the transactions
> > reorder buffer changes gets released, we will not be able to get the
> > total size for spilled transactions from reorderbuffer size. I have
> > fixed it by including spilledbytes to totalbytes in case of spilled
> > transactions. Attached patch has the fix for this.
> > Thoughts?
> >
>
> I am not sure if that is the best way to fix it because sometimes we
> clear the serialized flag in which case it might not give the correct
> answer. Another way to fix it could be that before we try to restore a
> new set of changes, we update totalBytes counter. See, the attached
> patch atop your v6-0002-* patch.

I felt that calculating totalbytes this way is better than depending on
spill_bytes, so I have taken your changes. The attached patch includes the
suggested changes.
Thoughts?

Regards,
Vignesh
From 9a4ca0fcef85d000856339cfcb2a58eb87ee5e72 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Wed, 14 Apr 2021 10:08:13 +0530
Subject: [PATCH v10 1/4] Added total txns and total txn bytes to replication
 statistics.

This adds the statistics about total transactions count and total transaction
data logically replicated to the decoding output plugin from ReorderBuffer.
Users can query the pg_stat_replication_slots view to check these stats.
---
 contrib/test_decoding/expected/stats.out  | 79 +--
 contrib/test_decoding/sql/stats.sql   | 48 +++
 doc/src/sgml/monitoring.sgml  | 25 ++
 src/backend/catalog/system_views.sql  |  2 +
 src/backend/postmaster/pgstat.c   |  6 ++
 src/backend/replication/logical/logical.c | 16 ++--
 .../replication/logical/reorderbuffer.c   | 17 
 src/backend/utils/adt/pgstatfuncs.c   |  8 +-
 src/include/catalog/pg_proc.dat   |  6 +-
 src/include/pgstat.h  |  4 +
 src/include/replication/reorderbuffer.h   |  4 +
 src/test/regress/expected/rules.out   |  4 +-
 12 files changed, 170 insertions(+), 49 deletions(-)

diff --git a/contrib/test_decoding/expected/stats.out b/contrib/test_decoding/expected/stats.out
index bca36fa903..bc8e601eab 100644
--- a/contrib/test_decoding/expected/stats.out
+++ b/contrib/test_decoding/expected/stats.out
@@ -8,7 +8,7 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
 
 CREATE TABLE stats_test(data text);
 -- function to wait for counters to advance
-CREATE FUNCTION wait_for_decode_stats(check_reset bool) RETURNS void AS $$
+CREATE FUNCTION wait_for_decode_stats(check_reset bool, check_spill_txns bool) RETURNS void AS $$
 DECLARE
   start_time timestamptz := clock_timestamp();
   updated bool;
@@ -16,12 +16,25 @@ BEGIN
   -- we don't want to wait forever; loop will exit after 30 seconds
   FOR i IN 1 .. 300 LOOP
 
--- check to see if all updates have been reset/updated
-SELECT CASE WHEN check_reset THEN (spill_txns = 0)
-ELSE 

Re: logical replication worker accesses catalogs in error context callback

2021-04-14 Thread Bharath Rupireddy
On Wed, Mar 17, 2021 at 4:52 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 16, 2021 at 2:21 AM Tom Lane  wrote:
> > > Thanks for pointing to the changes in the commit message. I corrected
> > > them. Attaching v4 patch set, consider it for further review.
> >
> > I took a quick look at this.  I'm quite worried about the potential
> > performance cost of the v4-0001 patch (the one for fixing
> > slot_store_error_callback).  Previously, we didn't pay any noticeable
> > cost for having the callback unless there actually was an error.
> > As patched, we perform several catalog lookups per column per row,
> > even in the non-error code path.  That seems like it'd be a noticeable
> > performance hit.  Just to add insult to injury, it leaks memory.
> >
> > I propose a more radical but simpler solution: let's just not bother
> > with including the type names in the context message.  How much are
> > they really buying?
>
> Thanks. In that case, the message can only return the local and remote
> columns names and ignore the types (something like below). And the
> user will have to figure out what are the types of those columns in
> local and remote separately in case of error. Then the function
> logicalrep_typmap_gettypname can also be removed. I'm not sure if this
> is okay. Thoughts?

Hi Tom,

As suggested earlier, I'm attaching a v5 patch that avoids printing
the column type names in the context message, so no cache lookups
have to be done in the error context callback. I think the column name
is enough to know which column the error occurred on, and if required
its type can be looked up by the user. This patch gets rid of printing
the local and remote type names in slot_store_error_callback and also
removes logicalrep_typmap_gettypname because it is now unnecessary. I'm
not sure if this solution is acceptable. Thoughts?
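
With that change, the callback reduces to something like the following (a
sketch of the idea, not necessarily the exact code in the attached patch):

    static void
    slot_store_error_callback(void *arg)
    {
        SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
        LogicalRepRelMapEntry *rel;

        /* Nothing to do if remote attribute number is not set */
        if (errarg->remote_attnum < 0)
            return;

        rel = errarg->rel;
        errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\"",
                   rel->remoterel.nspname, rel->remoterel.relname,
                   rel->remoterel.attnames[errarg->remote_attnum]);
    }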

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From a65e13d591bb151eb71eb8ec42834ce28a3db304 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 14 Apr 2021 13:25:16 +0530
Subject: [PATCH v5] Avoid Catalogue Accesses In slot_store_error_callback

Avoid accessing system catalogues inside slot_store_error_callback
error context callback, because the entire transaction might
have been broken at this point. logicalrep_typmap_gettypname()
and format_type_be could search the sys cache.

As per suggestion from Tom, a simple solution is to just avoid
printing the column type names in the message, just the column
name is enough to know on which column the error occurred.

The above solution makes logicalrep_typmap_gettypname redundant
hence removed it.
---
 src/backend/replication/logical/relation.c | 49 --
 src/backend/replication/logical/worker.c   | 25 +--
 src/include/replication/logicalrelation.h  |  2 -
 3 files changed, 2 insertions(+), 74 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index e861c0ff80..cbc5612dbe 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -490,55 +490,6 @@ logicalrep_typmap_update(LogicalRepTyp *remotetyp)
 	MemoryContextSwitchTo(oldctx);
 }
 
-/*
- * Fetch type name from the cache by remote type OID.
- *
- * Return a substitute value if we cannot find the data type; no message is
- * sent to the log in that case, because this is used by error callback
- * already.
- */
-char *
-logicalrep_typmap_gettypname(Oid remoteid)
-{
-	LogicalRepTyp *entry;
-	bool		found;
-
-	/* Internal types are mapped directly. */
-	if (remoteid < FirstGenbkiObjectId)
-	{
-		if (!get_typisdefined(remoteid))
-		{
-			/*
-			 * This can be caused by having a publisher with a higher
-			 * PostgreSQL major version than the subscriber.
-			 */
-			return psprintf("unrecognized %u", remoteid);
-		}
-
-		return format_type_be(remoteid);
-	}
-
-	if (LogicalRepTypMap == NULL)
-	{
-		/*
-		 * If the typemap is not initialized yet, we cannot possibly attempt
-		 * to search the hash table; but there's no way we know the type
-		 * locally yet, since we haven't received a message about this type,
-		 * so this is the best we can do.
-		 */
-		return psprintf("unrecognized %u", remoteid);
-	}
-
-	/* search the mapping */
-	entry = hash_search(LogicalRepTypMap, (void *) ,
-		HASH_FIND, );
-	if (!found)
-		return psprintf("unrecognized %u", remoteid);
-
-	Assert(OidIsValid(entry->remoteid));
-	return psprintf("%s.%s", entry->nspname, entry->typname);
-}
-
 /*
  * Partition cache: look up partition LogicalRepRelMapEntry's
  *
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fb3ba5c415..6494bef1d9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -132,7 +132,6 @@ static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
 typedef struct SlotErrCallbackArg
 {
 	LogicalRepRelMapEntry *rel;
-	int			local_attnum;
 	int			

Re: Monitoring stats docs inconsistency

2021-04-14 Thread Amit Kapila
On Tue, Apr 13, 2021 at 6:08 PM vignesh C  wrote:
>
> Few of the statistics description in monitoring_stats.sgml doc is not
> consistent. Made all the descriptions consistent by including
> punctuation marks at the end of each description.
> Thoughts?
>

I think monitoring.sgml uses a similar pattern as we use for system
catalogs. I am not sure of the rules in this regard but it appears
that normally for single line descriptions (for fields like OID, name,
etc.), we don't use a full stop at the end.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Add extra statistics to explain for Nested Loop

2021-04-14 Thread e . sokolova
Thank you for working on this issue. Your comments helped me make this 
patch more correct.


> Lines with "colon" format shouldn't use equal signs, and should use two
> spaces between fields.

Done. Now the extra line looks like "Loop min_rows: %.0f  max_rows: %.0f  total_rows: %.0f"
or "Loop min_time: %.3f  max_time: %.3f  min_rows: %.0f  max_rows: %.0f  total_rows: %.0f".

> Since this is now on a separate line, the "if (nloops > 1 && es->verbose)"
> can be after the existing "if (es->timing)", and shouldn't need its own
> "if (es->timing)".  It should conditionally add a separate line, rather
> than duplicating the "(actual.*" line.
>
> -   if (es->timing)
> +   if (nloops > 1 && es->verbose)

The new version of the patch contains this correction. It helped make the
patch shorter.

> In non-text mode, think you should not check "nloops > 1".  Rather, print
> the field as 0.

The fields will not be zeros. The new line will almost repeat the line with
the main statistics.

> I think the labels in non-text format should say "Loop Min Time" or
> similar.
>
> And these variables should have a loop_ prefix like loop_min_t ?

These are good ideas. I changed them.

I attach a new version of this patch. I hope it got better.
Please don't hesitate to share any thoughts on this topic.

--
Ekaterina Sokolova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From: "Ekaterina Sokolova" 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b62a76e7e5a..bf8c37baefd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1615,6 +1615,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		double		startup_ms = 1000.0 * planstate->instrument->startup / nloops;
 		double		total_ms = 1000.0 * planstate->instrument->total / nloops;
 		double		rows = planstate->instrument->ntuples / nloops;
+		double		loop_total_rows = planstate->instrument->ntuples;
+		double		loop_min_r = planstate->instrument->min_tuples;
+		double		loop_max_r = planstate->instrument->max_tuples;
+		double		loop_min_t_ms = 1000.0 * planstate->instrument->min_t;
+		double		loop_max_t_ms = 1000.0 * planstate->instrument->max_t;
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
@@ -1626,6 +1631,19 @@ ExplainNode(PlanState *planstate, List *ancestors,
 appendStringInfo(es->str,
  " (actual rows=%.0f loops=%.0f)",
  rows, nloops);
+			if (nloops > 1 && es->verbose)
+			{
+appendStringInfo(es->str, "\n");
+ExplainIndentText(es);
+if (es->timing)
+	appendStringInfo(es->str,
+	 "Loop min_time: %.3f  max_time: %.3f  min_rows: %.0f  max_rows: %.0f  total_rows: %.0f",
+	 loop_min_t_ms, loop_max_t_ms, loop_min_r, loop_max_r, loop_total_rows);
+else
+	appendStringInfo(es->str,
+	 "Loop min_rows: %.0f  max_rows: %.0f  total_rows: %.0f",
+	 loop_min_r, loop_max_r, loop_total_rows);
+			}
 		}
 		else
 		{
@@ -1635,8 +1653,21 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	 3, es);
 ExplainPropertyFloat("Actual Total Time", "s", total_ms,
 	 3, es);
+if (nloops > 1 && es->verbose)
+{
+	ExplainPropertyFloat("Loop Min Time", "s", loop_min_t_ms,
+		 3, es);
+	ExplainPropertyFloat("Loop Max Time", "s", loop_max_t_ms,
+		 3, es);
+}
 			}
 			ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+			if (nloops > 1 && es->verbose)
+			{
+ExplainPropertyFloat("Loop Min Rows", NULL, loop_min_r, 0, es);
+ExplainPropertyFloat("Loop Max Rows", NULL, loop_max_r, 0, es);
+ExplainPropertyFloat("Loop Total Rows", NULL, loop_total_rows, 0, es);
+			}
 			ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
 		}
 	}
@@ -1646,6 +1677,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			appendStringInfoString(es->str, " (never executed)");
 		else
 		{
+			/* without min and max values because actual result is 0 */
 			if (es->timing)
 			{
 ExplainPropertyFloat("Actual Startup Time", "ms", 0.0, 3, es);
@@ -1672,12 +1704,22 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			double		startup_ms;
 			double		total_ms;
 			double		rows;
+			double		loop_min_t_ms;
+			double		loop_max_t_ms;
+			double		loop_min_r;
+			double		loop_max_r;
+			double		loop_total_rows;
 
 			if (nloops <= 0)
 continue;
 			startup_ms = 1000.0 * instrument->startup / nloops;
 			total_ms = 1000.0 * instrument->total / nloops;
 			rows = instrument->ntuples / nloops;
+			loop_min_t_ms = 1000.0 * instrument->min_t;
+			loop_max_t_ms = 1000.0 * instrument->max_t;
+			loop_min_r = instrument->min_tuples;
+			loop_max_r = instrument->max_tuples;
+			loop_total_rows = instrument->ntuples;
 
 			ExplainOpenWorker(n, es);
 
@@ -1692,6 +1734,19 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	appendStringInfo(es->str,
 	 "actual rows=%.0f loops=%.0f\n",
 	 rows, nloops);
+if (nloops > 1)
+{
+  

Re: jsonb subscripting assignment performance

2021-04-14 Thread Joel Jacobson
On Wed, Apr 14, 2021, at 11:07, Oleg Bartunov wrote:
> I and Nikita are working on OLTP jsonb 
> http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfonline-2021.pdf
>  

Page 49/55 in the PDF:
"UPDATE test_toast SET jb = jsonb_set(jb, {keyN,0}, ?);"

Would you get similar improvements if updating jsonb variables in PL/pgSQL?
If not, could the infrastructure somehow be reused to improve the PL/pgSQL 
use-case as well?

I would be happy to help out if there is something I can do, such as testing.

/Joel

Re: [PATCH] PREPARE TRANSACTION unexpected behavior with TEMP TABLE

2021-04-14 Thread Himanshu Upadhyaya
Hi Robert,

Thanks for sharing your thoughts.
The purpose of this FIX is mainly to get consistent behavior with PREPARE
TRANSACTION. In the case that I had mentioned previously, my expectation
was that either both PREPARE TRANSACTION commands should fail or both
should succeed, but here the second, identical "PREPARE TRANSACTION"
succeeded while the first one failed with an error, which is kind of
weird to me.


I have also tried to reproduce the behavior.

[session:1]
postgres=# create temp table fullname (first text, last text);
CREATE TABLE

[session:2]
postgres=# select oid::regclass from pg_class where relname = 'fullname';
oid

 pg_temp_3.fullname

postgres=# BEGIN;
create function longname1(pg_temp_3.fullname) returns text language sql
as $$select $1.first || ' ' || $1.last$$;
BEGIN
CREATE FUNCTION
postgres=*# prepare transaction 'mytran2';
ERROR:  cannot PREPARE a transaction that has operated on temporary objects
postgres=# BEGIN;
create function longname1(pg_temp_3.fullname) returns text language sql
as $$select $1.first || ' ' || $1.last$$;
BEGIN
CREATE FUNCTION

[session:1]
postgres=# \q// no problem in exiting

[session:2]

postgres=*# prepare transaction 'mytran2';
PREPARE TRANSACTION
postgres=# \df
ERROR:  cache lookup failed for type 16429

Looking at the comment in the code, [session:1] should hang while exiting,
but I don't see a problem here; you have already explained that in your reply.
Even so, I feel that the behavior should be consistent when we mix temporary
objects with PREPARE TRANSACTION.

The comments in
> PrepareTransaction() justify this prohibition by saying that "Having
> the prepared xact hold locks on another backend's temp table seems
> a bad idea --- for instance it would prevent the backend from exiting.
> There are other problems too, such as how to clean up the source
> backend's local buffers and ON COMMIT state if the prepared xact
> includes a DROP of a temp table."
>
I can see from the above experiment that there is no problem with the
lock in this case, but I am not sure if there is any issue with "clean up
the source backend's local buffers"; if not, do we even need this
ERROR ("cannot PREPARE a transaction that has operated
on temporary objects") in PREPARE TRANSACTION?

To really fix this, you'd need CREATE FUNCTION to take a lock on the
> containing namespace, whether permanent or temporary. You'd also need
> every other CREATE statement that creates a schema-qualified object to
> do the same thing. Maybe that's a good idea, but we've been reluctant
> to go that far in the past due to performance consequences, and it's
> not clear whether any of those problems are related to the issue that
> prompted you to submit the patch.

Yes, the purpose of this patch is to have a valid value in
XACT_FLAGS_ACCESSEDTEMPNAMESPACE; that is, it should
always be true if we access a temporary object and false otherwise.
Even if we change "CREATE FUNCTION" to take a lock, we
also need this FIX in place so that "PREPARE TRANSACTION"
mixed with temporary objects is always rejected and does not
cause any hang issue (which we would start observing once we implement
those "CREATE STATEMENT" changes), as mentioned in the comment
in PrepareTransaction().

I am just wondering whether it's acceptable to FIX this and make it consistent
by properly setting XACT_FLAGS_ACCESSEDTEMPNAMESPACE, so that PREPARE
TRANSACTION always fails if we access a temporary object. I do agree that it
will not actually cause any issue because of the xact lock, but from a user's
perspective it seems weird that the same PREPARE TRANSACTION works from the
second attempt onwards. Thoughts?

Thanks,
Himanshu


On Thu, Apr 8, 2021 at 7:43 PM Robert Haas  wrote:

> On Wed, Apr 7, 2021 at 12:19 PM Himanshu Upadhyaya
>  wrote:
> > Please find attached the V2 patch.
>
> Hi,
>
> This patch is essentially taking the position that calling
> load_typcache_tupdesc before using that tupdesc for anything is
> required for correctness. I'm not sure whether that's a good
> architectural decision: to me, it looks like whoever wrote this code
> originally - I think it was Tom - had the idea that it would be OK to
> skip calling that function whenever we already have the value.
> Changing that has some small performance cost, and it also just looks
> kind of weird, because you don't expect a function called
> load_typcache_tupdesc() to have the side effect of preventing some
> kind of bad thing from happening. You just expect it to be loading
> stuff. The comments in this code are not exactly stellar as things
> stand, but the patch also doesn't update them in a meaningful way.
> Sure, it corrects a few comments that would be flat-out wrong
> otherwise, but it doesn't add any kind of explanation that would help
> the next person who looks at this code understand why they shouldn't
> just put back the exact same performance optimization you're proposing
> to rip 

Re: Unresolved repliaction hang and stop problem.

2021-04-14 Thread Amit Kapila
On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
 wrote:
>
> Hello,
> After upgrading the cluster from 10.x to 13.1 we've started getting a problem 
> describe pgsql-general:
> https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> We've noticed similar issue being described on this list in
> https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> with a fix being rolled out in 13.2.
>

The fix for the problem discussed in the above threads is committed
only in PG-14, see [1]. I don't know what makes you think it is fixed
in 13.2. Also, it is not easy to back-patch that because this fix
depends on some of the infrastructure introduced in PG-14.

[1] - 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

-- 
With Regards,
Amit Kapila.




Re: Truncate in synchronous logical replication failed

2021-04-14 Thread Amit Kapila
On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek
 wrote:
>
> > On 12 Apr 2021, at 08:58, Amit Kapila  wrote:
> >
> > The problem happens only when we try to fetch IDENTITY_KEY attributes
> > because pgoutput uses RelationGetIndexAttrBitmap() to get that
> > information which locks the required indexes. Now, because TRUNCATE
> > has already acquired an exclusive lock on the index, it seems to
> > create a sort of deadlock where the actual Truncate operation waits
> > for logical replication of operation to complete and logical
> > replication waits for actual Truncate operation to finish.
> >
> > Do we really need to use RelationGetIndexAttrBitmap() to build
> > IDENTITY_KEY attributes? During decoding, we don't even lock the main
> > relation, we just scan the system table and build that information
> > using a historic snapshot. Can't we do something similar here?
> >
> > Adding Petr J. and Peter E. to know their views as this seems to be an
> > old problem (since the decoding of Truncate operation is introduced).
>
> We used RelationGetIndexAttrBitmap because it already existed, no other 
> reason.
>

Fair enough. But I think we should do something about it because using
the same function (RelationGetIndexAttrBitmap) just breaks synchronous
logical replication. I think this has been broken since logical
replication of Truncate was introduced.

> I am not sure what exact locking we need but I would have guessed at least 
> AccessShareLock would be needed.
>

Are you saying that we need AccessShareLock on the index? If so, why
is it different from how we access the relation during decoding
(basically in ReorderBufferProcessTXN, we directly use
RelationIdGetRelation() without any lock on the relation)? I think we
do it that way because we need it to process WAL entries and we need
the relation state based on the historic snapshot, so, even if the
relation is later changed/dropped, we are fine with the old state we
got. Doesn't the same thing apply here in logicalrep_write_attrs? If
that is true, then some equivalent of RelationGetIndexAttrBitmap where
we use RelationIdGetRelation instead of index_open should work. Am I
missing something?
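
To make the idea concrete, a rough sketch of the per-index loop without
taking any lock (ignoring expression indexes and other details that the
real RelationGetIndexAttrBitmap handles):

    Relation    indexDesc = RelationIdGetRelation(indexOid);   /* no lock */

    for (i = 0; i < indexDesc->rd_index->indnkeyatts; i++)
    {
        int     attrnum = indexDesc->rd_index->indkey.values[i];

        if (attrnum != 0)
            attrs = bms_add_member(attrs,
                                   attrnum - FirstLowInvalidHeapAttributeNumber);
    }
    RelationClose(indexDesc);

instead of the index_open(indexOid, AccessShareLock) that
RelationGetIndexAttrBitmap() does today.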

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2021-04-14 Thread Andrei Zubkov
On Wed, 2021-04-14 at 17:32 +0800, Julien Rouhaud wrote:
> 
> did you enable compute_query_id new parameter? 

Hi, Julien!
Thank you very much! I've missed it.
> 


Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2021-04-14 Thread Julien Rouhaud
On Wed, Apr 14, 2021 at 17:22, Andrei Zubkov  wrote:

>
> But I'm unable to test the patch - it seems that pg_stat_statements is
> receiving queryId = 0 for every statements in every hook now and
> statements are not tracked at all.
>
> Am I mistaken somewhere? Maybe you know why this is happening?
>

did you enable compute_query_id new parameter?
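
(It still defaults to off on current master, so something like

    SET compute_query_id = on;

is needed before pg_stat_statements will see non-zero query IDs.)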

>


Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2021-04-14 Thread Andrei Zubkov
Hi, Kuroda!

I intended to change the pg_stat_statements version while rebasing
this patch onto the current state of the master branch, which is now commit
07e5e66.

But I'm unable to test the patch - it seems that pg_stat_statements is
receiving queryId = 0 for every statement in every hook now, and
statements are not tracked at all.

Am I mistaken somewhere? Maybe you know why this is happening?

Thank you!
-- 
Andrei Zubkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Fri, 2021-04-09 at 00:23 +, kuroda.hay...@fujitsu.com wrote:
> Dear Andrei,
> 
> > I think, yes, version of pg_stat_statements is need to be updated.
> > Is
> > it will be 1.10 in PG15?
> 
> I think you are right. 
> According to [1] we can bump up the version per one PG major version,
> and any features are not committed yet for 15.
> 
> [1]: https://www.postgresql.org/message-id/20201202040516.GA43757@nol
> 
> 
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED
> 





Re: jsonb subscripting assignment performance

2021-04-14 Thread Pavel Stehule
st 14. 4. 2021 v 11:07 odesílatel Oleg Bartunov 
napsal:

>
>
> On Wed, Apr 14, 2021 at 11:09 AM Pavel Stehule 
> wrote:
>
>>
>>
>> st 14. 4. 2021 v 9:57 odesílatel Dmitry Dolgov <9erthali...@gmail.com>
>> napsal:
>>
>>> > On Wed, Apr 14, 2021 at 09:20:08AM +0200, Pavel Stehule wrote:
>>> > st 14. 4. 2021 v 7:39 odesílatel Joel Jacobson 
>>> napsal:
>>> >
>>> > > Hi,
>>> > >
>>> > > commit 676887a3 added support for jsonb subscripting.
>>> > >
>>> > > Many thanks for working on this. I really like the improved syntax.
>>> > >
>>> > > I was also hoping for some performance benefits,
>>> > > but my testing shows that
>>> > >
>>> > >jsonb_value['existing_key'] = new_value;
>>> > >
>>> > > takes just as long time as
>>> > >
>>> > >jsonb_value := jsonb_set(jsonb_value, ARRAY['existing_key'],
>>> new_value);
>>> > >
>>> > > which is a bit surprising to me. Shouldn't subscripting be a lot
>>> faster,
>>> > > since it could modify the existing data structure in-place? What am I
>>> > > missing here?
>>> > >
>>> >
>>> > no - it doesn't support in-place modification. Only arrays and records
>>> > support it.
>>> >
>>> >
>>> > > I came to think of the this new functionality when trying to
>>> optimize some
>>> > > PL/pgSQL code where the bottle-neck turned out to be lots of calls
>>> > > to jsonb_set() for large jsonb objects.
>>> > >
>>> >
>>> > sure - there is big room for optimization. But this patch was big
>>> enough
>>> > without its optimization. And it was not clean, if I will be committed
>>> or
>>> > not (it waited in commitfest application for 4 years). So I accepted
>>> > implemented behaviour (without inplace update). Now, this patch is in
>>> core,
>>> > and anybody can work on others possible optimizations.
>>>
>>> Right, jsonb subscripting deals mostly with the syntax part and doesn't
>>> change internal jsonb behaviour. If I understand the original question
>>> correctly, "in-place" here means updating of e.g. just one particular
>>> key within a jsonb object, since jsonb_set looks like an overwrite of
>>> the whole jsonb. If so, then update will still cause the whole jsonb to
>>> be updated, there is no partial update functionality for the on-disk
>>> format. Although there is work going on to optimize this in case when
>>> jsonb is big enough to be put into a toast table (partial toast
>>> decompression thread, or bytea appendable toast).
>>>
>>
>> Almost all and almost everywhere Postgres's values are immutable. There
>> is only one exception - runtime plpgsql. "local variables" can hold values
>> of complex values unboxed. Then the repeated update is significantly
>> cheaper. Normal non repeated updates have the same speed, because the value
>> should be unboxed and boxed. Outside plpgsql the values are immutable. I
>> think this is a very hard problem, how to update big toasted values
>> effectively, and I am not sure if there is a solution. TOAST value is
>> immutable. It needs to introduce some alternative to TOAST. The benefits
>> are clear - it can be nice to have fast append arrays for time series. But
>> this is a very different topic.
>>
>
> I and Nikita are working on OLTP jsonb
> http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfonline-2021.pdf
>

+1

Pavel


>
>>
>> Regards
>>
>> Pavel
>>
>>
>>
>>
>>
>>
>>
>>
>
> --
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


Re: jsonb subscripting assignment performance

2021-04-14 Thread Oleg Bartunov
On Wed, Apr 14, 2021 at 11:09 AM Pavel Stehule 
wrote:

>
>
> st 14. 4. 2021 v 9:57 odesílatel Dmitry Dolgov <9erthali...@gmail.com>
> napsal:
>
>> > On Wed, Apr 14, 2021 at 09:20:08AM +0200, Pavel Stehule wrote:
>> > st 14. 4. 2021 v 7:39 odesílatel Joel Jacobson 
>> napsal:
>> >
>> > > Hi,
>> > >
>> > > commit 676887a3 added support for jsonb subscripting.
>> > >
>> > > Many thanks for working on this. I really like the improved syntax.
>> > >
>> > > I was also hoping for some performance benefits,
>> > > but my testing shows that
>> > >
>> > >jsonb_value['existing_key'] = new_value;
>> > >
>> > > takes just as long time as
>> > >
>> > >jsonb_value := jsonb_set(jsonb_value, ARRAY['existing_key'],
>> new_value);
>> > >
>> > > which is a bit surprising to me. Shouldn't subscripting be a lot
>> faster,
>> > > since it could modify the existing data structure in-place? What am I
>> > > missing here?
>> > >
>> >
>> > no - it doesn't support in-place modification. Only arrays and records
>> > support it.
>> >
>> >
>> > > I came to think of the this new functionality when trying to optimize
>> some
>> > > PL/pgSQL code where the bottle-neck turned out to be lots of calls
>> > > to jsonb_set() for large jsonb objects.
>> > >
>> >
>> > sure - there is big room for optimization. But this patch was big enough
>> > without its optimization. And it was not clean, if I will be committed
>> or
>> > not (it waited in commitfest application for 4 years). So I accepted
>> > implemented behaviour (without inplace update). Now, this patch is in
>> core,
>> > and anybody can work on others possible optimizations.
>>
>> Right, jsonb subscripting deals mostly with the syntax part and doesn't
>> change internal jsonb behaviour. If I understand the original question
>> correctly, "in-place" here means updating of e.g. just one particular
>> key within a jsonb object, since jsonb_set looks like an overwrite of
>> the whole jsonb. If so, then update will still cause the whole jsonb to
>> be updated, there is no partial update functionality for the on-disk
>> format. Although there is work going on to optimize this in case when
>> jsonb is big enough to be put into a toast table (partial toast
>> decompression thread, or bytea appendable toast).
>>
>
> Almost all and almost everywhere Postgres's values are immutable. There is
> only one exception - runtime plpgsql. "local variables" can hold values of
> complex values unboxed. Then the repeated update is significantly cheaper.
> Normal non repeated updates have the same speed, because the value should
> be unboxed and boxed. Outside plpgsql the values are immutable. I think
> this is a very hard problem, how to update big toasted values effectively,
> and I am not sure if there is a solution. TOAST value is immutable. It
> needs to introduce some alternative to TOAST. The benefits are clear - it
> can be nice to have fast append arrays for time series. But this is a very
> different topic.
>

Nikita and I are working on OLTP jsonb:
http://www.sai.msu.su/~megera/postgres/talks/jsonb-pgconfonline-2021.pdf


>
> Regards
>
> Pavel
>
>
>
>
>
>
>
>

-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: More sepgsql weirdness

2021-04-14 Thread Dave Page
Hi

On Tue, Apr 13, 2021 at 6:22 PM Robert Haas  wrote:

> On Tue, Apr 13, 2021 at 10:33 AM Dave Page  wrote:
> > On a system with selinux and sepgsql configured, search path resolution
> appears to fail if sepgsql is in enforcing mode, but selinux is in
> permissive mode (which, as I understand it, should cause sepgsql to behave
> as if it's in permissive mode anyway - and does for other operations).
> Regardless of whether my understanding of the interaction of the two
> permissive modes is correct, I don't believe the following should happen:
>
> I agree that this sounds like something which shouldn't happen if the
> system is in permissive mode,


I realised that my test database hadn't had the sepgsql SQL script run in
it (I must have created it before running it on template1). I guess the
error was caused by lack of proper labelling.

So, clearly my fault, but I think there are a couple of things we need to
do here:

1) Improve the docs for sepgsql. The *only* vaguely useful source of info
I've found on using this is "SELinux System Administration", a Packt book
by Sven Vermeulen. Our own docs don't even list the supported object
classes (e.g. db_table) or types (e.g. sepgsql_ro_table_t) for example.

2) Improve the way we handle cases like the one I ran into. I only realised
what was going on when I tried to run sepgsql_getcon() to confirm I was
running in undefined_t (see the example below). Clearly very weird things can
happen if labelling hasn't been run; perhaps we could raise a notice if the
sepgsql module is loaded but sepgsql_getcon() isn't present (though that
seems flaky at best)? I'd hesitate to check for the presence of one or more
labels, as the admin could have intentionally removed or changed them of
course.
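As a quick illustration of that check (a sketch; the label shown is only an
example and assumes the sepgsql module has been installed in the database in
question):

  SELECT sepgsql_getcon();
  --                    sepgsql_getcon
  -- -----------------------------------------------------
  --  unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
  -- (1 row)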

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


RE: Truncate in synchronous logical replication failed

2021-04-14 Thread osumi.takami...@fujitsu.com
On Wednesday, April 14, 2021 11:38 AM Japin Li  wrote:
> On Tue, 13 Apr 2021 at 21:54, osumi.takami...@fujitsu.com
>  wrote:
> > On Monday, April 12, 2021 3:58 PM Amit Kapila 
> wrote:
> >> On Mon, Apr 12, 2021 at 10:03 AM osumi.takami...@fujitsu.com
> >>  wrote:
> >> > but if we take a measure to fix the doc, we have to be careful for
> >> > the description, because when we remove the primary keys of 'test'
> >> > tables on the
> >> scenario in [1], we don't have this issue.
> >> > It means TRUNCATE in synchronous logical replication is not always
> >> blocked.
> >> >
> >>
> >> The problem happens only when we try to fetch IDENTITY_KEY attributes
> >> because pgoutput uses RelationGetIndexAttrBitmap() to get that
> >> information which locks the required indexes. Now, because TRUNCATE
> >> has already acquired an exclusive lock on the index, it seems to
> >> create a sort of deadlock where the actual Truncate operation waits
> >> for logical replication of operation to complete and logical replication 
> >> waits
> for actual Truncate operation to finish.
> >>
> >> Do we really need to use RelationGetIndexAttrBitmap() to build
> >> IDENTITY_KEY attributes? During decoding, we don't even lock the main
> >> relation, we just scan the system table and build that information
> >> using a historic snapshot. Can't we do something similar here?
> > I think we can build the IDENTITY_KEY attributes with NoLock instead
> > of calling RelationGetIndexAttrBitmap().
> >
> > When we trace back the caller side of logicalrep_write_attrs(), doing
> > the thing equivalent to RelationGetIndexAttrBitmap() for
> > INDEX_ATTR_BITMAP_IDENTITY_KEY impacts only pgoutput_truncate.
> >
> > OTOH, I can't find codes similar to RelationGetIndexAttrBitmap() in
> > pgoutput_* functions and in the file of relcache.c.
> > Therefore, I'd like to discuss how to address the hang.
> >
> > My first idea is to extract some parts of RelationGetIndexAttrBitmap()
> > only for INDEX_ATTR_BITMAP_IDENTITY_KEY and implement those either
> in
> > a logicalrep_write_attrs() or as a new function.
> > RelationGetIndexAttrBitmap() has 'restart' label for goto statement in
> > order to ensure to return up-to-date attribute bitmaps, so I prefer
> > having a new function when we choose this direction.
> > Having that goto in logicalrep_write_attrs() makes it a little bit messy, I 
> > felt.
> >
> > The other direction might be to extend RelationGetIndexAttrBitmap's
> > function definition to accept lockmode to give NoLock from
> logicalrep_write_attrs().
> > But, this change impacts on other several callers so is not as good as the 
> > first
> direction above, I think.
> >
> > If someone has any better idea, please let me know.
> >
> 
> I think the first idea is better than the second.  OTOH, can we release the 
> locks
> before SyncRepWaitForLSN(), since it already flush to local WAL files.
Thank you for your comments.
I didn't mean to touch the TRUNCATE side to release the locks,
because I expected that its AccessExclusiveLock would protect the table
from any other operation (e.g. DROP INDEX) that affects IDENTITY KEY
building. But, as I said in another e-mail, neither of the ideas above
can work. Really sorry for the noise.
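For anyone trying to reproduce the hang being discussed, a rough sketch of
the scenario (object names and the connection string are illustrative, and
synchronous_standby_names is assumed to match the subscription name):

  -- publisher, with synchronous_standby_names = 'sub' in postgresql.conf
  CREATE TABLE test (id int PRIMARY KEY);
  CREATE PUBLICATION pub FOR TABLE test;

  -- subscriber
  CREATE TABLE test (id int PRIMARY KEY);
  CREATE SUBSCRIPTION sub CONNECTION 'host=publisher dbname=postgres' PUBLICATION pub;

  -- back on the publisher: this hangs, because TRUNCATE holds
  -- AccessExclusiveLock on the primary key index while the walsender's
  -- RelationGetIndexAttrBitmap() waits for a lock on that index, and the
  -- TRUNCATE in turn waits for the synchronous standby to confirm.
  TRUNCATE test;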


Best Regards,
Takamichi Osumi





File truncation within PostgresNode::issues_sql_like() wrong on Windows

2021-04-14 Thread Michael Paquier
Hi all,

As fairywren has proved a couple of days ago, it is not really a good
idea to rely on a file truncation to check for patterns in the logs of
the backend:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-04-07%2013%3A29%3A28

Visibly, logic based on log file truncation fails on Windows because of
concurrent access by the backend that writes its logs there.  In
PostgresNode.pm, connect_ok() and connect_fails() enforce a rotation of
the log file before restarting the server on Windows, to make sure that
a given step does not find logs generated by a previous test, but that's
not the case for issues_sql_like().  Looking at the existing tests using
this routine (src/bin/scripts/), I have found one test in 090_reindexdb.pl
that could lead to a false positive.  The test is marked in the attached
patch, just for awareness.

Would there be any objections to changing this routine so that we avoid
the file truncation on Windows?  The attached patch achieves that.

Any thoughts?
--
Michael
diff --git a/src/bin/scripts/t/090_reindexdb.pl b/src/bin/scripts/t/090_reindexdb.pl
index 159b637230..8b06218d6b 100644
--- a/src/bin/scripts/t/090_reindexdb.pl
+++ b/src/bin/scripts/t/090_reindexdb.pl
@@ -174,6 +174,7 @@ $node->command_fails(
 $node->command_fails(
 	[ 'reindexdb', '-j', '2', '-i', 'i1', 'postgres' ],
 	'parallel reindexdb cannot process indexes');
+# XXX The first query maps with a test above.
 $node->issues_sql_like(
 	[ 'reindexdb', '-j', '2', 'postgres' ],
 	qr/statement:\ REINDEX SYSTEM postgres;
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e26b2b3f30..9daa438ccc 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2195,7 +2195,8 @@ Run a command on the node, then verify that $expected_sql appears in the
 server log file.
 
 Reads the whole log file so be careful when working with large log outputs.
-The log file is truncated prior to running the command, however.
+The log file is truncated prior to running the command, and on Windows, a
+rotation of the log file is done before restarting the node.
 
 =cut
 
@@ -2207,7 +2208,18 @@ sub issues_sql_like
 
 	local %ENV = $self->_get_env();
 
-	truncate $self->logfile, 0;
+	# On Windows, the truncation would not work, so rotate the log
+	# file before restarting the server afresh.
+	if ($TestLib::windows_os)
+	{
+		$self->rotate_logfile;
+		$self->restart;
+	}
+	else
+	{
+		truncate $self->logfile, 0;
+	}
+
 	my $result = TestLib::run_log($cmd);
 	ok($result, "@$cmd exit code 0");
 	my $log = TestLib::slurp_file($self->logfile);


signature.asc
Description: PGP signature


Re: jsonb subscripting assignment performance

2021-04-14 Thread Pavel Stehule
On Wed, Apr 14, 2021 at 9:57, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Wed, Apr 14, 2021 at 09:20:08AM +0200, Pavel Stehule wrote:
> > st 14. 4. 2021 v 7:39 odesílatel Joel Jacobson 
> napsal:
> >
> > > Hi,
> > >
> > > commit 676887a3 added support for jsonb subscripting.
> > >
> > > Many thanks for working on this. I really like the improved syntax.
> > >
> > > I was also hoping for some performance benefits,
> > > but my testing shows that
> > >
> > >jsonb_value['existing_key'] = new_value;
> > >
> > > takes just as long time as
> > >
> > >jsonb_value := jsonb_set(jsonb_value, ARRAY['existing_key'],
> new_value);
> > >
> > > which is a bit surprising to me. Shouldn't subscripting be a lot
> faster,
> > > since it could modify the existing data structure in-place? What am I
> > > missing here?
> > >
> >
> > no - it doesn't support in-place modification. Only arrays and records
> > support it.
> >
> >
> > > I came to think of the this new functionality when trying to optimize
> some
> > > PL/pgSQL code where the bottle-neck turned out to be lots of calls
> > > to jsonb_set() for large jsonb objects.
> > >
> >
> > sure - there is big room for optimization. But this patch was big enough
> > without its optimization. And it was not clean, if I will be committed or
> > not (it waited in commitfest application for 4 years). So I accepted
> > implemented behaviour (without inplace update). Now, this patch is in
> core,
> > and anybody can work on others possible optimizations.
>
> Right, jsonb subscripting deals mostly with the syntax part and doesn't
> change internal jsonb behaviour. If I understand the original question
> correctly, "in-place" here means updating of e.g. just one particular
> key within a jsonb object, since jsonb_set looks like an overwrite of
> the whole jsonb. If so, then update will still cause the whole jsonb to
> be updated, there is no partial update functionality for the on-disk
> format. Although there is work going on to optimize this in case when
> jsonb is big enough to be put into a toast table (partial toast
> decompression thread, or bytea appendable toast).
>

Almost all values in Postgres, almost everywhere, are immutable. There is
only one exception - PL/pgSQL at runtime. "Local variables" can hold complex
values unboxed, so repeated updates become significantly cheaper. A single,
non-repeated update has the same speed, because the value still has to be
unboxed and boxed again. Outside PL/pgSQL, values are immutable. I think
updating big TOASTed values efficiently is a very hard problem, and I am not
sure there is a solution. A TOAST value is immutable; it would require
introducing some alternative to TOAST. The benefits are clear - it would be
nice to have fast append arrays for time series. But this is a very
different topic.
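A minimal sketch of that contrast (assuming PG 14+ for the jsonb
subscripting syntax; the loop counts are arbitrary):

  DO $$
  DECLARE
    a int[] := array_fill(0, ARRAY[100000]);
    j jsonb := '{"k": 0}';
  BEGIN
    -- arrays (and record fields) are kept unboxed in PL/pgSQL variables,
    -- so repeated element assignment avoids rebuilding the whole value
    FOR i IN 1..100000 LOOP
      a[1] := i;
    END LOOP;
    -- jsonb has no such fast path: each assignment rebuilds the whole datum
    FOR i IN 1..100000 LOOP
      j['k'] := to_jsonb(i);
    END LOOP;
  END
  $$;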

Regards

Pavel


Re: jsonb subscripting assignment performance

2021-04-14 Thread Dmitry Dolgov
> On Wed, Apr 14, 2021 at 09:20:08AM +0200, Pavel Stehule wrote:
> st 14. 4. 2021 v 7:39 odesílatel Joel Jacobson  napsal:
>
> > Hi,
> >
> > commit 676887a3 added support for jsonb subscripting.
> >
> > Many thanks for working on this. I really like the improved syntax.
> >
> > I was also hoping for some performance benefits,
> > but my testing shows that
> >
> >jsonb_value['existing_key'] = new_value;
> >
> > takes just as long time as
> >
> >jsonb_value := jsonb_set(jsonb_value, ARRAY['existing_key'], new_value);
> >
> > which is a bit surprising to me. Shouldn't subscripting be a lot faster,
> > since it could modify the existing data structure in-place? What am I
> > missing here?
> >
>
> no - it doesn't support in-place modification. Only arrays and records
> support it.
>
>
> > I came to think of the this new functionality when trying to optimize some
> > PL/pgSQL code where the bottle-neck turned out to be lots of calls
> > to jsonb_set() for large jsonb objects.
> >
>
> sure - there is big room for optimization. But this patch was big enough
> without its optimization. And it was not clean, if I will be committed or
> not (it waited in commitfest application for 4 years). So I accepted
> implemented behaviour (without inplace update). Now, this patch is in core,
> and anybody can work on others possible optimizations.

Right, jsonb subscripting deals mostly with the syntax part and doesn't
change internal jsonb behaviour. If I understand the original question
correctly, "in-place" here means updating just one particular key within a
jsonb object, since jsonb_set looks like an overwrite of the whole jsonb. If
so, an update will still cause the whole jsonb to be rewritten; there is no
partial-update functionality for the on-disk format. Although there is work
going on to optimize this when the jsonb is big enough to be put into a
TOAST table (the partial TOAST decompression thread, or bytea appendable
TOAST).




Re: jsonb subscripting assignment performance

2021-04-14 Thread Joel Jacobson
On Wed, Apr 14, 2021, at 09:20, Pavel Stehule wrote:
> sure - there is big room for optimization. But this patch was big enough 
> without its optimization. And it was not clean, if I will be committed or not 
> (it waited in commitfest application for 4 years). So I accepted implemented 
> behaviour (without inplace update). Now, this patch is in core, and anybody 
> can work on others possible optimizations.

Thanks for explaining.

Do we have a rough idea of how in-place updates could be implemented in a
non-invasive, non-controversial way that ought to be accepted by the project,
if done right? Or are there other complicated problems that need to be solved
first?

I'm asking because I could be interested in working on this, but I know my
limitations when it comes to C, so I want to get an idea of whether it should
be more or less straightforward, or whether we already know beforehand that it
would require committer-level expertise of the PostgreSQL code base to have
any realistic chance of success.

/Joel

RE: Truncate in synchronous logical replication failed

2021-04-14 Thread osumi.takami...@fujitsu.com
On Tuesday, April 13, 2021 11:38 PM Petr Jelinek 
 wrote:
> > On 12 Apr 2021, at 08:58, Amit Kapila  wrote:
> > On Mon, Apr 12, 2021 at 10:03 AM osumi.takami...@fujitsu.com
> >  wrote:
> >>
> >>> I checked the PG-DOC, found it says that “Replication of TRUNCATE
> >>> commands is supported”[1], so maybe TRUNCATE is not supported in
> >>> synchronous logical replication?
> >>>
> >>> If my understanding is right, maybe PG-DOC can be modified like
> >>> this. Any thought?
> >>> Replication of TRUNCATE commands is supported
> >>> ->
> >>> Replication of TRUNCATE commands is supported in asynchronous
> mode
> >> I'm not sure if this becomes the final solution,
> >>
> >
> > I think unless the solution is not possible or extremely complicated
> > going via this route doesn't seem advisable.
> >
> >> but if we take a measure to fix the doc, we have to be careful for
> >> the description, because when we remove the primary keys of 'test' tables
> on the scenario in [1], we don't have this issue.
> >> It means TRUNCATE in synchronous logical replication is not always
> blocked.
> >>
> >
> > The problem happens only when we try to fetch IDENTITY_KEY attributes
> > because pgoutput uses RelationGetIndexAttrBitmap() to get that
> > information which locks the required indexes. Now, because TRUNCATE
> > has already acquired an exclusive lock on the index, it seems to
> > create a sort of deadlock where the actual Truncate operation waits
> > for logical replication of operation to complete and logical
> > replication waits for actual Truncate operation to finish.
> >
> > Do we really need to use RelationGetIndexAttrBitmap() to build
> > IDENTITY_KEY attributes? During decoding, we don't even lock the main
> > relation, we just scan the system table and build that information
> > using a historic snapshot. Can't we do something similar here?
> >
> > Adding Petr J. and Peter E. to know their views as this seems to be an
> > old problem (since the decoding of Truncate operation is introduced).
> 
> We used RelationGetIndexAttrBitmap because it already existed, no other
> reason.I am not sure what exact locking we need but I would have guessed at
> least AccessShareLock would be needed.
This was true.

Looking at the comment of index_open(), there is a description of the basic
rule that NoLock should be used only if an appropriate lock on the index has
already been taken. And making the walsender use NoLock to build the
attributes leads us to the Assert in relation_open().

Please ignore the two ideas I suggested in another mail; they don't follow
this basic rule and don't work.

Best Regards,
Takamichi Osumi



View invoker privileges

2021-04-14 Thread Ivan Ivanov
Hello guys!
In Postgres we can create views with view-owner privileges only. What's the
reason there is no option to create a view with invoker privileges? Are there
any technical or security subtleties related to the absence of this feature?
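A minimal illustration of the current behaviour (role, table and view names
are made up, and default public-schema permissions are assumed):

  CREATE ROLE view_owner;
  CREATE ROLE invoker;
  CREATE TABLE secret (val text);
  GRANT SELECT ON secret TO view_owner;

  SET ROLE view_owner;
  CREATE VIEW v AS SELECT * FROM secret;
  GRANT SELECT ON v TO invoker;
  RESET ROLE;

  SET ROLE invoker;
  SELECT * FROM v;  -- allowed: access to "secret" is checked against the
                    -- view owner's privileges, not the invoking role's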



Re: jsonb subscripting assignment performance

2021-04-14 Thread Pavel Stehule
On Wed, Apr 14, 2021 at 7:39, Joel Jacobson wrote:

> Hi,
>
> commit 676887a3 added support for jsonb subscripting.
>
> Many thanks for working on this. I really like the improved syntax.
>
> I was also hoping for some performance benefits,
> but my testing shows that
>
>jsonb_value['existing_key'] = new_value;
>
> takes just as long time as
>
>jsonb_value := jsonb_set(jsonb_value, ARRAY['existing_key'], new_value);
>
> which is a bit surprising to me. Shouldn't subscripting be a lot faster,
> since it could modify the existing data structure in-place? What am I
> missing here?
>

no - it doesn't support in-place modification. Only arrays and records
support it.


> I came to think of the this new functionality when trying to optimize some
> PL/pgSQL code where the bottle-neck turned out to be lots of calls
> to jsonb_set() for large jsonb objects.
>

Sure - there is big room for optimization. But this patch was big enough
without it. And it was not clear whether it would be committed or not (it
waited in the commitfest application for 4 years). So I accepted the
implemented behaviour (without in-place update). Now that this patch is in
core, anybody can work on other possible optimizations.

Regards

Pavel


>
> Here is the output from attached bench:
>
> n=1
> 00:00:00.002628 jsonb := jsonb_set(jsonb, ARRAY[existing key], value);
> 00:00:00.002778 jsonb := jsonb_set(jsonb, ARRAY[new key], value);
> 00:00:00.002332 jsonb[existing key] := value;
> 00:00:00.002794 jsonb[new key] := value;
> n=10
> 00:00:00.042843 jsonb := jsonb_set(jsonb, ARRAY[existing key], value);
> 00:00:00.046515 jsonb := jsonb_set(jsonb, ARRAY[new key], value);
> 00:00:00.044974 jsonb[existing key] := value;
> 00:00:00.075429 jsonb[new key] := value;
> n=100
> 00:00:00.420808 jsonb := jsonb_set(jsonb, ARRAY[existing key], value);
> 00:00:00.449622 jsonb := jsonb_set(jsonb, ARRAY[new key], value);
> 00:00:00.31834 jsonb[existing key] := value;
> 00:00:00.527904 jsonb[new key] := value;
>
> Many thanks for clarifying.
>
> Best regards,
>
> Joel
>


Re: [PATCH] force_parallel_mode and GUC categories

2021-04-14 Thread Michael Paquier
On Tue, Apr 13, 2021 at 07:31:39AM -0500, Justin Pryzby wrote:
> Good point.

Thanks.  I have used the wording that Tom has proposed upthread, added
one GUC_NOT_IN_SAMPLE that you forgot, and applied the
force_parallel_mode patch.
--
Michael


signature.asc
Description: PGP signature


Re: Replication slot stats misgivings

2021-04-14 Thread Amit Kapila
On Tue, Apr 13, 2021 at 1:37 PM vignesh C  wrote:
>
> On Mon, Apr 12, 2021 at 7:03 PM Masahiko Sawada  wrote:
> >
> >
> > The following test for the latest v8 patch seems to show different.
> > total_bytes is 1808 whereas spill_bytes is 1320. Am I missing
> > something?
> >
> > postgres(1:85969)=# select pg_create_logical_replication_slot('s',
> > 'test_decoding');
> >  pg_create_logical_replication_slot
> > 
> >  (s,0/1884468)
> > (1 row)
> >
> > postgres(1:85969)=# create table a (i int);
> > CREATE TABLE
> > postgres(1:85969)=# insert into a select generate_series(1, 10);
> > INSERT 0 10
> > postgres(1:85969)=# set logical_decoding_work_mem to 64;
> > SET
> > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> >  slot_name | total_txns | total_bytes | spill_txns | spill_count |
> > spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > ---++-++-+-+-+--+--+-
> >  s |  0 |   0 |  0 |   0 |
> >   0 |   0 |0 |0 |
> > (1 row)
> >
> > postgres(1:85969)=# select count(*) from
> > pg_logical_slot_peek_changes('s', NULL, NULL);
> >  count
> > 
> >  14
> > (1 row)
> >
> > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> >  slot_name | total_txns | total_bytes | spill_txns | spill_count |
> > spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > ---++-++-+-+-+--+--+-
> >  s |  2 |1808 |  1 | 202 |
> > 1320 |   0 |0 |0 |
> > (1 row)
> >
>
> Thanks for identifying this issue, while spilling the transactions
> reorder buffer changes gets released, we will not be able to get the
> total size for spilled transactions from reorderbuffer size. I have
> fixed it by including spilledbytes to totalbytes in case of spilled
> transactions. Attached patch has the fix for this.
> Thoughts?
>

I am not sure that is the best way to fix it, because sometimes we clear
the serialized flag, in which case it might not give the correct answer.
Another way to fix it could be to update the totalBytes counter before we
try to restore a new set of changes. See the attached patch on top of your
v6-0002-* patch.

-- 
With Regards,
Amit Kapila.


fix_spilled_stats_1.patch
Description: Binary data