date:20200908





On 2020/09/09 14:15, Etsuro Fujita wrote:

Hi,

Attached is a patch to standardize Japanese names as given-name-first
in the v13 contributors list as before.


Using given-name-first order is our consensus? I was thinking we have not
reached that yet and our "vague" consensus was to use the name that each
contributor prefers, for example the name that used in the email signature, etc.

BTW, if possible I'd like to use family-name-first, i.e., "Fujii Masao".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: v13: CLUSTER segv with wal_level=minimal and parallel index creation

2020-09-08 Thread Kyotaro Horiguchi

Thank you for the checking.

At Tue, 8 Sep 2020 21:05:43 -0700, Noah Misch  wrote in 
> On Mon, Sep 07, 2020 at 07:47:09PM -0700, Noah Misch wrote:
> After looking closer, I've moved the test to reindex_catalog.sql; see that
> file's header comment for the reasons.  One now needs this command:
> 
>   printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
> 'max_wal_senders = 0' >/tmp/minimal.conf
>   make check-tests TESTS=reindex_catalog TEMP_CONFIG=/tmp/minimal.conf

I missed the file. It is surely the best place.

> RelationBuildDesc() calls RelationInitPhysicalAddr(), so RelationBuildDesc()
> can stop calling RelFileNodeSkippingWAL().  The attached version makes it so,
> and I plan to push it.

Sure. relNode is filled with zeros so RelationInitPhysicalAddr() works
correctly for RelationBuildDesc().

Both changes look good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Global snapshots

2020-09-08 Thread Masahiko Sawada

On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
 wrote:
>
> On 2020-09-08 14:48, Fujii Masao wrote:
> > On 2020/09/08 19:36, Alexey Kondratov wrote:
> >> On 2020-09-08 05:49, Fujii Masao wrote:
> >>> On 2020/09/05 3:31, Alexey Kondratov wrote:
> 
>  Attached is a patch, which implements a plain 2PC in the
>  postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it
>  solves these errors handling issues above and tries to add proper
>  comments everywhere. I think, that 0003 should be rebased on the top
>  of it, or it could be a first patch in the set, since it may be used
>  independently. What do you think?
> >>>
> >>> Thanks for the patch!
> >>>
> >>> Sawada-san was proposing another 2PC patch at [1]. Do you have any
> >>> thoughts
> >>> about pros and cons between your patch and Sawada-san's?
> >>>
> >>> [1]
> >>> https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com
> >>
> >> Thank you for the link!
> >>
> >> After a quick look on the Sawada-san's patch set I think that there
> >> are two major differences:
> >
> > Thanks for sharing your thought! As far as I read your patch quickly,
> > I basically agree with your this view.
> >
> >
> >>
> >> 1. There is a built-in foreign xacts resolver in the [1], which should
> >> be much more convenient from the end-user perspective. It involves
> >> huge in-core changes and additional complexity that is of course worth
> >> of.
> >>
> >> However, it's still not clear for me that it is possible to resolve
> >> all foreign prepared xacts on the Postgres' own side with a 100%
> >> guarantee. Imagine a situation when the coordinator node is actually a
> >> HA cluster group (primary + sync + async replica) and it failed just
> >> after PREPARE stage of after local COMMIT. In that case all foreign
> >> xacts will be left in the prepared state. After failover process
> >> complete synchronous replica will become a new primary. Would it have
> >> all required info to properly resolve orphan prepared xacts?
> >
> > IIUC, yes, the information required for automatic resolution is
> > WAL-logged and the standby tries to resolve those orphan transactions
> > from WAL after the failover. But Sawada-san's patch provides
> > the special function for manual resolution, so there may be some cases
> > where manual resolution is necessary.
> >
>
> I've found a note about manual resolution in the v25 0002:
>
> +After that we prepare all foreign transactions by calling
> +PrepareForeignTransaction() API. If we failed on any of them we change
> to
> +rollback, therefore at this time some participants might be prepared
> whereas
> +some are not prepared. The former foreign transactions need to be
> resolved
> +using pg_resolve_foreign_xact() manually and the latter ends
> transaction
> +in one-phase by calling RollbackForeignTransaction() API.
>
> but it's not yet clear for me.

Sorry, the above description in README is out of date. In the v25
patch, it's true that if a backend fails to prepare a transaction on a
foreign server, it’s possible that some foreign transactions are
prepared whereas others are not. But at the end of the transaction
after changing to rollback, the process does rollback (or rollback
prepared) all of them. So the use case of pg_resolve_foreign_xact() is
to resolve orphaned foreign prepared transactions or to resolve a
foreign transaction that is not resolved for some reasons, bugs etc.

>
> >
> > Implementing 2PC feature only inside postgres_fdw seems to cause
> > another issue; COMMIT PREPARED is issued to the remote servers
> > after marking the local transaction as committed
> > (i.e., ProcArrayEndTransaction()).
> >
>
> According to the Sawada-san's v25 0002 the logic is pretty much the same
> there:
>
> +2. Pre-Commit phase (1st phase of two-phase commit)
>
> +3. Commit locally
> +Once we've prepared all of them, commit the transaction locally.
>
> +4. Post-Commit Phase (2nd phase of two-phase commit)
>
> Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
> FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction()
> in the CommitTransaction(). Thus, I don't see many difference between
> these approach and CallXactCallbacks() usage regarding this point.
>
> > Is this safe? This issue happens
> > because COMMIT PREPARED is issued via
> > CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
> > is called after ProcArrayEndTransaction().
> >
>
> Once the transaction is committed locally any ERROR (or higher level
> message) will be escalated to PANIC.

I think this is true only inside the critical section and it's not
necessarily true for all errors happening after the local commit,
right?

> And I do see possible ERROR level
> messages in the postgresCommitForeignTransaction() for example:
>
> +   if (PQresultStatus(res) != PGRES_COMMAND_OK)
> +   ereport(ERROR, (errmsg("could not commit transaction on 
> server %

Re: Inconsistency in determining the timestamp of the db statfile.





On 2020/09/09 12:04, Amit Kapila wrote:

On Tue, Sep 8, 2020 at 7:03 PM Magnus Hagander  wrote:


On Tue, Sep 8, 2020 at 3:11 PM Fujii Masao  wrote:




On 2020/09/08 19:28, Magnus Hagander wrote:



On Tue, Sep 8, 2020 at 8:10 AM Amit Kapila mailto:amit.kapil...@gmail.com>> wrote:

 We use the timestamp of the global statfile if we are not able to
 determine it for a particular database either because the entry for
 that database doesn't exist or there is an error while reading the
 specific database entry. This was not taken care of while reading
 other entries like ArchiverStats or SLRUStats. See
 pgstat_read_db_statsfile_timestamp. I have observed this while
 reviewing Sawada-San's patch related to logical replication stats [1].

 I think this can only happen if due to some reason the statfile got
 corrupt or we
 have some bug in stats writing code, the chances of both are rare and even
 if that happens we will use stale statistics.

 The attached patch by Sawada-San will fix this issue. As the chances of 
this
 the problem is rare and nobody has reported any issue related to this,
 so it might be okay not to backpatch this.

 Thoughts?


Why are we reading the archiver statis and and slru stats in 
"pgstat_read_db_statsfile_timestamp" in the first place?


Maybe because they are written before database stats entries? That is,
to read the target database stats entry and get the timestamp from it,
we need to read (or lseek) all the global stats entries written before them.



Oh meh. Yeah, I'm reading this thing completely wrong :/ Ignore my comments :)




That seems well out of scope for that function.

If nothing else the comment at the top of that function is out of touch with 
reality. We do seem to read it into local buffers and then ignore the contents. 
I guess we're doing it just to verify that it's not corrupt -- so perhaps the 
function should actually have a different name than it does now, since it 
certainly seems to do more than actually read the statsfile timestamp.

Specifically in this patch it looks wrong -- in the case of say the SLRU stats 
being corrupt, we will now return the timestamp of the global stats file even 
if there is one existing for the database requested, which definitely breaks 
the contract of the function.


Yes.
We should return false when fread() for database entry fails, instead? That is,

1. If corrupted stats file is found, the function always returns false.
2. If the file is not currupted and the target database entry is found, its 
timestamp is returned.
3. If the file is not corrupted and the target is NOT found, the timestamp of 
global entry is returned



Yeah, with more coffee and re-reading it, I'm not sure how we could have the 
database stats being OK if the archiver or slru stats are not.

That said, I think it still makes sense to return false if the stats file is 
corrupt. How much can we trust the first block, if the block right after it is 
corrupt? So yeah, I think your described order seems correct. But that's also 
what the code already did before this patch, isn't it?



No, before patch as well, if we can't read the DB entry say because
the file is corrupt, we return true and use timestamp of global stats
file and this is what is established by the original commit 187492b6.
So, if we consider that as correct then the functionality is something
like once we have read the timestamp of the global statfile, we use
that if we can't read the actual db entry for whatever reason. It
seems if we return false, the caller will call this function again in
the loop. Now, I see the point that if we can't read some part of the
file we should return false instead but not sure if that is helpful
from the perspective of the caller of
pgstat_read_db_statsfile_timestamp.


When false is returned, the caller backend_read_statsfile() seems to
request the stats collector to write a fresh stats data into the file,
and then pgstat_read_db_statsfile_timestamp() can try to read the fresh
file that may not be corrupted. So returning false in that case seems ok
to me...

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Redundant tuple copy in tqueueReceiveSlot()

2020-09-08 Thread Amit Khandekar

Hi,

I am starting a new thread that continues with the following point
that was discussed in [1] 

On Fri, 17 Jul 2020 at 09:05, Thomas Munro  wrote:
>
> On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty
>  wrote:
> > Do you mean that we should have an implementation for
> > get_minimal_tuple() for the heap AM and have it return a pointer to the
> > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> > tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> > to extract the minimal tuple was allocated in the tuple queue in the
> > first place? If we consider that the node directly below a gather is a
> > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
> > ss_ScanTupleSlot to point to memory in the shared tuple queue?
> > Similarly, other ExecInit*() methods can do the same for other executor
> > nodes that involve parallelism? Of course, things would be slightly
> > different for
> > the other use cases you mentioned (such as hash table population)
>
> What I mean is that where ExecHashTableInsert() and
> tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a
> freshly allocated copy, and then you copy that again, and free it.
> There may be something similar going on in tuplestore and sort code.
> Perhaps we could have something like
> ExecFetchSlotMinimalTupleInPlace(slot, output_buffer,
> output_buffer_size) that returns a value that indicates either success
> or hey-that-buffer's-too-small-I-need-N-bytes, or something like that.
> That silly extra copy is something Andres pointed out to me in some
> perf results involving TPCH hash joins, a couple of years ago.

I went ahead and tried doing this. I chose an approach where we can
return the pointer to the in-place minimal tuple data if it's a
heap/buffer/minimal tuple slot. A new function
ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
If it's neither heap, buffer or minimal tuple, it returns a copy as
usual. The receiver should not assume the data is directly taken from
MinimalTupleData, so it should set it's t_len to the number of bytes
returned. Patch attached
(0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

Thomas, I guess you had a different approach in mind when you said
about "returning either success or
hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
me is that avoiding the copy shows consistent improvement of 4 to 10%
for simple parallel table scans. I tried my patch on both x86_64 and
arm64, and observed this speedup on both.

create table tab as select generate_series(1, 2000) id,
'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
select pg_prewarm('tab'::regclass);
explain analyze select * from tab where id %2 = 0;
Times in milli-secs :
HEAD : 1833.119  1816.522  1875.648  1872.153  1834.383
Patch'ed : 1763.786  1721.160  1714.665  1719.738  1704.478
This was with the default 2 parallel workers. With 3 or 4 workers, for
the above testcase I didn't see a noticeable difference. I think, if I
double the number of rows, the difference will be noticeable. In any
case, the gain would go on reducing with the number of workers,
because the tuple copy also gets parallelized. In some scenarios,
parallel_leader_participation=off causes the difference to amplify.

Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch guc_for_testing.patch that I used for testing
the gain. This patch is only for testing. Without this, in order to
compare the performance figures it requires server restart, and the
figures anyway shift back and forth by 5-15 percent after each
restart, which creates lot of noise when comparing figures with and
without fix. Use this GUC enable_fix to enable/disable the fix.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
Huawei Technologies
From 5d19626e35e50a5630e5f1a042f7ecee6acb7c70 Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Wed, 9 Sep 2020 12:03:01 +0800
Subject: [PATCH 2/2] Add guc for ease of testing speedup.

This is only for testing the performance gain. Otherwise, to compare
the performance figures, it requires server restart, and the figures
anyway shift back and forth by 5-15 percent after each restart, which
creates lot of noise when comparing figures with and without fix.

With this, we can easily see at least 4-10% difference in execution
times by setting/unsetting the GUC enable_fix.
---
 src/backend/executor/tqueue.c | 12 
 src/backend/utils/misc/guc.c  | 12 
 2 files changed, 24 insertions(+)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index c70ee0f39a..7dd60b5c7e 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -45,6 +45,7 @@ struct TupleQueueReader
 	shm_mq_handle *queue;		/* shm_mq to receive from */
 };

+extern bool		enable_fix;
 /*
  *

Inconsistent Japanese name order in v13 contributors list

2020-09-08 Thread Etsuro Fujita

Hi,

Attached is a patch to standardize Japanese names as given-name-first
in the v13 contributors list as before.

Best regards,
Etsuro Fujita


Standardize-Japanese-names.patch
Description: Binary data

Re: New statistics for tuning WAL buffer size

2020-09-08 Thread Masahiro Ikeda


On 2020-09-07 16:19, Fujii Masao wrote:

On 2020/09/07 9:58, Masahiro Ikeda wrote:

Thanks for the review and advice!

On 2020-09-03 16:05, Fujii Masao wrote:

On 2020/09/02 18:56, Masahiro Ikeda wrote:

+/* --
+ * Backend types
+ * --

You seem to forget to add "*/" into the above comment.
This issue could cause the following compiler warning.
../../src/include/pgstat.h:761:1: warning: '/*' within block 
comment [-Wcomment]


Thanks for the comment. I fixed.


Thanks for the fix! But why are those comments necessary?


Sorry about that. This comment is not necessary.
I removed it.

The pg_stat_walwriter is not security restricted now, so ordinary 
users can access it.
It has the same security level as pg_stat_archiver. If you have any 
comments, please let me know.


+   dirty_writes bigint

I guess that the column name "dirty_writes" derived from
the DTrace probe name. Isn't this name confusing? We should
rename it to "wal_buffers_full" or something?


I agree and rename it to "wal_buffers_full".


+/* --
+ * PgStat_MsgWalWriter    Sent by the walwriter to update 
statistics.


This comment seems not accurate because backends also send it.

+/*
+ * WAL writes statistics counter is updated in XLogWrite function
+ */
+extern PgStat_MsgWalWriter WalWriterStats;

This comment seems not right because the counter is not updated in 
XLogWrite().


Right. I fixed it to "Sent by each backend and background workers to 
update WAL statistics."
In the future, other statistics will be included so I remove the 
function's name.




+-- There will surely and maximum one record
+select count(*) = 1 as ok from pg_stat_walwriter;

What about changing this comment to "There must be only one record"?


Thanks, I fixed.


+    WalWriterStats.m_xlog_dirty_writes++;
 LWLockRelease(WALWriteLock);

Since WalWriterStats.m_xlog_dirty_writes doesn't need to be protected
with WALWriteLock, isn't it better to increment that after releasing 
the lock?


Thanks, I fixed.


+CREATE VIEW pg_stat_walwriter AS
+    SELECT
+    pg_stat_get_xlog_dirty_writes() AS dirty_writes,
+    pg_stat_get_walwriter_stat_reset_time() AS stats_reset;
+
 CREATE VIEW pg_stat_progress_vacuum AS

In system_views.sql, the definition of pg_stat_walwriter should be
placed just after that of pg_stat_bgwriter not 
pg_stat_progress_analyze.


OK, I fixed it.


 }
-
 /*
  * We found an existing collector stats file. Read it and put 
all the


You seem to accidentally have removed the empty line here.


Sorry about that. I fixed it.

- errhint("Target must be \"archiver\" or 
\"bgwriter\".")));
+ errhint("Target must be \"archiver\" or 
\"bgwriter\" or

\"walwriter\".")));

There are two "or" in the message, but the former should be replaced 
with ","?


Thanks, I fixed.


On 2020-09-05 18:40, Magnus Hagander wrote:

On Fri, Sep 4, 2020 at 5:42 AM Fujii Masao
 wrote:


On 2020/09/04 11:50, tsunakawa.ta...@fujitsu.com wrote:

From: Fujii Masao 

I changed the view name from pg_stat_walwrites to

pg_stat_walwriter.

I think it is better to match naming scheme with other views

like

pg_stat_bgwriter,

which is for bgwriter statistics but it has the statistics

related to backend.


I prefer the view name pg_stat_walwriter for the consistency with
other view names. But we also have pg_stat_wal_receiver. Which
makes me think that maybe pg_stat_wal_writer is better for
the consistency. Thought? IMO either of them works for me.
I'd like to hear more opinons about this.


I think pg_stat_bgwriter is now a misnomer, because it contains

the backends' activity.  Likewise, pg_stat_walwriter leads to
misunderstanding because its information is not limited to WAL
writer.


How about simply pg_stat_wal?  In the future, we may want to

include WAL reads in this view, e.g. reading undo logs in zheap.

Sounds reasonable.


+1.

pg_stat_bgwriter has had the "wrong name" for quite some time now --
it became even more apparent when the checkpointer was split out to
it's own process, and that's not exactly a recent change. And it had
allocs in it from day one...

I think naming it for what the data in it is ("wal") rather than 
which

process deals with it ("walwriter") is correct, unless the statistics
can be known to only *ever* affect one type of process. (And then
different processes can affect different columns in the view). As a
general rule -- and that's from what I can tell exactly what's being
proposed.


Thanks for your comments. I agree with your opinions.
I changed the view name to "pg_stat_wal".


I fixed the code to send the WAL statistics from not only backend and 
walwriter

but also checkpointer, walsender and autovacuum worker.


Good point! Thanks for updating the patch!


@@ -604,6 +604,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams 
*params,

 onerel->rd_rel->relisshared,

Re: INSERT INTO SELECT, Why Parallelism is not selected?

On Thu, Jul 30, 2020 at 6:42 PM Amit Kapila  wrote:
>
> On Thu, Jul 30, 2020 at 12:02 PM Amit Kapila  wrote:
> >
> > On Wed, Jul 29, 2020 at 7:18 PM Robert Haas  wrote:
> > >
> > > I still don't agree with this as proposed.
> > >
> > > + * For now, we don't allow parallel inserts of any form not even where 
> > > the
> > > + * leader can perform the insert.  This restriction can be uplifted once
> > > + * we allow the planner to generate parallel plans for inserts.  We can
> > >
> > > If I'm understanding this correctly, this logic is completely
> > > backwards. We don't prohibit inserts here because we know the planner
> > > can't generate them. We prohibit inserts here because, if the planner
> > > somehow did generate them, it wouldn't be safe. You're saying that
> > > it's not allowed because we don't try to do it yet, but actually it's
> > > not allowed because we want to make sure that we don't accidentally
> > > try to do it. That's very different.
> > >
> >
> > Right, so how about something like: "To allow parallel inserts, we
> > need to ensure that they are safe to be performed in workers.  We have
> > the infrastructure to allow parallel inserts in general except for the
> > case where inserts generate a new commandid (eg. inserts into a table
> > having a foreign key column)."

Robert, Dilip, do you see any problem if we change the comment on the
above lines? Feel free to suggest if you have something better in
mind.

> >  We can extend this for tuple locking
> > if required as per the below discussion.  Kindly suggest if you prefer
> > a different wording here.
> >

I feel we can leave this based on the reasoning provided below.

> > >
> > > + * We should be able to parallelize
> > > + * the later case if we can ensure that no two parallel processes can 
> > > ever
> > > + * operate on the same page.
> > >
> > > I don't know whether this is talking about two processes operating on
> > > the same page at the same time, or ever within a single query
> > > execution. If it's the former, perhaps we need to explain why that's a
> > > concern for parallel query but not otherwise;
> > >
> >
> > I am talking about the former case and I know that as per current
> > design it is not possible that two worker processes try to operate on
> > the same page but I was trying to be pessimistic so that we can ensure
> > that via some form of Assert.
> >
>
> I think the two worker processes can operate on the same page for a
> parallel index scan case but it won't be for same tuple. I am not able
> to think of any case where we should be worried about tuple locking
> for Insert's case, so we can probably skip writing anything about it
> unless someone else can think of such a case.
>

-- 
With Regards,
Amit Kapila.

Re: Ideas about a better API for postgres_fdw remote estimates

On Wed, 9 Sep 2020 at 02:35, Tomas Vondra 
wrote

>
> I think that was the topic of *this* thread as started by Tom, but I now
> realize Andrey steered it in the direction to allow re-using remote
> stats. Which seems useful too, but it confused me a bit.
>

I didn't realize that the nearby thread I am mentioning is actually this
thread :). Sorry.


>
> >Using EXPLAIN to get costs from the foreign server isn't efficient. It
> >increases planning time a lot; sometimes planning time exceeds execution
> >time. If usage of foreign tables becomes more and more common, this isn't
> >ideal. I think we should move towards a model in which the optimizer can
> >decide whether a subtree involving a foreign server should be evaluated
> >locally or on the foreign server without the help of foreign server. One
> >way to do it (I am not saying that this is the only or the best way) is to
> >estimate the cost of foreign query locally based on the information
> >available locally about the foreign server and foreign table. This might
> >mean that we have to get that information from the foreign server and
> cache
> >it locally and use it several times, including the indexes on foreign
> >table, values of various costs etc. Though this approach doesn't solve all
> >of those problems it's one step forward + it makes the current scenario
> >also efficient.
> >
>
> True, but that ptoject is way more ambitious than providing a simple API
> for postgres_fdw to obtain the estimates more efficiently.
>

Doing all of that is a big project. But what this patch aims at is a small
subset which makes statistics collection efficient and automatic. So, just
for that, we should consider it.


>
> >I agree that the patch needs some work though, esp the code dealing with
> >serialization and deserialization of statistics.
>
> I think there's a bunch of open questions, e.g. what to do with extended
> statistics - for example what should happen when the extended statistics
> object is defined only on local/remote server, or when the definitions
> don't match? What should happen when the definitions don't match? This
> probably is not an issue for "regular" stats, because that seems pretty
> stable, but for extended stats there are differences between versions.


If it is defined on the foreign server but not the local server, there is
no need to fetch it from the foreign server. The other way round case is
tricky. We could mark the extended statistics object invalid if it's not
defined on the foreign server or the definition is different. We have to
document it that way. I think that should serve most of the cases.

-- 
Best Wishes,
Ashutosh

Re: VACUUM (INTERRUPTIBLE)?

Hi,

On 2020-09-08 19:30:40 -0700, Andres Freund wrote:
> It was fairly straight forward to implement the basics of
> INTERRUPTIBLE. However it seems like it might not actually address my
> main concern, i.e. making this reliably testable with isolation
> tester. At least not the way I envisioned it...
>
> My idea for the test was to have one isolationtester session start a
> seqscan, which would then reliably block a concurrent VACUUM (FREEZE,
> INTERRUPTIBLE). That'd then give a stable point to switch back to the
> first session, which would interrupt the VACUUM by doing a LOCK.
>
> But because there's no known waiting-for pid for a buffer pin wait, we
> currently don't detect that we're blocked :(.
>
>
> Now, it'd not be too hard to make pg_isolation_test_session_is_blocked
> also report being blocked when we're waiting for a buffer pin (ignoring
> interesting_pids), similar to the safe snapshot test.  However I'm
> worried that that could lead to "false" reports of blocking? But maybe
> that's a small enough risk because there's few unconditional cleanup
> lock acquisitions?
>
> Hacking such a wait condition test into
> pg_isolation_test_session_is_blocked() successfully allows my test to
> work for VACUUM.

Here's a patch series implementing that:

0001: Adds INTERRUPTIBLE to VACUUM ANALYZE
There's quite a few XXX's inside. And it needs some none
isolationtester test.
0002: Treat BufferPin as a waiting condition in isolationtest.
That's the aforementioned workaround.
0003: A test, finally.
But it only tests VACUUM, not yet ANALYZE. Perhaps also a test for
not allowing interruptions, somehow?

Clearly WIP, but good enough for some comments, I hope?

A few comments:
- Right now there can only be one such blocking task, because
  PROC_IS_INTERRUPTIBLE is only set by vacuum / analyze, and the lock
  levels are self exclusive. But it's generically named now, so it seems
  just a matter of time until somebody adds that to other commands. I
  think it's ok to not add support for ProcSleep() killing multiple
  other processes?
- It's a bit annoying that normal user backends just see a generic
  "ERROR: canceling statement due to user request". Do we need something
  better?
- The order in which VACUUM parameters are documented & implemented
  seems fairly random. Perhaps it'd make sense to reorder
  alphabetically?


> Not yet sure about what a suitable way to test this for analyze would
> be, unless we'd also accept VacuumDelay as a wait condition :(. I guess
> one fairly easy way would be to include an expression index, and have
> the expression index acquire an unavailable lock. Which'd allow to
> switch to another session.

But here I've not yet done anything. That just seems too ugly & fragile.

Greetings,

Andres Freund
>From 29d4c1bdf536222c1e86c7502bbbf3fd49751a90 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 8 Sep 2020 20:47:38 -0700
Subject: [PATCH v1 1/3] WIP: Add INTERRUPTIBLE option to VACUUM & ANALYZE.

Partially because that's practically useful, partially because it's
useful for testing.

There's a few XXXs in the code, and there's a comment or two I have
intentionally not yet rewrapped.
---
 src/include/commands/vacuum.h   |  1 +
 src/include/storage/lock.h  |  6 ++--
 src/include/storage/proc.h  |  5 +--
 src/backend/commands/vacuum.c   | 14 +---
 src/backend/postmaster/autovacuum.c |  2 ++
 src/backend/storage/lmgr/deadlock.c | 54 -
 src/backend/storage/lmgr/proc.c | 24 +++--
 doc/src/sgml/ref/analyze.sgml   | 16 +
 doc/src/sgml/ref/vacuum.sgml| 17 +
 9 files changed, 94 insertions(+), 45 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d9475c99890..7e064ef45e9 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -215,6 +215,7 @@ typedef struct VacuumParams
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 			 * whole table */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
+	bool		is_interruptible;	/* allow cancelling on lock conflict */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 	 * which  verbose logs are activated, -1
 	 * to use default */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 1c3e9c1999f..f21f9f3f974 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -499,8 +499,8 @@ typedef enum
 	DS_NO_DEADLOCK,/* no deadlock detected */
 	DS_SOFT_DEADLOCK,			/* deadlock avoided by queue rearrangement */
 	DS_HARD_DEADLOCK,			/* deadlock, no way out but ERROR */
-	DS_BLOCKED_BY_AUTOVACUUM	/* no deadlock; queue blocked by autovacuum
- * worker */
+	DS_BLOCKED_BY_INTERRUPTIBLE	/* no deadlock; queue blocked by interruptible
+ * task (e.g. autovacuum worker) */
 } DeadLockState;
 
 /*
@@ -588,7 +588,7 @@ extern void lock_twophase_standby_recover(

Re: v13: CLUSTER segv with wal_level=minimal and parallel index creation

2020-09-08 Thread Noah Misch

On Mon, Sep 07, 2020 at 07:47:09PM -0700, Noah Misch wrote:
> On Tue, Sep 08, 2020 at 10:43:32AM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 08 Sep 2020 09:13:53 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > At Mon, 7 Sep 2020 02:32:55 -0700, Noah Misch  wrote 
> > > in 
> > > > As a PoC, this looks promising.  Thanks.  Would you add a test case 
> > > > such that
> > > > the following demonstrates the bug in the absence of your PoC?
> > > > 
> > > >   printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
> > > > 'max_wal_senders = 0' >/tmp/minimal.conf
> > > >   make check TEMP_CONFIG=/tmp/minimal.conf
> > > 
> > > Mmm. I was close to add some tests to 018_wal_optimize.pl but your
> > > suggestion seems better.  I added several ines to create_index.sql.

After looking closer, I've moved the test to reindex_catalog.sql; see that
file's header comment for the reasons.  One now needs this command:

  printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
'max_wal_senders = 0' >/tmp/minimal.conf
  make check-tests TESTS=reindex_catalog TEMP_CONFIG=/tmp/minimal.conf

> > > > Please have the test try both a nailed-and-mapped relation and a 
> > > > "nailed, but
> > > > not mapped" relation.  I am fairly confident that your PoC fixes the 
> > > > former
> > > > case, but the latter may need additional code.
> > > 
> > > Mmm. You're right. I choosed pg_amproc_fam_proc_index as
> > > nailed-but-not-mapped index.
> > 
> > I fixed a typo (s/staring/starting/).
> 
> At a glance, this looks reasonable.  If a closer look doesn't reveal problems,
> I'll push this.

RelationBuildDesc() calls RelationInitPhysicalAddr(), so RelationBuildDesc()
can stop calling RelFileNodeSkippingWAL().  The attached version makes it so,
and I plan to push it.
Author: Noah Misch 
Commit: Noah Misch 

Fix rd_firstRelfilenodeSubid for nailed relations, in parallel workers.

Move applicable code out of RelationBuildDesc(), which nailed relations
bypass.  Non-assert builds experienced no known problems.  Back-patch to
v13, where commit c6b92041d38512a4176ed76ad06f713d2e6c01a8 introduced
rd_firstRelfilenodeSubid.

Kyotaro Horiguchi.  Reported by Justin Pryzby.

Discussion: https://postgr.es/m/20200907023737.ga7...@telsasoft.com

diff --git a/src/backend/utils/cache/relcache.c 
b/src/backend/utils/cache/relcache.c
index 96ecad0..9061af8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1243,14 +1243,6 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
if (insertIt)
RelationCacheInsert(relation, true);
 
-   /*
-* For RelationNeedsWAL() to answer correctly on parallel workers, 
restore
-* rd_firstRelfilenodeSubid.  No subtransactions start or end while in
-* parallel mode, so the specific SubTransactionId does not matter.
-*/
-   if (IsParallelWorker() && RelFileNodeSkippingWAL(relation->rd_node))
-   relation->rd_firstRelfilenodeSubid = TopSubTransactionId;
-
/* It's fully valid */
relation->rd_isvalid = true;
 
@@ -1273,6 +1265,8 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 static void
 RelationInitPhysicalAddr(Relation relation)
 {
+   Oid oldnode = relation->rd_node.relNode;
+
/* these relations kinds never have storage */
if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
return;
@@ -1330,6 +1324,19 @@ RelationInitPhysicalAddr(Relation relation)
elog(ERROR, "could not find relation mapping for 
relation \"%s\", OID %u",
 RelationGetRelationName(relation), 
relation->rd_id);
}
+
+   /*
+* For RelationNeedsWAL() to answer correctly on parallel workers, 
restore
+* rd_firstRelfilenodeSubid.  No subtransactions start or end while in
+* parallel mode, so the specific SubTransactionId does not matter.
+*/
+   if (IsParallelWorker() && oldnode != relation->rd_node.relNode)
+   {
+   if (RelFileNodeSkippingWAL(relation->rd_node))
+   relation->rd_firstRelfilenodeSubid = 
TopSubTransactionId;
+   else
+   relation->rd_firstRelfilenodeSubid = 
InvalidSubTransactionId;
+   }
 }
 
 /*
diff --git a/src/test/regress/expected/reindex_catalog.out 
b/src/test/regress/expected/reindex_catalog.out
index 4b5fba4..204f056 100644
--- a/src/test/regress/expected/reindex_catalog.out
+++ b/src/test/regress/expected/reindex_catalog.out
@@ -36,3 +36,13 @@ REINDEX INDEX pg_index_indexrelid_index; -- non-mapped, 
non-shared, critical
 REINDEX INDEX pg_index_indrelid_index; -- non-mapped, non-shared, non-critical
 REINDEX INDEX pg_database_oid_index; -- mapped, shared, critical
 REINDEX INDEX pg_shdescription_o_c_index; -- mapped, shared, non-critical
+-- Check the same REINDEX INDEX statements under parallelism.
+BEGIN;

Re: Two fsync related performance issues?

2020-09-08 Thread Thomas Munro

On Thu, Sep 3, 2020 at 11:30 AM Thomas Munro  wrote:
> On Wed, May 27, 2020 at 12:31 AM Craig Ringer  wrote:
> > On Tue, 12 May 2020, 08:42 Paul Guo,  wrote:
> >> 1. StartupXLOG() does fsync on the whole data directory early in the crash 
> >> recovery. I'm wondering if we could skip some directories (at least the 
> >> pg_log/, table directories) since wal, etc could ensure consistency. Here 
> >> is the related code.
> >>
> >>   if (ControlFile->state != DB_SHUTDOWNED &&
> >>   ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
> >>   {
> >>   RemoveTempXlogFiles();
> >>   SyncDataDirectory();
> >>   }

> 4.  In datadir_fsync_fname(), if ParseRelationPath() is able to
> recognise a file as being a relation file, build a FileTag and call
> RegisterSyncRequest() to tell the checkpointer to sync this file as
> part of the end checkpoint (currently the end-of-recovery checkpoint,
> but that could also be relaxed).

For the record, Andres Freund mentioned a few problems with this
off-list and suggested we consider calling Linux syncfs() for each top
level directory that could potentially be on a different filesystem.
That seems like a nice idea to look into.

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Mon, Sep 7, 2020 at 12:57 PM Dilip Kumar  wrote:
>
> On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila  wrote:
>>
>> On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar  wrote:
>> >
>> >
>> > I have reviewed the changes and looks fine to me.
>> >
>>
>> Thanks, I have pushed the last patch. Let's wait for a day or so to
>> see the buildfarm reports and then we can probably close this CF
>> entry.
>
>
> Thanks.
>

I have updated the status of CF entry as committed now.

-- 
With Regards,
Amit Kapila.

Re: Inconsistency in determining the timestamp of the db statfile.

On Tue, Sep 8, 2020 at 7:03 PM Magnus Hagander  wrote:
>
> On Tue, Sep 8, 2020 at 3:11 PM Fujii Masao  
> wrote:
>>
>>
>>
>> On 2020/09/08 19:28, Magnus Hagander wrote:
>> >
>> >
>> > On Tue, Sep 8, 2020 at 8:10 AM Amit Kapila > > > wrote:
>> >
>> > We use the timestamp of the global statfile if we are not able to
>> > determine it for a particular database either because the entry for
>> > that database doesn't exist or there is an error while reading the
>> > specific database entry. This was not taken care of while reading
>> > other entries like ArchiverStats or SLRUStats. See
>> > pgstat_read_db_statsfile_timestamp. I have observed this while
>> > reviewing Sawada-San's patch related to logical replication stats [1].
>> >
>> > I think this can only happen if due to some reason the statfile got
>> > corrupt or we
>> > have some bug in stats writing code, the chances of both are rare and 
>> > even
>> > if that happens we will use stale statistics.
>> >
>> > The attached patch by Sawada-San will fix this issue. As the chances 
>> > of this
>> > the problem is rare and nobody has reported any issue related to this,
>> > so it might be okay not to backpatch this.
>> >
>> > Thoughts?
>> >
>> >
>> > Why are we reading the archiver statis and and slru stats in 
>> > "pgstat_read_db_statsfile_timestamp" in the first place?
>>
>> Maybe because they are written before database stats entries? That is,
>> to read the target database stats entry and get the timestamp from it,
>> we need to read (or lseek) all the global stats entries written before them.
>>
>
> Oh meh. Yeah, I'm reading this thing completely wrong :/ Ignore my comments :)
>
>
>
>> > That seems well out of scope for that function.
>> >
>> > If nothing else the comment at the top of that function is out of touch 
>> > with reality. We do seem to read it into local buffers and then ignore the 
>> > contents. I guess we're doing it just to verify that it's not corrupt -- 
>> > so perhaps the function should actually have a different name than it does 
>> > now, since it certainly seems to do more than actually read the statsfile 
>> > timestamp.
>> >
>> > Specifically in this patch it looks wrong -- in the case of say the SLRU 
>> > stats being corrupt, we will now return the timestamp of the global stats 
>> > file even if there is one existing for the database requested, which 
>> > definitely breaks the contract of the function.
>>
>> Yes.
>> We should return false when fread() for database entry fails, instead? That 
>> is,
>>
>> 1. If corrupted stats file is found, the function always returns false.
>> 2. If the file is not currupted and the target database entry is found, its 
>> timestamp is returned.
>> 3. If the file is not corrupted and the target is NOT found, the timestamp 
>> of global entry is returned
>
>
> Yeah, with more coffee and re-reading it, I'm not sure how we could have the 
> database stats being OK if the archiver or slru stats are not.
>
> That said, I think it still makes sense to return false if the stats file is 
> corrupt. How much can we trust the first block, if the block right after it 
> is corrupt? So yeah, I think your described order seems correct. But that's 
> also what the code already did before this patch, isn't it?
>

No, before patch as well, if we can't read the DB entry say because
the file is corrupt, we return true and use timestamp of global stats
file and this is what is established by the original commit 187492b6.
So, if we consider that as correct then the functionality is something
like once we have read the timestamp of the global statfile, we use
that if we can't read the actual db entry for whatever reason. It
seems if we return false, the caller will call this function again in
the loop. Now, I see the point that if we can't read some part of the
file we should return false instead but not sure if that is helpful
from the perspective of the caller of
pgstat_read_db_statsfile_timestamp.

I have included Alvaro as he is a committer for 187492b6, so he might
remember something and let us know if this is a mistake or there is
some reason for doing so (return true even when the db entry we are
trying to read is corrupt).

-- 
With Regards,
Amit Kapila.

Re: VACUUM (INTERRUPTIBLE)?

Hi,

On 2020-09-08 18:37:05 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > So it doesn't seem that useful to not make manual analyze interruptible?
> 
> I find the main attraction of this idea is that instead of saying
> "autovacuum/autoanalyze have these magic behaviors", we can say
> "autovacuum/autoanalyze use option FOO".  So whatever behavior
> autoanalyze has today, we should make available to manual analyze,
> without quibbling too hard over how much of a use-case there is.

Yea, I think that's a fair point. It's basically more work to
distinguish the two anyway.

It was fairly straight forward to implement the basics of
INTERRUPTIBLE. However it seems like it might not actually address my
main concern, i.e. making this reliably testable with isolation
tester. At least not the way I envisioned it...

My idea for the test was to have one isolationtester session start a
seqscan, which would then reliably block a concurrent VACUUM (FREEZE,
INTERRUPTIBLE). That'd then give a stable point to switch back to the
first session, which would interrupt the VACUUM by doing a LOCK.

But because there's no known waiting-for pid for a buffer pin wait, we
currently don't detect that we're blocked :(.

Now, it'd not be too hard to make pg_isolation_test_session_is_blocked
also report being blocked when we're waiting for a buffer pin (ignoring
interesting_pids), similar to the safe snapshot test.  However I'm
worried that that could lead to "false" reports of blocking? But maybe
that's a small enough risk because there's few unconditional cleanup
lock acquisitions?

Hacking such a wait condition test into
pg_isolation_test_session_is_blocked() successfully allows my test to
work for VACUUM.

Not yet sure about what a suitable way to test this for analyze would
be, unless we'd also accept VacuumDelay as a wait condition :(. I guess
one fairly easy way would be to include an expression index, and have
the expression index acquire an unavailable lock. Which'd allow to
switch to another session.

Greetings,

Andres Freund

Re: Implementing Incremental View Maintenance

2020-09-08 Thread Thomas Munro

On Wed, Sep 9, 2020 at 12:29 PM Yugo NAGATA  wrote:
> I also thought it might be resolved using tuple locks and EvalPlanQual
> instead of table level lock, but there is still a unavoidable case.  For
> example, suppose that tuple dR is inserted into R in T1, and dS is inserted
> into S in T2. Also, suppose that dR and dS will be joined in according to
> the view definition. In this situation, without any lock, the change of V is
> computed as dV=dR*S in T1, dV=R*dS in T2, respectively, and dR*dS would not
> be included in the results.  This causes inconsistency. I don't think this
> could be resolved even if we use tuple locks.

I see.  Thanks for the explanation!

> As to aggregate view without join , however, we might be able to use a lock
> of more low granularity as you said, because if rows belonging a group in a
> table is changes, we just update (or delete) corresponding rows in the view.
> Even if there are concurrent transactions updating the same table, we would
> be able to make one of them wait using tuple lock. If concurrent transactions
> are trying to insert a tuple into the same table, we might need to use unique
> index and UPSERT to avoid to insert multiple rows with same group key into
> the view.
>
> Therefore, usual update semantics (tuple locks and EvalPlanQual) and UPSERT
> can be used for optimization for some classes of view, but we don't have any
> other better idea than using table lock for views joining tables.  We would
> appreciate it if you could suggest better solution.

I have nothing, I'm just reading starter papers and trying to learn a
bit more about the concepts at this stage.  I was thinking of
reviewing some of the more mechanical parts of the patch set, though,
like perhaps the transition table lifetime management, since I have
worked on that area before.

> > (Newer papers describe locking schemes that avoid even aggregate-row
> > level conflicts, by taking advantage of the associativity and
> > commutativity of aggregates like SUM and COUNT.  You can allow N
> > writers to update the aggregate concurrently, and if any transaction
> > has to roll back it subtracts what it added, not necessarily restoring
> > the original value, so that nobody conflicts with anyone else, or
> > something like that...  Contemplating an MVCC, no-rollbacks version of
> > that sort of thing leads to ideas like, I dunno, update chains
> > containing differential update trees to be compacted later... egad!)
>
> I am interested in papers you mentioned!  Are they literatures in context of
> incremental view maintenance?

Yeah.  I was skim-reading some parts of [1] including section 2.5.1
"Concurrency Control", which opens with some comments about
aggregates, locking and pointers to "V-locking" [2] for high
concurrency aggregates.  There is also a pointer to G. Graefe and M.
J. Zwilling, "Transaction support for indexed views," which I haven't
located; apparently indexed views are Graefe's name for MVs, and
apparently this paper has a section on MVCC systems which sounds
interesting for us.

[1] https://dsf.berkeley.edu/cs286/papers/mv-fntdb2012.pdf
[2] http://pages.cs.wisc.edu/~gangluo/latch.pdf

Re: file_fdw vs relative paths

2020-09-08 Thread Ian Barwick


Hi

On 2020/09/07 2:31, Magnus Hagander wrote:

On Mon, Aug 31, 2020 at 5:03 PM Bruce Momjian mailto:br...@momjian.us>> wrote:

On Mon, Aug 31, 2020 at 01:16:05PM +0200, Magnus Hagander wrote:
 >     Bruce, I've applied and backpatched your docs patch for this.
 >
 > Gah, and of course right after doing that, I remembered I wanted to get a
 > second change in :) To solve the "who's this Josh" question, I suggest 
we also
 > change the example to point to the data/log directory which is likely to 
exist
 > in a lot more of the cases. I keep getting people who ask "who is josh" 
based
 > on the /home/josh path. Not that it's that important, but...

Thanks, and agreed.


Thanks, applied. I backpacked to 13 but didn't bother with the rest as it's not 
technically *wrong* before..


It's missing the leading single quote from the filename parameter:

diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml
(...)
-OPTIONS ( filename '/home/josh/data/log/pglog.csv', format 'csv' );
+OPTIONS ( filename log/pglog.csv', format 'csv' );
(...)


Regards


Ian Barwick


--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: default partition and concurrent attach partition

2020-09-08 Thread Amit Langote

On Wed, Sep 9, 2020 at 7:41 AM Alvaro Herrera  wrote:
> On 2020-Sep-08, Amit Langote wrote:
>
> > Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> > TupleDesc matches the partition's.  Actually we do have such dedicated
> > slots for partitions around (for both sub-partitioned and leaf
> > partitions), so we can simply use them instead of creating one from
> > scratch for every use.  It did take some code rearrangement to
> > actually make that work though.
>
> Pushed, thanks for working on this fix.

Thanks.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: Implementing Incremental View Maintenance

2020-09-08 Thread Yugo NAGATA

Hi Thomas,

Thank you for your comment!

On Sat, 5 Sep 2020 17:56:18 +1200
Thomas Munro  wrote:
> +   /*
> +* Wait for concurrent transactions which update this materialized view at
> +* READ COMMITED. This is needed to see changes committed in other
> +* transactions. No wait and raise an error at REPEATABLE READ or
> +* SERIALIZABLE to prevent update anomalies of matviews.
> +* XXX: dead-lock is possible here.
> +*/
> +   if (!IsolationUsesXactSnapshot())
> +   LockRelationOid(matviewOid, ExclusiveLock);
> +   else if (!ConditionalLockRelationOid(matviewOid, ExclusiveLock))
> 
> Could you please say a bit more about your plans for concurrency control?
> 
> Simple hand-crafted "rollup" triggers typically conflict only when
> modifying the same output rows due to update/insert conflicts, or
> perhaps some explicit row level locking if they're doing something
> complex (unfortunately, they also very often have concurrency
> bugs...).  In some initial reading about MV maintenance I did today in
> the hope of understanding some more context for this very impressive
> but rather intimidating patch set, I gained the impression that
> aggregate-row locking granularity is assumed as a baseline for eager
> incremental aggregate maintenance.  I understand that our
> MVCC/snapshot scheme introduces extra problems, but I'm wondering if
> these problems can be solved using the usual update semantics (the
> EvalPlanQual mechanism), and perhaps also some UPSERT logic.  Why is
> it not sufficient to have locked all the base table rows that you have
> modified, captured the before-and-after values generated by those
> updates, and also locked all the IMV aggregate rows you will read, and
> in the process acquired a view of the latest committed state of the
> IMV aggregate rows you will modify (possibly having waited first)?  In
> other words, what other data do you look at, while computing the
> incremental update, that might suffer from anomalies because of
> snapshots and concurrency?  For one thing, I am aware that unique
> indexes for groups would probably be necessary; perhaps some subtle
> problems of the sort usually solved with predicate locks lurk there?

I decided to lock a matview considering views joining tables.  
For example, let V = R*S is an incrementally maintainable materialized 
view which joins tables R and S.  Suppose there are two concurrent 
transactions T1 which changes table R to R' and T2 which changes S to S'.  
Without any lock,  in READ COMMITTED mode, V would be updated to
represent  V=R'*S in T1, and V=R*S' in T2, so it would cause inconsistency.  
By locking the view V, transactions T1, T2 are processed serially and this 
inconsistency can be avoided.

I also thought it might be resolved using tuple locks and EvalPlanQual
instead of table level lock, but there is still a unavoidable case.  For
example, suppose that tuple dR is inserted into R in T1, and dS is inserted
into S in T2. Also, suppose that dR and dS will be joined in according to
the view definition. In this situation, without any lock, the change of V is
computed as dV=dR*S in T1, dV=R*dS in T2, respectively, and dR*dS would not
be included in the results.  This causes inconsistency. I don't think this
could be resolved even if we use tuple locks.

As to aggregate view without join , however, we might be able to use a lock
of more low granularity as you said, because if rows belonging a group in a
table is changes, we just update (or delete) corresponding rows in the view.  
Even if there are concurrent transactions updating the same table, we would
be able to make one of them wait using tuple lock. If concurrent transactions
are trying to insert a tuple into the same table, we might need to use unique
index and UPSERT to avoid to insert multiple rows with same group key into
the view.

Therefore, usual update semantics (tuple locks and EvalPlanQual) and UPSERT
can be used for optimization for some classes of view, but we don't have any
other better idea than using table lock for views joining tables.  We would
appreciate it if you could suggest better solution. 

> (Newer papers describe locking schemes that avoid even aggregate-row
> level conflicts, by taking advantage of the associativity and
> commutativity of aggregates like SUM and COUNT.  You can allow N
> writers to update the aggregate concurrently, and if any transaction
> has to roll back it subtracts what it added, not necessarily restoring
> the original value, so that nobody conflicts with anyone else, or
> something like that...  Contemplating an MVCC, no-rollbacks version of
> that sort of thing leads to ideas like, I dunno, update chains
> containing differential update trees to be compacted later... egad!)

I am interested in papers you mentioned!  Are they literatures in context of
incremental view maintenance?

Regards,
Yugo Nagata

-- 
Yugo NAGATA

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-08 Thread Justin Pryzby

On Tue, Sep 08, 2020 at 09:02:38PM -0300, Alvaro Herrera wrote:
> On 2020-Sep-08, Justin Pryzby wrote:
> 
> > From 992e0121925c74d5c5a4e5b132cddb3d6b31da86 Mon Sep 17 00:00:00 2001
> > From: Justin Pryzby 
> > Date: Fri, 27 Mar 2020 17:50:46 -0500
> > Subject: [PATCH v27 1/5] Change REINDEX/CLUSTER to accept an option list..
> > 
> > ..like EXPLAIN (..), VACUUM (..), and ANALYZE (..).
> > 
> > Change docs in the style of VACUUM.  See also: 
> > 52dcfda48778d16683c64ca4372299a099a15b96
> 
> I don't understand why you change all options to DefElem instead of
> keeping the bitmask for those options that can use it.

That's originally how I did it, too.

Initially I added List *params, and Michael suggested to retire
ReindexStmt->concurrent.  I provided a patch to do so, initially by leaving int
options and then, after this, removing it to "complete the thought", and get
rid of the remnants of the "old way" of doing it.  This is also how vacuum and
explain are done.
https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com

-- 
Justin

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

On 2020-Sep-08, Justin Pryzby wrote:

> From 992e0121925c74d5c5a4e5b132cddb3d6b31da86 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Fri, 27 Mar 2020 17:50:46 -0500
> Subject: [PATCH v27 1/5] Change REINDEX/CLUSTER to accept an option list..
> 
> ..like EXPLAIN (..), VACUUM (..), and ANALYZE (..).
> 
> Change docs in the style of VACUUM.  See also: 
> 52dcfda48778d16683c64ca4372299a099a15b96

I don't understand why you change all options to DefElem instead of
keeping the bitmask for those options that can use it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-08 Thread Justin Pryzby

On Thu, Sep 03, 2020 at 09:43:51PM -0500, Justin Pryzby wrote:
> And rebased on Michael's commit removing ReindexStmt->concurrent.

Rebased on a6642b3ae: Add support for partitioned tables and indexes in REINDEX

So now this includes the new functionality and test for reindexing a
partitioned table onto a new tablespace.  That part could use some additional
review.

I guess this patch series will also conflict with the CLUSTER part of this
other one.  Once its CLUSTER patch is commited, this patch should to be updated
to test clustering a partitioned table to a new tbspc.
https://commitfest.postgresql.org/29/2584/
REINDEX/CIC/CLUSTER of partitioned tables

-- 
Justin
>From 992e0121925c74d5c5a4e5b132cddb3d6b31da86 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 27 Mar 2020 17:50:46 -0500
Subject: [PATCH v27 1/5] Change REINDEX/CLUSTER to accept an option list..

..like EXPLAIN (..), VACUUM (..), and ANALYZE (..).

Change docs in the style of VACUUM.  See also: 52dcfda48778d16683c64ca4372299a099a15b96
---
 doc/src/sgml/ref/cluster.sgml  | 27 ++---
 doc/src/sgml/ref/reindex.sgml  | 43 ++-
 src/backend/commands/cluster.c | 28 --
 src/backend/nodes/copyfuncs.c  |  4 +--
 src/backend/nodes/equalfuncs.c |  4 +--
 src/backend/parser/gram.y  | 54 +++---
 src/backend/tcop/utility.c | 38 ++--
 src/bin/psql/tab-complete.c| 22 ++
 src/include/commands/cluster.h |  3 +-
 src/include/nodes/parsenodes.h |  4 +--
 10 files changed, 167 insertions(+), 60 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 4da60d8d56..e6ebce27e6 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -21,8 +21,13 @@ PostgreSQL documentation
 
  
 
-CLUSTER [VERBOSE] table_name [ USING index_name ]
-CLUSTER [VERBOSE]
+CLUSTER [VERBOSE] [ ( option [, ...] ) ] table_name [ USING index_name ]
+CLUSTER [VERBOSE] [ ( option [, ...] ) ]
+
+where option can be one of:
+
+VERBOSE [ boolean ]
+
 
  
 
@@ -81,6 +86,15 @@ CLUSTER [VERBOSE]
   Parameters
 
   
+   
+VERBOSE
+
+ 
+  Prints a progress report as each table is clustered.
+ 
+
+   
+

 table_name
 
@@ -100,10 +114,15 @@ CLUSTER [VERBOSE]

 

-VERBOSE
+boolean
 
  
-  Prints a progress report as each table is clustered.
+  Specifies whether the selected option should be turned on or off.
+  You can write TRUE, ON, or
+  1 to enable the option, and FALSE,
+  OFF, or 0 to disable it.  The
+  boolean value can also
+  be omitted, in which case TRUE is assumed.
  
 

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 33af4ae02a..593b38a7ee 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -25,7 +25,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 where option can be one of:
 
-VERBOSE
+VERBOSE [ boolean ]
 
  
 
@@ -145,19 +145,6 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
-   
-name
-
- 
-  The name of the specific index, table, or database to be
-  reindexed.  Index and table names can be schema-qualified.
-  Presently, REINDEX DATABASE and REINDEX SYSTEM
-  can only reindex the current database, so their parameter must match
-  the current database's name.
- 
-
-   
-

 CONCURRENTLY
 
@@ -185,6 +172,34 @@ REINDEX [ ( option [, ...] ) ] { IN
  
 

+
+   
+name
+
+ 
+  The name of the specific index, table, or database to be
+  reindexed.  Index and table names can be schema-qualified.
+  Presently, REINDEX DATABASE and REINDEX SYSTEM
+  can only reindex the current database, so their parameter must match
+  the current database's name.
+ 
+
+   
+
+   
+boolean
+
+ 
+  Specifies whether the selected option should be turned on or off.
+  You can write TRUE, ON, or
+  1 to enable the option, and FALSE,
+  OFF, or 0 to disable it.  The
+  boolean value can also
+  be omitted, in which case TRUE is assumed.
+ 
+
+   
+
   
  
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 0d647e912c..4a3678831d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -35,6 +35,7 @@
 #include "catalog/pg_am.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
+#include "commands/defrem.h"
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
@@ -102,8 +103,29 @@ static List *get_tables_to_cluster(MemoryContext cluster_context);
  *---
  */
 void
-cluster(ClusterStmt *stmt, bool isTopLevel)
+cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
 {
+	ListCell	*lc;
+	int			options = 0;
+
+	/* Parse list of

Re: WIP: WAL prefetch (another approach)

2020-09-08 Thread Tomas Vondra

On Sat, Sep 05, 2020 at 12:05:52PM +1200, Thomas Munro wrote:

On Wed, Sep 2, 2020 at 2:18 AM Tomas Vondra
 wrote:

On Wed, Sep 02, 2020 at 02:05:10AM +1200, Thomas Munro wrote:
>On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra
> wrote:
>> from the archive
>
>Ahh, so perhaps that's the key.

Maybe. For the record, the commands look like this:

archive_command = 'gzip -1 -c %p > /mnt/raid/wal-archive/%f.gz'

restore_command = 'gunzip -c /mnt/raid/wal-archive/%f.gz > %p.tmp && mv %p.tmp 
%p'

Yeah, sorry, I goofed here by not considering archive recovery
properly.  I have special handling for crash recovery from files in
pg_wal (XLRO_END, means read until you run out of files) and streaming
replication (XLRO_WALRCV_WRITTEN, means read only as far as the wal
receiver has advertised as written in shared memory), as a way to
control the ultimate limit on how far ahead to read when
maintenance_io_concurrency and max_recovery_prefetch_distance don't
limit you first.  But if you recover from a base backup with a WAL
archive, it uses the XLRO_END policy which can run out of files just
because a new file hasn't been restored yet, so it gives up
prefetching too soon, as you're seeing.  That doesn't cause any
damage, but it stops doing anything useful because the prefetcher
thinks its job is finished.

It'd be possible to fix this somehow in the two-XLogReader design, but
since I'm testing a new version that has a unified
XLogReader-with-read-ahead I'm not going to try to do that.  I've
added a basebackup-with-archive recovery to my arsenal of test
workloads to make sure I don't forget about archive recovery mode
again, but I think it's actually harder to get this wrong in the new
design.  In the meantime, if you are still interested in studying the
potential speed-up from WAL prefetching using the most recently shared
two-XLogReader patch, you'll need to unpack all your archived WAL
files into pg_wal manually beforehand.

OK, thanks for looking into this. I guess I'll wait for an updated patch
before testing this further. The storage has limited capacity so I'd
have to either reduce the amount of data/WAL or juggle with the WAL
segments somehow. Doesn't seem worth it.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: VACUUM (INTERRUPTIBLE)?

Hi,

On 2020-09-08 22:30:40 +0200, Magnus Hagander wrote:
> One thing I've been wanting many times but never properly got around to
> investigating how much work it would be to make happen, was to be able to
> trigger an autovacuum manually (yeah, strange choice of terms). That is,
> instead of the above, you'd have something like "VACUUM BACKGROUND" which
> would trigger an autovacuum worker to do the work, and then release your
> session. The points then being both (1) the ability to interrupt it, and
> (2) that it'd run in the backgorund and thus the foreground session could
> disconnect.
> 
> I think both would probably solve your problem, and being able to trigger a
> background one would add some extra value? But we'd have to figure out and
> be clear about what to do if all workers are busy for example - queue or
> error?
> 
> Worth considering, or am I missing something?

It seems like it could be useful in general. Not that much for my case
however. It'd be much easier to test whether vacuum was successfully
cancelled if we can see the cancellation, which we can't in the
autovacuum case. And there'd likely be some fairly ugly logic around
needing to wait until the "autovacuum request" is processed etc,
including the case that all workers are currently busy.

So my INTERRUPTIBLE idea seems to be a better fit for the tests, and
independently quite useful. E.g. wanting to know whether VACUUM errored
out is useful for scripts that want their VACUUMs to be interruptible,
and that doesn't work well with the "backgrounding" idea of yours.

Having said that, your idea does seem like it could be helpful. The
difficulty seems to depend a bit on the exact desired
semantics. E.g. would the "queue" command block until vacuum started or
not? The latter would obviously be much easier...

Greetings,

Andres Freund

Re: default partition and concurrent attach partition

On 2020-Sep-08, Amit Langote wrote:

> Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> TupleDesc matches the partition's.  Actually we do have such dedicated
> slots for partitions around (for both sub-partitioned and leaf
> partitions), so we can simply use them instead of creating one from
> scratch for every use.  It did take some code rearrangement to
> actually make that work though.

Pushed, thanks for working on this fix.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: VACUUM (INTERRUPTIBLE)?

Andres Freund  writes:
> So it doesn't seem that useful to not make manual analyze interruptible?

I find the main attraction of this idea is that instead of saying
"autovacuum/autoanalyze have these magic behaviors", we can say
"autovacuum/autoanalyze use option FOO".  So whatever behavior
autoanalyze has today, we should make available to manual analyze,
without quibbling too hard over how much of a use-case there is.

regards, tom lane

Re: default partition and concurrent attach partition

On 2020-Sep-08, Alvaro Herrera wrote:

> Andres added to CC because of TTS interface: apparently calling
> slot_getallattrs() on a virtual slot raises error that "getsomeattrs is
> not required to be called on a virtual tuple table slot".  I'm thinking
> that this exposes implementation details that should not be necessary
> for a caller to know; patch 0001 fixes that at the problematic caller by
> making the slot_getallatrs() call conditional on the slot not being
> virtual, but I wonder if the better fix isn't to remove the elog(ERROR)
> at tts_virtual_getsomeattrs.

Actually I misread the code, so this is all bogus.  Nevermind ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: More aggressive vacuuming of temporary tables

Andres Freund  writes:
> On 2020-09-08 15:24:54 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> But now I do wonder why we need to know whether the command is top level
>>> or not? Why isn't the correct thing to instead look at what the current
>>> backend's xmin is? Seems like you could just replace
>>> *oldestXmin = XidFromFullTransactionId(ReadNextFullTransactionId());
>>> with
>>> *oldestXmin = MyProc->xmin;
>>> Assert(TransactionIdIsValid(*oldestXmin));

>> Ummm ... since VACUUM doesn't run inside a transaction, it won't be
>> advertising an xmin will it?

> We do run it in a transaction though:

Ah.  But still, this is not the semantics I want for VACUUM, because the
process xmin will involve other processes' behavior.  The point of the
change as far as I'm concerned is to ensure vacuuming of dead temp rows
independent of anything else in the system.

regards, tom lane

Re: More aggressive vacuuming of temporary tables

Hi,

On 2020-09-08 15:24:54 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > But now I do wonder why we need to know whether the command is top level
> > or not? Why isn't the correct thing to instead look at what the current
> > backend's xmin is? Seems like you could just replace
> > *oldestXmin = XidFromFullTransactionId(ReadNextFullTransactionId());
> > with
> > *oldestXmin = MyProc->xmin;
> > Assert(TransactionIdIsValid(*oldestXmin));
> 
> Ummm ... since VACUUM doesn't run inside a transaction, it won't be
> advertising an xmin will it?

We do run it in a transaction though:

static bool
vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
{
...
/* Begin a transaction for vacuuming this relation */
StartTransactionCommand();

/*
 * Need to acquire a snapshot to prevent pg_subtrans from being 
truncated,
 * cutoff xids in local memory wrapping around, and to have updated xmin
 * horizons.
 */
PushActiveSnapshot(GetTransactionSnapshot());

> Maybe you could make something like this work, but I think it'd still
> have to treat CLUSTER as a special case.  Not sure it's worth it.

Why would CLUSTER need to be special cased? We'd precisely retain the
rows we need to, I think? Given that we'd exactly use the snapshot that
rules that determines which rows need to be retained?

Greetings,

Andres Freund

SIGQUIT handling, redux

This is to pull together a couple of recent threads that are seemingly
unrelated to $SUBJECT.

Over in the long thread at [1] we've agreed to try to make the backend
code actually do what assorted comments claim it does, namely allow
SIGQUIT to be accepted at any point after a given process has established
its signal handlers. (Right now, it mostly is not accepted during error
recovery, but there's no clear reason to preserve that exception.)

Of course, this is only safe if the SIGQUIT handler is safe to be invoked
anywhere, so I did a quick survey of backend signal handlers to see if
that is true. I immediately found pgarch_exit() which surely is not. It
turns out that Horiguchi-san already noticed that and proposed to fix it,
within the seemingly entirely unrelated patch series for a shared-memory
based stats collector (see patch 0001 in [2]). I think we should go ahead
and commit that, and maybe even back-patch it. There seems no good reason
for the archiver to treat SIGQUIT as nonfatal when other postmaster
children do not.

Another thing I found is startup_die() in postmaster.c, which catches
SIGQUIT during the authentication phase of a new backend. This calls
proc_exit(1), which (a) is unsafe on its face, and (b) potentially
demotes a SIGQUIT into something that will not cause the parent postmaster
to perform a DB-wide restart. There seems no good reason for that
behavior, either. I propose that we use SignalHandlerForCrashExit
for SIGQUIT here too.

In passing, it's worth noting that startup_die() isn't really much safer
for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
those is that code that applies BlockSig will at least manage to block the
former. So it's slightly tempting to drop startup_die() altogether in
favor of using SignalHandlerForCrashExit for the SIGTERM-during-auth case
too. However, that'd have the effect of causing a "fast shutdown" to get
converted to a panic if it happens while any sessions are in
authentication phase, so I think this cure is likely worse than the
disease.

Worth reading in connection with this is the thread that led up to
commits 8e19a8264 et al [3]. We had a long discussion about how
quickdie() and startup_die() might be made safe(r), without any
real consensus emerging about any alternatives being better than the
status quo. I still don't have an improvement idea for quickdie();
I don't want to give up trying to send a message to the client.
Note, however, that quickdie() does end with _exit(2) so that at
least it's not trying to execute arbitrary process-cleanup code.

In short then, I want to ensure that postmaster children's SIGQUIT
handlers always end with _exit(2) rather than some other exit method.
We have two exceptions now and they need to get fixed.

regards, tom lane

[1]
https://www.postgresql.org/message-id/flat/CALDaNm1d1hHPZUg3xU4XjtWBOLCrA%2B-2cJcLpw-cePZ%3DGgDVfA%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/20200908.175557.617150409868541587.horikyota.ntt%40gmail.com
[3]
https://www.postgresql.org/message-id/flat/CAOMx_OAuRUHiAuCg2YgicZLzPVv5d9_H4KrL_OFsFP%3DVPekigA%40mail.gmail.com

Re: VACUUM (INTERRUPTIBLE)?

Hi,

On 2020-09-08 17:36:00 -0300, Alvaro Herrera wrote:
> On 2020-Sep-08, Andres Freund wrote:
> 
> > That made me wonder if it'd be worthwhile to add an option that'd make
> > user invoked VACUUM be interruptible by conflicting lock requests, just
> > like autovacuum is.
> 
> Yeah, I recall a request for this in the past, too.
> 
> > So how about adding an INTERRUPTIBLE option to VACUUM and ANALYZE?
> 
> +1 on adding it to VACUUM.  I'm not sure about ANALYZE ... most of the
> time it is not as bad as VACUUM in terms of blocking other things, and
> things get ugly if that ANALYZE is part of a transaction.  I think I'd
> leave that out, since we don't have to cover all DDL that could take
> ShareUpdateExclusive lock (CIC etc).  VACUUM is especially problematic,
> ISTM, which is we would do it for that.

ANALYZE sometimes is worse than VACUUM in some ways, it'll often take
longer on large tables (more random IO, no freeze map equivalent). And
there's still TRUNCATE / DROP TABLE etc that you want to be able to
interrupt (auto-)analyze.

I'm not sure it's really necessary to distinguish transaction /
non-transactional for ANALYZE. We continue to wait for the lock, right?
So signalling the transaction is fine, it'll error out. Even in the
case of ANALYZE in a subtransaction, we'll not do something incorrect,
we might just wait, if the top level transaction also has a lock, and
doesn't abort immediately.

So it doesn't seem that useful to not make manual analyze interruptible?

Greetings,

Andres Freund

Re: Ideas about a better API for postgres_fdw remote estimates

2020-09-08 Thread Tomas Vondra


On Tue, Sep 08, 2020 at 05:55:09PM +0530, Ashutosh Bapat wrote:

On Fri, 4 Sep 2020 at 20:27, Tomas Vondra 
wrote




4) I wonder if we actually want/need to simply output pg_statistic data
verbatim like this. Is postgres_fdw actually going to benefit from it? I
kinda doubt that, and my assumption was that we'd return only a small
subset of the data, needed by get_remote_estimate.

This has a couple of issues. Firstly, it requires the knowledge of what
the stakind constants in pg_statistic mean and how to interpret it - but
OK, it's true that does not change very often (or at all). Secondly, it
entirely ignores extended statistics - OK, we might extract those too,
but it's going to be much more complex. And finally it entirely ignores
costing on the remote node. Surely we can't just apply local
random_page_cost or whatever, because those may be entirely different.
And we don't know if the remote is going to use index etc.

So is extracting data from pg_statistic the right approach?



There are two different problems, which ultimately might converge.
1. If use_remote_estimates = false, more generally if querying costs from
foreign server for costing paths is impractical, we want to use local
estimates and try to come up with costs. For that purpose we keep some
statistics locally and user is expected to refresh it periodically by
running ANALYZE on the foreign table. This patch is about a. doing this
efficiently without requiring to fetch every row from the foreign server b.
through autovacuum automatically without user firing ANALYZE. I think this
also answers your question about vacuum_rel() above.

2. How to efficiently extract costs from an EXPLAIN plan when
use_remote_eestimates is true. That's the subject of some nearby thread. I
think you are referring to that problem here. Hence your next point.



I think that was the topic of *this* thread as started by Tom, but I now
realize Andrey steered it in the direction to allow re-using remote
stats. Which seems useful too, but it confused me a bit.


Using EXPLAIN to get costs from the foreign server isn't efficient. It
increases planning time a lot; sometimes planning time exceeds execution
time. If usage of foreign tables becomes more and more common, this isn't
ideal. I think we should move towards a model in which the optimizer can
decide whether a subtree involving a foreign server should be evaluated
locally or on the foreign server without the help of foreign server. One
way to do it (I am not saying that this is the only or the best way) is to
estimate the cost of foreign query locally based on the information
available locally about the foreign server and foreign table. This might
mean that we have to get that information from the foreign server and cache
it locally and use it several times, including the indexes on foreign
table, values of various costs etc. Though this approach doesn't solve all
of those problems it's one step forward + it makes the current scenario
also efficient.



True, but that ptoject is way more ambitious than providing a simple API
for postgres_fdw to obtain the estimates more efficiently.


I agree that the patch needs some work though, esp the code dealing with
serialization and deserialization of statistics.


I think there's a bunch of open questions, e.g. what to do with extended
statistics - for example what should happen when the extended statistics
object is defined only on local/remote server, or when the definitions
don't match? What should happen when the definitions don't match? This
probably is not an issue for "regular" stats, because that seems pretty
stable, but for extended stats there are differences between versions.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-09-08 Thread Heikki Linnakangas


On 08/09/2020 22:25, James Coleman wrote:

On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas  wrote:


You could also apply the optimization for non-Const expressions, as long
as they're stable (i.e. !contain_volatile_functions(expr)).


Is that true? Don't we also have to worry about whether or not the
value is stable (i.e., know when a param has changed)? There have been
discussions about being able to cache stable subexpressions, and my
understanding was that we'd need to have that infrastructure (along
with the necessarily invalidation when the param changes) to be able
to use this for non-Const expressions.


Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And 
Vars. I think the conditions are the same as those checked in 
match_clause_to_partition_key() in partprune.c (it's a long function, 
search for "if (!IsA(rightop, Const))"). Not sure it's worth the 
trouble, then. But it would be nice to handle queries like "WHERE column 
= ANY ($1)"



Deconstructing the array Datum into a simple C array on first call would
be a win even for very small arrays and for AND semantics, even if you
don't use a hash table.


Because you wouldn't have to repeatedly detoast it? Or some other
reason I'm not thinking of? My intuition would have been that (aside
from detoasting if necessary) there wouldn't be much real overhead in,
for example, an array storing integers.


Dealing with NULLs and different element sizes in the array is pretty 
complicated. Looping through the array currently looks like this:


/* Loop over the array elements */
s = (char *) ARR_DATA_PTR(arr);
bitmap = ARR_NULLBITMAP(arr);
bitmask = 1;

for (int i = 0; i < nitems; i++)
{
Datum   elt;
Datum   thisresult;

/* Get array element, checking for NULL */
if (bitmap && (*bitmap & bitmask) == 0)
{
fcinfo->args[1].value = (Datum) 0;
fcinfo->args[1].isnull = true;
}
else
{
elt = fetch_att(s, typbyval, typlen);
s = att_addlength_pointer(s, typlen, s);
s = (char *) att_align_nominal(s, typalign);
fcinfo->args[1].value = elt;
fcinfo->args[1].isnull = false;
}

[do stuff with Datum/isnull]

/* advance bitmap pointer if any */
if (bitmap)
{
bitmask <<= 1;
if (bitmask == 0x100)
{
bitmap++;
bitmask = 1;
}
}
}

Compared with just:

for (int i = 0; i < nitems; i++)
{
Datum   elt = datums[i];

[do stuff with the Datum]
}

I'm not sure how much difference that makes, but I presume it's not 
zero, and it seems like an easy win when you have the code to deal with 
the Datum array representation anyway.


- Heikki

Re: VACUUM (INTERRUPTIBLE)?

On 2020-Sep-08, Andres Freund wrote:

> That made me wonder if it'd be worthwhile to add an option that'd make
> user invoked VACUUM be interruptible by conflicting lock requests, just
> like autovacuum is.

Yeah, I recall a request for this in the past, too.

> So how about adding an INTERRUPTIBLE option to VACUUM and ANALYZE?

+1 on adding it to VACUUM.  I'm not sure about ANALYZE ... most of the
time it is not as bad as VACUUM in terms of blocking other things, and
things get ugly if that ANALYZE is part of a transaction.  I think I'd
leave that out, since we don't have to cover all DDL that could take
ShareUpdateExclusive lock (CIC etc).  VACUUM is especially problematic,
ISTM, which is we would do it for that.

> Alternatively, if we do want to restrict it to VACUUM and ANALYZE, we'd
> have to re-introduce PROC_IN_ANALYZE ;). After 12 years of not being
> used and removed just weeks ago...

Hah :-)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: VACUUM (INTERRUPTIBLE)?

2020-09-08 Thread Magnus Hagander

On Tue, Sep 8, 2020 at 8:49 PM Andres Freund  wrote:

> Hi,
>
> Jeff Janes in [1] found that I ([2]) broke autovacuum cancellations. It
> obviously would be good to add a test for this, but it seems hard to
> have that be reliable given how long it can take for autovacuum actually
> get around to vacuum a specific table.
>
> That made me wonder if it'd be worthwhile to add an option that'd make
> user invoked VACUUM be interruptible by conflicting lock requests, just
> like autovacuum is.
>
> That's actually something I've wanted, and see other people want, a
> couple times. Sometimes one wants to vacuum a table, but not block out
> more important things like DDL. Which isn't really possible right now.
>
> So how about adding an INTERRUPTIBLE option to VACUUM and ANALYZE?
>
> Implementation wise it seems like the best way to implement this would
> be to replace PROC_VACUUM_FOR_WRAPAROUND with
> PROC_INTERRUPTIBLE_VACANALYZE or such (i.e. inverting the meaning).
>
> One question I'm a bit split on is whether we'd want to continue
> restricting the signalling in ProcSleep() to commands running VACUUM or
> ANALYZE.
>
> We could have a generic PROC_INTERRUPTIBLE_COMMAND or such, and have
> deadlock.c / proc.c trigger whether to kill based on just that, without
> looking at PROC_IS_AUTOVACUUM / PROC_IN_VACUUM.
>
> Alternatively, if we do want to restrict it to VACUUM and ANALYZE, we'd
> have to re-introduce PROC_IN_ANALYZE ;). After 12 years of not being
> used and removed just weeks ago...
>

One thing I've been wanting many times but never properly got around to
investigating how much work it would be to make happen, was to be able to
trigger an autovacuum manually (yeah, strange choice of terms). That is,
instead of the above, you'd have something like "VACUUM BACKGROUND" which
would trigger an autovacuum worker to do the work, and then release your
session. The points then being both (1) the ability to interrupt it, and
(2) that it'd run in the backgorund and thus the foreground session could
disconnect.

I think both would probably solve your problem, and being able to trigger a
background one would add some extra value? But we'd have to figure out and
be clear about what to do if all workers are busy for example - queue or
error?

Worth considering, or am I missing something?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: default partition and concurrent attach partition

Hi,

Andres added to CC because of TTS interface: apparently calling
slot_getallattrs() on a virtual slot raises error that "getsomeattrs is
not required to be called on a virtual tuple table slot".  I'm thinking
that this exposes implementation details that should not be necessary
for a caller to know; patch 0001 fixes that at the problematic caller by
making the slot_getallatrs() call conditional on the slot not being
virtual, but I wonder if the better fix isn't to remove the elog(ERROR)
at tts_virtual_getsomeattrs.

Moving on from that -- this is a version of Amit's previous patch that I
like better.  I think the "prev_myslot" thing was a bit ugly, but it
turns out that with the above fix we can clear the slot before creating
the new one, making things more sensible.  I also changed the "do {}
while ()" into a simple "while {}" loop, which is sensible since the
condition is true on loop entrance.  Minor comment rewording also.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From ef139f89531ba15f480cbb64c2bddeee03dc20ab Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 8 Sep 2020 16:55:30 -0300
Subject: [PATCH v3 1/2] Don't "getsomeattrs" on virtual slots
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

execute_attr_map_slot() was not careful about not calling
slot_getallattrs() on tuple slots that could be virtual.  Repair.

Author: Álvaro Herrera 
---
 src/backend/access/common/tupconvert.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/common/tupconvert.c b/src/backend/access/common/tupconvert.c
index 3cb0cbefaa..d440c1201a 100644
--- a/src/backend/access/common/tupconvert.c
+++ b/src/backend/access/common/tupconvert.c
@@ -193,7 +193,8 @@ execute_attr_map_slot(AttrMap *attrMap,
 	outnatts = out_slot->tts_tupleDescriptor->natts;
 
 	/* Extract all the values of the in slot. */
-	slot_getallattrs(in_slot);
+	if (!TTS_IS_VIRTUAL(in_slot))
+		slot_getallattrs(in_slot);
 
 	/* Before doing the mapping, clear any old contents from the out slot */
 	ExecClearTuple(out_slot);
-- 
2.20.1

>From e5a2b5ddbbb55dd9a83474f3f55b8f8724a3a63e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 3 Sep 2020 13:18:35 -0400
Subject: [PATCH v3 2/2] Check default partitions constraints while descending
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Partitioning tuple route code assumes that the partition chosen while
descending the partition hierarchy is always the correct one.  This is
true except when the partition is the default partition and another
partition has been added concurrently: the partition constraint changes
and we don't recheck it.  This can lead to tuples mistakenly being added
to the default partition that should have been rejected.

Fix by rechecking the default partition constraint while descending the
hierarchy.

An isolation test based on the reproduction steps described by Hao Wu
(with tweaks for extra coverage) is included.

Reported by: Hao Wu 
Author: Amit Langote 
Author: Álvaro Herrera 
Discussion: https://postgr.es/m/ca+hiwqfqbmcssap4sfncbuel_vfommekaq3gwuhyfa4c7j_...@mail.gmail.com
Discussion: https://postgr.es/m/dm5pr0501mb3910e97a9edfb4c775cf3d75a4...@dm5pr0501mb3910.namprd05.prod.outlook.com
---
 src/backend/executor/execPartition.c  | 127 ++
 .../expected/partition-concurrent-attach.out  |  49 +++
 src/test/isolation/isolation_schedule |   1 +
 .../specs/partition-concurrent-attach.spec|  43 ++
 4 files changed, 195 insertions(+), 25 deletions(-)
 create mode 100644 src/test/isolation/expected/partition-concurrent-attach.out
 create mode 100644 src/test/isolation/specs/partition-concurrent-attach.spec

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 79fcbd6b06..bc2372959c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -51,6 +51,11 @@
  *		PartitionDispatchData->indexes for details on how this array is
  *		indexed.
  *
+ * nonleaf_partitions
+ *		Array of 'max_dispatch' elements containing pointers to fake
+ *		ResultRelInfo objects for nonleaf partitions, useful for checking
+ *		the partition constraint.
+ *
  * num_dispatch
  *		The current number of items stored in the 'partition_dispatch_info'
  *		array.  Also serves as the index of the next free array element for
@@ -89,6 +94,7 @@ struct PartitionTupleRouting
 {
 	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
+	ResultRelInfo **nonleaf_partitions;
 	int			num_dispatch;
 	int			max_dispatch;
 	ResultRelInfo **partitions;
@@ -280,9 +286,11 @@ ExecFindPartition(ModifyTableState *mtstate,
 	PartitionDispatch dispatch;
 	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
-	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scant

Re: Collation versioning

2020-09-08 Thread Julien Rouhaud

On Tue, Sep 08, 2020 at 09:33:26PM +0200, Christoph Berg wrote:
> Re: Julien Rouhaud
> > Here's the rationale for this new flag, extracted from the patch's commit
> > message:
> > 
> > Add a new amnostablecollorder flag in IndexAmRoutine.
> > 
> > This flag indicates if the access method does not rely on a stable collation
> > ordering for deterministic collation, i.e. would not be corrupted if the
> > underlying collation library changes its ordering.  This is done this way to
> > make sure that if any external access method isn't updated to correctly 
> > setup
> > this flag it will be assumed to rely on a stable collation ordering as this
> > should be the case for the majority of access methods.
> > 
> > This flag will be useful for an upcoming commit that will add detection of
> > possibly corrupted index due to changed collation library version.
> 
> Hmm. Does that flag gain us much? What about non-collation locale
> changes that might still affect indexes like lower() and the citext
> extension? That still depends on locale changes, but that flag
> wouldn't be able to help with "this index is (not) affected by this
> locale change".
> 
> IOW, I think we should aim at simply tracking the version, and leave
> it to the admin (with the help of supplied SQL queries) to either
> rebuild indexes or waive them.
> 
> Or maybe I misunderstood the problem.
> 

I see your point, and indeed this isn't really clear how the flag will be used
given this description.

I guess that my idea is that how exactly an index depends on a stable
collation ordering isn't part of this flag definition, as it should be the same
for any access method.

In the commit that uses the infrastructure, the lack of stable ordering
requirement is only used for index key column, but not for any
expression, whether on index key or qual, because as you mention there's no
guarantee that the expression itself depends on a stable ordering or not.

There could be some improvements done for some simple case (like maybe md5() is
used often enough in index keys that it could be worth to detect), but nothing
in done to attempt that for now.

Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Bharath Rupireddy  writes:
> Attaching a patch with both the comments modification and changing
> DEBUG3 to ERROR. make check and make world-check passes on this patch.

I pushed this after simplifying the ereport down to an elog.  I see
no reason to consider this a user-facing error, so there's no need
to make translators deal with the message.

regards, tom lane

Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-09-08 Thread Justin Pryzby

On Sat, Jul 18, 2020 at 03:15:32PM -0500, Justin Pryzby wrote:
> Still waiting for feedback from a committer.

This patch has been waiting for input from a committer on the approach I've
taken with the patches since March 10, so I'm planning to set to "Ready" - at
least ready for senior review.

-- 
Justin

Re: Collation versioning

2020-09-08 Thread Christoph Berg

Re: Julien Rouhaud
> Here's the rationale for this new flag, extracted from the patch's commit
> message:
> 
> Add a new amnostablecollorder flag in IndexAmRoutine.
> 
> This flag indicates if the access method does not rely on a stable collation
> ordering for deterministic collation, i.e. would not be corrupted if the
> underlying collation library changes its ordering.  This is done this way to
> make sure that if any external access method isn't updated to correctly setup
> this flag it will be assumed to rely on a stable collation ordering as this
> should be the case for the majority of access methods.
> 
> This flag will be useful for an upcoming commit that will add detection of
> possibly corrupted index due to changed collation library version.

Hmm. Does that flag gain us much? What about non-collation locale
changes that might still affect indexes like lower() and the citext
extension? That still depends on locale changes, but that flag
wouldn't be able to help with "this index is (not) affected by this
locale change".

IOW, I think we should aim at simply tracking the version, and leave
it to the admin (with the help of supplied SQL queries) to either
rebuild indexes or waive them.

Or maybe I misunderstood the problem.

Christoph

Re: Boundary value check in lazy_tid_reaped()

2020-09-08 Thread Peter Geoghegan

On Tue, Sep 8, 2020 at 1:23 AM Masahiko Sawada
 wrote:
> > > I wonder if you would also see a speed-up with a bsearch() replacement
> > > that is inlineable, so it can inline the comparator (instead of
> > > calling it through a function pointer).  I wonder if something more
> > > like (lblk << 32 | loff) - (rblk << 32 | roff) would go faster than
> > > the branchy comparator.
> >
> > Erm, off course that expression won't work... should be << 16, but
> > even then it would only work with a bsearch that uses int64_t
> > comparators, so I take that part back.
>
> Yeah, it seems to worth benchmarking the speed-up with an inlining.
> I'll do some performance tests with/without inlining on top of
> checking boundary values.

It sounds like Thomas was talking about something like
itemptr_encode() + itemptr_decode(). In case you didn't know, we
actually do something like this for the TID tuplesort used for CREATE
INDEX CONCURRENTLY.


-- 
Peter Geoghegan

Re: PATCH: Attempt to make dbsize a bit more consistent

2020-09-08 Thread David Zhang




I found the function "table_relation_size" is only used by buffer 
manager for "RELKIND_RELATION", "RELKIND_TOASTVALUE" and 
"RELKIND_MATVIEW", i.e.


        case RELKIND_RELATION:
        case RELKIND_TOASTVALUE:
        case RELKIND_MATVIEW:
            {
                /*
                 * Not every table AM uses BLCKSZ wide fixed size blocks.
                 * Therefore tableam returns the size in bytes - but 
for the

                 * purpose of this routine, we want the number of blocks.
                 * Therefore divide, rounding up.
                 */
                uint64        szbytes;

                szbytes = table_relation_size(relation, forkNum);

                return (szbytes + (BLCKSZ - 1)) / BLCKSZ;
            }

So using "calculate_relation_size" and "calculate_toast_table_size" in 
"calculate_table_size" is easy to understand and the original logic is 
simple.



On 2020-08-27 6:38 a.m., gkokola...@pm.me wrote:

Hi all,

this minor patch is attempting to force the use of the tableam api in dbsize 
where ever it is required.

Apparently something similar was introduced for toast relations only. 
Intuitively it seems that the distinction between a table and a toast table is 
not needed. This patch treats tables, toast tables and materialized views 
equally.

Regards,
//Georgios

Best regards,
--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas  wrote:
>
> On 01/05/2020 05:20, James Coleman wrote:
> > On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
> >  wrote:
> > ...
> >> Any particular reasons to pick dynahash over simplehash? ISTM we're
> >> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> >> while there are not many places using dynahash for simple short-lived
> >> hash tables. Of course, that alone is a weak reason to insist on using
> >> simplehash here, but I suppose there were reasons for not using dynahash
> >> and we'll end up facing the same issues here.
> >
> > I've attached a patch series that includes switching to simplehash.
> > Obviously we'd really just collapse all of these patches, but it's
> > perhaps interesting to see the changes required to use each style
> > (binary search, dynahash, simplehash).
> >
> > As before, there are clearly comments and naming things to be
> > addressed, but the implementation should be reasonably clean.
>
> Looks good, aside from the cleanup work that you mentioned. There are a
> few more cases that I think you could easily handle with very little
> extra code:
>
> You could also apply the optimization for non-Const expressions, as long
> as they're stable (i.e. !contain_volatile_functions(expr)).

Is that true? Don't we also have to worry about whether or not the
value is stable (i.e., know when a param has changed)? There have been
discussions about being able to cache stable subexpressions, and my
understanding was that we'd need to have that infrastructure (along
with the necessarily invalidation when the param changes) to be able
to use this for non-Const expressions.

> Deconstructing the array Datum into a simple C array on first call would
> be a win even for very small arrays and for AND semantics, even if you
> don't use a hash table.

Because you wouldn't have to repeatedly detoast it? Or some other
reason I'm not thinking of? My intuition would have been that (aside
from detoasting if necessary) there wouldn't be much real overhead in,
for example, an array storing integers.

James

Re: More aggressive vacuuming of temporary tables

Andres Freund  writes:
> But now I do wonder why we need to know whether the command is top level
> or not? Why isn't the correct thing to instead look at what the current
> backend's xmin is? Seems like you could just replace
> *oldestXmin = XidFromFullTransactionId(ReadNextFullTransactionId());
> with
> *oldestXmin = MyProc->xmin;
> Assert(TransactionIdIsValid(*oldestXmin));

Ummm ... since VACUUM doesn't run inside a transaction, it won't be
advertising an xmin will it?

Maybe you could make something like this work, but I think it'd still
have to treat CLUSTER as a special case.  Not sure it's worth it.

regards, tom lane

Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-06-07 11:24:50 +0300, Michail Nikolaev wrote:
> Hello, hackers.
> Andres, nice work!
> 
> Sorry for the off-top.
> 
> Some of my work [1] related to the support of index hint bits on
> standby is highly interfering with this patch.
> Is it safe to consider it committed and start rebasing on top of the patches?

Sorry, I missed this email. Since they're now committed, yes, it is safe
;)

Greetings,

Andres Freund

Re: Any objections to implementing LogicalDecodeMessageCB for pgoutput?

Hi,

On 2020-07-29 22:26:04 -0500, David Pirotte wrote:
> FWIW, we have been using pg_logical_emit_message to send application-level
> events alongside our change-data-capture for about two years, and we would
> move this part of our stack to pgoutput if message support was available.

Yea, it's really useful for this kind of thing.

> @@ -119,14 +124,16 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
>  
>  static void
>  parse_output_parameters(List *options, uint32 *protocol_version,
> - List **publication_names, bool 
> *binary)
> + List **publication_names, bool 
> *binary, bool *messages)

I think it might be time to add a PgOutputParameters struct, instead of
adding more and more output parameters to
parse_output_parameters. Alternatively just passing PGOutputData owuld
make sense.

> diff --git a/src/test/subscription/t/015_messages.pl 
> b/src/test/subscription/t/015_messages.pl
> new file mode 100644
> index 00..4709e69f4e
> --- /dev/null
> +++ b/src/test/subscription/t/015_messages.pl

A test verifying that a non-transactional message in later aborted
transaction is handled correctly would be good.

Greetings,

Andres Freund

Re: INSERT ON CONFLICT and RETURNING

2020-09-08 Thread Pavel Stehule

út 8. 9. 2020 v 12:34 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 08.09.2020 12:34, Pavel Stehule wrote:
>
>
>
> út 8. 9. 2020 v 11:06 odesílatel Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> napsal:
>
>> I have performed comparison of different ways of implementing UPSERT in
>> Postgres.
>> May be it will be interesting not only for me, so I share my results:
>>
>> So first of all initialization step:
>>
>>   create table jsonb_schemas(id serial, schema bytea primary key);
>>   create unique index on jsonb_schemas(id);
>>   insert into jsonb_schemas (schema) values ('some') on conflict(schema)
>> do nothing returning id;
>>
>> Then I test performance of getting ID of exitsed schema:
>>
>> 1. Use plpgsql script to avoid unneeded database modifications:
>>
>> create function upsert(obj_schema bytea) returns integer as $$
>> declare
>>   obj_id integer;
>> begin
>>   select id from jsonb_schemas where schema=obj_schema into obj_id;
>>   if obj_id is null then
>> insert into jsonb_schemas (schema) values (obj_schema) on
>> conflict(schema) do nothing returning id into obj_id;
>> if obj_id is null then
>>   select id from jsonb_schemas where schema=obj_schema into obj_id;
>> end if;
>>   end if;
>>   return obj_id;
>> end;
>> $$ language plpgsql;
>>
>
> In parallel execution the plpgsql variant can fail. The possible raise
> conditions are not handled.
>
> So maybe this is the reason why this is really fast.
>
>
> With this example I model real use case, where we need to map long key
> (json schema in this case) to short  identifier (serial column in this
> case).
> Rows of jsonb_schemas are never updated: it is append-only dictionary.
> In this assumption no race condition can happen with this PLpgSQL
> implementation (and other implementations of UPSERT as well).
>

I am not sure, but I think this should  be a design and behavior of MERGE
statement - it is designed for OLAP (and speed). Unfortunately, this
feature stalled (and your benchmarks show so there is clean performance
benefit).

Regards

Pavel

>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>

Re: Missing "Up" navigation link between parts and doc root?

2020-09-08 Thread Bruce Momjian

On Sun, Sep  6, 2020 at 04:59:11PM +0200, Peter Eisentraut wrote:
> On 2020-08-25 21:48, Bruce Momjian wrote:
> > On Sat, Jul  4, 2020 at 08:47:53AM +0200, Fabien COELHO wrote:
> > > 
> > > Hello Peter,
> > > 
> > > > The original stylesheets explicitly go out of their way to do it that
> > > > way. We can easily fix that by removing that special case.  See attached
> > > > patch.
> > > > 
> > > > That patch only fixes it for the header.  To fix it for the footer as
> > > > well, we'd first need to import the navfooter template to be able to
> > > > customize it.
> > > 
> > > Thanks for the patch, which applies cleanly, doc compiles, works for me 
> > > with
> > > w3m.
> > > 
> > > > Not a big problem though.
> > > 
> > > Nope, just mildly irritating for quite a long time:-) So I'd go for back
> > > patching if it applies cleanly.
> > 
> > Can we get Peter's patch for this applied soon?  Thanks.  Should I apply
> > it?
> 
> I have made the analogous changes to the footer as well and committed this.

I see this only applied to master.  Shouldn't this be backpatched?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Yet another fast GiST build

2020-09-08 Thread Heikki Linnakangas

On 08/09/2020 21:33, Pavel Borisov wrote:

 > Thanks! Did you measure the quality of the built index somehow? The
 > ordering shouldn't make any difference to the build speed, but it
 > affects the shape of the resulting index and the speed of queries
 > against it.

Again I've tried random select tests near axes and haven't noticed any 
performance difference between ordinary gist build and z-ordered one. 
The same is for selects far from axes. Theoretically, there may be a 
possible slowdown for particular points inside the MBR which crosses the 
axis but I haven't tried to dig so deep and haven't tested performance 
as a function of coordinate.

So I feel this patch is not about select speed optimization.

Ok, thank for confirming.

I've been reviewing the patch today. The biggest changes I've made have 
been in restructuring the code in gistbuild.c for readability, but there 
are a bunch of smaller changes throughout. Attached is what I've got so 
far, squashed into one patch. I'm continuing to review it, but a couple 
of questions so far:

In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive 
== false'. That seems fishy, surely we need to index recently-dead 
tuples, too. The normal index build path isn't skipping them either.

How does the 'sortsupport' routine interact with 
'compress'/'decompress'? Which representation is passed to the 
comparator routine: the original value from the table, the compressed 
representation, or the decompressed representation? Do the 
comparetup_index_btree() and readtup_index() routines agree with that?

- Heikki
>From 7a9331bbd43799150d6a0b9dad2e98604c6b7dfc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 8 Sep 2020 21:56:41 +0300
Subject: [PATCH v16 1/1] Add sort support for point gist_point_sortsupport

Implement GiST build using sort support

We use special sorting function provided by opclass to approximate
GiST tree with B-tree-like structure. This approach allows to
radically reduce build time in some cases.

Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
Reviewed-by: Pavel Borisov, Thomas Munro
---
 doc/src/sgml/gist.sgml  |  64 +++
 src/backend/access/gist/gistbuild.c | 511 
 src/backend/access/gist/gistproc.c  | 154 +++
 src/backend/access/gist/gistutil.c  |  59 ++-
 src/backend/access/gist/gistvalidate.c  |   6 +-
 src/backend/access/transam/xloginsert.c |  57 +++
 src/backend/utils/sort/tuplesort.c  |  37 ++
 src/include/access/gist.h   |   3 +-
 src/include/access/gist_private.h   |   3 +
 src/include/access/xloginsert.h |   2 +
 src/include/catalog/pg_amproc.dat   |   2 +
 src/include/catalog/pg_proc.dat |   3 +
 src/include/utils/tuplesort.h   |   6 +
 13 files changed, 801 insertions(+), 106 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..bc45b3260f2 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to speed up
+   building a GiST index.

@@ -1065,6 +1067,68 @@ my_compress(PG_FUNCTION_ARGS)

+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator routine preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+The SQL declaration of the function must look like this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+The argument is a pointer to a SortSupport struct.
+At a minimum, the function must fill in its comparator field, the full API
+is defined in src/include/utils/sortsupport.h.
+   
+
+   
+The matching code in the C module could then follow this skeleton:
+
+
+PG_FUNCTION_INFO_V1(my_sortsupport);
+
+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+  /* establish order between x and y by computing some sorting value z */
+
+  int z1 = ComputeSpatialCode(x);
+  int z2 = ComputeSpatialCode(y);
+
+  return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}
+
+Datum
+my_sortsupport(PG_FUNCTION_ARGS)
+{
+  SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+  ssup->comparator = my_fastcmp

Re: More aggressive vacuuming of temporary tables

Hi,

On 2020-08-28 11:46:49 -0400, Tom Lane wrote:
> It strikes me that when we are vacuuming a temporary table (which
> necessarily will be one of our own session), we don't really need
> to care what the global xmin horizon is.  If we're not inside a
> user transaction block, then there are no tuples in the table that
> could be in-doubt anymore.  Neither are there any snapshots in our
> session that could see any dead tuples.  Nor do we give a fig what
> other sessions might think of those tuples.  So we could just set
> the xmin cutoff as aggressively as possible, which is to say
> equal to the nextXid counter.  While vacuuming a temp table is
> perhaps not something people do very often, I think when they do
> do it they would like us to clean out all the dead tuples not just
> some.

That seems like a good idea.

I've been toying with a patch that introduces more smarts about when a
row is removable, by looking more closely whether a specific row
versions are visible (e.g. in the common case of one old snapshot and
lots of newer rows). But that's orders of magnitude more complicated. So
going for something as simple as this seems like a good idea.

> Hence I propose 0001 attached.  80% of it is just API additions to allow
> passing down the isTopLevel flag so that we can do the right thing in
> the CLUSTER case; the important change is in vacuum_set_xid_limits.
> (For ease of review, I didn't reindent the existing code in
> vacuum_set_xid_limits, but that would be something to do before commit.)

I did wonder for a moment if it could be worthwhile to just implement
this for VACUUM and leave CLUSTER alone, to avoid having to deal with
the toplevel stuff. But I think it's better to be consistent.

But now I do wonder why we need to know whether the command is top level
or not? Why isn't the correct thing to instead look at what the current
backend's xmin is? Seems like you could just replace
*oldestXmin = XidFromFullTransactionId(ReadNextFullTransactionId());
with
*oldestXmin = MyProc->xmin;
Assert(TransactionIdIsValid(*oldestXmin));

and you'd not need to care about whether the CLUSTER is top-level or
not? Without, I think, loosing any aggressiveness in the top-level case?
Perhaps even worth moving this logic into
GetOldestNonRemovableTransactionId().

I know you already pushed this, but I vote for revising it this way if
you don't see an issue?

Greetings,

Andres Freund

VACUUM (INTERRUPTIBLE)?

Hi,

Jeff Janes in [1] found that I ([2]) broke autovacuum cancellations. It
obviously would be good to add a test for this, but it seems hard to
have that be reliable given how long it can take for autovacuum actually
get around to vacuum a specific table.

That made me wonder if it'd be worthwhile to add an option that'd make
user invoked VACUUM be interruptible by conflicting lock requests, just
like autovacuum is.

That's actually something I've wanted, and see other people want, a
couple times. Sometimes one wants to vacuum a table, but not block out
more important things like DDL. Which isn't really possible right now.

So how about adding an INTERRUPTIBLE option to VACUUM and ANALYZE?

Implementation wise it seems like the best way to implement this would
be to replace PROC_VACUUM_FOR_WRAPAROUND with
PROC_INTERRUPTIBLE_VACANALYZE or such (i.e. inverting the meaning).

One question I'm a bit split on is whether we'd want to continue
restricting the signalling in ProcSleep() to commands running VACUUM or
ANALYZE.

We could have a generic PROC_INTERRUPTIBLE_COMMAND or such, and have
deadlock.c / proc.c trigger whether to kill based on just that, without
looking at PROC_IS_AUTOVACUUM / PROC_IN_VACUUM.

Alternatively, if we do want to restrict it to VACUUM and ANALYZE, we'd
have to re-introduce PROC_IN_ANALYZE ;). After 12 years of not being
used and removed just weeks ago...

Greetings,

Andres Freund

[1] https://postgr.es/m/20200827213506.4baaanygq62q7pcj%40alap3.anarazel.de
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d72f6c750

Re: Autovac cancellation is broken in v14

Hi,

On 2020-08-27 14:35:06 -0700, Andres Freund wrote:
> On 2020-08-27 16:20:30 -0400, Jeff Janes wrote:
> > On Thu, Aug 27, 2020 at 3:10 PM Jeff Janes  wrote:
> > 
> > > If I create a large table with "CREATE TABLE ... AS SELECT ... from
> > > generate_series(1,3e7)" with no explicit transactions, then once it is 
> > > done
> > > I wait for autovac to kick in, then when I try to build an index on that
> > > table (or drop the table) the autovac doesn't go away on its own.
> > >
> > 
> > After a bit more poking at this, I think we are checking if we ourselves
> > are an autovac process, not doing the intended check of whether the other
> > guy is one.
> 
> Ugh, good catch.

Pushed the fix.


> > Where would be a good spot to add a regression test for this?
> > "isolation_regression" ?
> 
> I'm not immediately sure how we could write a good test for this,
> particularly not in the isolation tests. We'd basically have to make
> sure that a table needs autovacuuming, then sleep for long enough for
> autovacuum to have come around, and block autovacuum from making
> progress. That latter is doable by holding a pin on a page it needs to
> freeze, e.g. using a cursor.  I suspect all of that would at least
> require a TAP test, and might still be too fragile.

Perhaps the easiest way for this would be to have an option to have
manual VACUUMs be interruptible by other backends. That seems like a
useful option anyway? I'll start a new thread.

Greetings,

Andres Freund

Re: Yet another fast GiST build

2020-09-08 Thread Pavel Borisov

>
> > Thanks! Did you measure the quality of the built index somehow? The
> > ordering shouldn't make any difference to the build speed, but it
> > affects the shape of the resulting index and the speed of queries
> > against it.


Again I've tried random select tests near axes and haven't noticed any
performance difference between ordinary gist build and z-ordered one. The
same is for selects far from axes. Theoretically, there may be a possible
slowdown for particular points inside the MBR which crosses the axis but I
haven't tried to dig so deep and haven't tested performance as a function
of coordinate.

So I feel this patch is not about select speed optimization.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: [Patch] ALTER SYSTEM READ ONLY

Hi,

On 2020-08-28 15:53:29 -0400, Robert Haas wrote:
> On Wed, Aug 19, 2020 at 6:28 AM Amul Sul  wrote:
> > Attached is a rebased on top of the latest master head (# 3e98c0bafb2).
>
> Does anyone, especially anyone named Andres Freund, have comments on
> 0001? That work is somewhat independent of the rest of this patch set
> from a theoretical point of view, and it seems like if nobody sees a
> problem with the line of attack there, it would make sense to go ahead
> and commit that part.

It'd be easier to review the proposed commit if it included reasoning
about the change...

In particular, it looks to me like the commit actually implements two
different changes:
1) Allow a barrier function to "reject" a set barrier, because it can't
   be set in that moment
2) Allow barrier functions to raise errors

and there's not much of an explanation as to why (probably somewhere
upthread, but ...)



 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 
0);

/*
-* Process each type of barrier. It's important that nothing we call 
from
-* here throws an error, because pss_barrierCheckMask has already been
-* cleared. If we jumped out of here before processing all barrier 
types,
-* then we'd forget about the need to do so later.
-*
-* NB: It ought to be OK to call the barrier-processing functions
-* unconditionally, but it's more efficient to call only the ones that
-* might need us to do something based on the flags.
+* If there are no flags set, then we can skip doing any real work.
+* Otherwise, establish a PG_TRY block, so that we don't lose track of
+* which types of barrier processing are needed if an ERROR occurs.
 */
-   if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-   ProcessBarrierPlaceholder();
+   if (flags != 0)
+   {
+   PG_TRY();
+   {
+   /*
+* Process each type of barrier. The barrier-processing 
functions
+* should normally return true, but may return false if 
the barrier
+* can't be absorbed at the current time. This should 
be rare,
+* because it's pretty expensive.  Every single
+* CHECK_FOR_INTERRUPTS() will return here until we 
manage to
+* absorb the barrier, and that cost will add up in a 
hurry.
+*
+* NB: It ought to be OK to call the barrier-processing 
functions
+* unconditionally, but it's more efficient to call 
only the ones
+* that might need us to do something based on the 
flags.
+*/
+   if (BARRIER_SHOULD_CHECK(flags, 
PROCSIGNAL_BARRIER_PLACEHOLDER)
+   && ProcessBarrierPlaceholder())
+   BARRIER_CLEAR_BIT(flags, 
PROCSIGNAL_BARRIER_PLACEHOLDER);

This pattern seems like it'll get unwieldy with more than one barrier
type. And won't flag "unhandled" barrier types either (already the case,
I know). We could go for something like:

while (flags != 0)
{
barrier_bit = pg_rightmost_one_pos32(flags);
barrier_type = 1 >> barrier_bit;

switch (barrier_type)
{
case PROCSIGNAL_BARRIER_PLACEHOLDER:
processed = ProcessBarrierPlaceholder();
}

if (processed)
BARRIER_CLEAR_BIT(flags, barrier_type);
}

But perhaps that's too complicated?

+   }
+   PG_CATCH();
+   {
+   /*
+* If an ERROR occurred, add any flags that weren't yet 
handled
+* back into pss_barrierCheckMask, and reset the global 
variables
+* so that we try again the next time we check for 
interrupts.
+*/
+   
pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+  flags);

For this to be correct, wouldn't flags need to be volatile? Otherwise
this might use a register value for flags, which might not contain the
correct value at this point.

Perhaps a comment explaining why we have to clear bits first would be
good?

+   ProcSignalBarrierPending = true;
+   InterruptPending = true;
+
+   PG_RE_THROW();
+   }
+   PG_END_TRY();


+   /*
+* If some barrier was not successfully absorbed, we will have 
to try
+* again later.
+*/
+   if (flags != 0)
+   {
+   
pg_atomic_fetc

Re: Rare deadlock failure in create_am test

On 2020-09-04 12:41:23 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > On 2020-Sep-03, Tom Lane wrote:
> >> I'm inclined to think that the best fix is to put
> >> 
> >> begin;
> >> lock table [fast_emp4000];
> >> ...
> >> commit;
> >> 
> >> around the DROP CASCADE.
> 
> > Yeah, sounds good.
> 
> Done.

Thanks!

Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-08 16:44:17 +1200, Thomas Munro wrote:
> On Tue, Sep 8, 2020 at 4:11 PM Andres Freund  wrote:
> > At first I was very confused as to why none of the existing tests have
> > found this significant issue. But after thinking about it for a minute
> > that's because they all use psql, and largely separate psql invocations
> > for each query :(. Which means that there's no cached snapshot around...
> 
> I prototyped a TAP test patch that could maybe do the sort of thing
> you need, in patch 0006 over at [1].  Later versions of that patch set
> dropped it, because I figured out how to use the isolation tester
> instead, but I guess you can't do that for a standby test (at least
> not until someone teaches the isolation tester to support multi-node
> schedules, something that would be extremely useful...).

Unfortunately proper multi-node isolationtester test basically is
equivalent to building a global lock graph. I think, at least? Including
a need to be able to correlate connections with their locks between the
nodes.

But for something like the bug at hand it'd probably sufficient to just
"hack" something with dblink. In session 1) insert a row on the primary
using dblink, return the LSN, wait for the LSN to have replicated and
finally in session 2) check for row visibility.

Greetings,

Andres Freund

Re: Optimising compactify_tuples()

2020-09-08 Thread Thomas Munro

On Wed, Sep 9, 2020 at 3:47 AM David Rowley  wrote:
> On Tue, 8 Sep 2020 at 12:08, Thomas Munro  wrote:
> > One thought is that if we're going to copy everything out and back in
> > again, we might want to consider doing it in a
> > memory-prefetcher-friendly order.  Would it be a good idea to
> > rearrange the tuples to match line pointer order, so that the copying
> > work and also later sequential scans are in a forward direction?
>
> That's an interesting idea but wouldn't that require both the copy to
> the separate buffer *and* a qsort? That's the worst of both
> implementations. We'd need some other data structure too in order to
> get the index of the sorted array by reverse lineitem point, which
> might require an additional array and an additional sort.

Well I may not have had enough coffee yet but I thought you'd just
have to spin though the item IDs twice.  Once to compute sum(lp_len)
so you can compute the new pd_upper, and the second time to copy the
tuples from their random locations on the temporary page to new
sequential locations, so that afterwards item ID order matches offset
order.

Re: PROC_IN_ANALYZE stillborn 13 years ago

On Sat, Aug 29, 2020 at 8:06 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > I pushed despite the objection because it seemed that downstream
> > discussion was largely favorable to the change, and there's a different
> > proposal to solve the bloat problem for analyze; and also:
>
> Note that this quasi-related patch has pretty thoroughly hijacked
> the CF entry for James' original docs patch proposal.  The cfbot
> thinks that that's the latest patch in the original thread, and
> unsurprisingly is failing to apply it.
>
> Since the discussion was all over the place, I'm not sure whether
> there's still a live docs patch proposal or not; but if so, somebody
> should repost that patch (and go back to the original thread title).

I replied to the original email thread with reposted patches.

James

Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-08 Thread Jeff Davis

On Sat, 2020-09-05 at 12:03 -0700, Peter Geoghegan wrote:
> We should totally disable the preallocation stuff for external sorts
> in any case. External sorts are naturally characterized by relatively
> large, distinct batching of reads and writes -- preallocation cannot
> help.

Patch attached to disable preallocation for Sort.

I'm still looking into the other concerns.

Regards,
Jeff Davis

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 9776263ae75..f74d4841f17 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2882,7 +2882,7 @@ hashagg_tapeinfo_init(AggState *aggstate)
 	HashTapeInfo *tapeinfo = palloc(sizeof(HashTapeInfo));
 	int			init_tapes = 16;	/* expanded dynamically */
 
-	tapeinfo->tapeset = LogicalTapeSetCreate(init_tapes, NULL, NULL, -1);
+	tapeinfo->tapeset = LogicalTapeSetCreate(init_tapes, true, NULL, NULL, -1);
 	tapeinfo->ntapes = init_tapes;
 	tapeinfo->nfreetapes = init_tapes;
 	tapeinfo->freetapes_alloc = init_tapes;
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index bbb01f6d337..8ec224b62de 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -212,6 +212,7 @@ struct LogicalTapeSet
 	long	   *freeBlocks;		/* resizable array holding minheap */
 	long		nFreeBlocks;	/* # of currently free blocks */
 	Size		freeBlocksLen;	/* current allocated length of freeBlocks[] */
+	bool		enable_prealloc;	/* preallocate write blocks? */
 
 	/* The array of logical tapes. */
 	int			nTapes;			/* # of logical tapes in set */
@@ -220,6 +221,7 @@ struct LogicalTapeSet
 
 static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
 static void ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
+static long ltsGetBlock(LogicalTapeSet *lts, LogicalTape *lt);
 static long ltsGetFreeBlock(LogicalTapeSet *lts);
 static long ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt);
 static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
@@ -373,8 +375,20 @@ parent_offset(unsigned long i)
 }
 
 /*
- * Select the lowest currently unused block by taking the first element from
- * the freelist min heap.
+ * Get the next block for writing.
+ */
+static long
+ltsGetBlock(LogicalTapeSet *lts, LogicalTape *lt)
+{
+	if (lts->enable_prealloc)
+		return ltsGetPreallocBlock(lts, lt);
+	else
+		return ltsGetFreeBlock(lts);
+}
+
+/*
+ * Select the lowest currently unused block from the tape set's global free
+ * list min heap.
  */
 static long
 ltsGetFreeBlock(LogicalTapeSet *lts)
@@ -430,7 +444,8 @@ ltsGetFreeBlock(LogicalTapeSet *lts)
 
 /*
  * Return the lowest free block number from the tape's preallocation list.
- * Refill the preallocation list if necessary.
+ * Refill the preallocation list with blocks from the tape set's free list if
+ * necessary.
  */
 static long
 ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt)
@@ -671,8 +686,8 @@ ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt)
  * infrastructure that may be lifted in the future.
  */
 LogicalTapeSet *
-LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
-	 int worker)
+LogicalTapeSetCreate(int ntapes, bool preallocate, TapeShare *shared,
+	 SharedFileSet *fileset, int worker)
 {
 	LogicalTapeSet *lts;
 	int			i;
@@ -689,6 +704,7 @@ LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
 	lts->freeBlocksLen = 32;	/* reasonable initial guess */
 	lts->freeBlocks = (long *) palloc(lts->freeBlocksLen * sizeof(long));
 	lts->nFreeBlocks = 0;
+	lts->enable_prealloc = preallocate;
 	lts->nTapes = ntapes;
 	lts->tapes = (LogicalTape *) palloc(ntapes * sizeof(LogicalTape));
 
@@ -782,7 +798,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		Assert(lt->firstBlockNumber == -1);
 		Assert(lt->pos == 0);
 
-		lt->curBlockNumber = ltsGetPreallocBlock(lts, lt);
+		lt->curBlockNumber = ltsGetBlock(lts, lt);
 		lt->firstBlockNumber = lt->curBlockNumber;
 
 		TapeBlockGetTrailer(lt->buffer)->prev = -1L;
@@ -806,7 +822,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 			 * First allocate the next block, so that we can store it in the
 			 * 'next' pointer of this block.
 			 */
-			nextBlockNumber = ltsGetPreallocBlock(lts, lt);
+			nextBlockNumber = ltsGetBlock(lts, lt);
 
 			/* set the next-pointer and dump the current block. */
 			TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3c49476483b..cbda911f465 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2591,7 +2591,7 @@ inittapes(Tuplesortstate *state, bool mergeruns)
 	/* Create the tape set and allocate the per-tape data arrays */
 	inittapestate(state, maxTapes);
 	state->tapeset =
-		LogicalTapeSetCreate(maxTapes, NULL,
+		LogicalTapeSetCreate(maxTapes, false, NULL,
 			 state->shared ? &state->shared->files

Re: [DOC] Document concurrent index builds waiting on each other

On Fri, Jul 31, 2020 at 2:51 PM James Coleman  wrote:
>
> On Thu, Jul 16, 2020 at 7:34 PM David Johnston
>  wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  not tested
> > Implements feature:   not tested
> > Spec compliant:   not tested
> > Documentation:tested, passed
> >
> > James,
> >
> > I'm on board with the point of pointing out explicitly the "concurrent 
> > index builds on multiple tables at the same time will not return on any one 
> > table until all have completed", with back-patching.  I do not believe the 
> > new paragraph is necessary though.  I'd suggest trying to weave it into the 
> > existing paragraph ending "Even then, however, the index may not be 
> > immediately usable for queries: in the worst case, it cannot be used as 
> > long as transactions exist that predate the start of the index build."  
> > Adding "Notably, " in front of the existing sentence fragment above and 
> > tacking it onto the end probably suffices.
>
> I'm not sure "the index may not be immediately usable for queries" is
> really accurate/sufficient: it seems to imply the CREATE INDEX has
> returned but for some reason the index isn't yet valid. The issue I'm
> trying to describe here is that the CREATE INDEX query itself will not
> return until all preceding queries have completed *including*
> concurrent index creations on unrelated tables.
>
> > I don't actually don't whether this is true behavior though.  Is it 
> > something our tests do, or could, demonstrate?
>
> It'd take tests that exercise parallelism, but it's pretty simple to
> demonstrate (but you do have to catch the first index build in a scan
> phase, so you either need lots of data or a hack). Here's an example
> that uses a bit of a hack to simulate a slow scan phase:
>
> Setup:
> create table items(i int);
> create table others(i int);
> create function slow_expr() returns text as $$ select pg_sleep(15);
> select '5'; $$ language sql immutable;
> insert into items(i) values (1), (2);
> insert into others(i) values (1), (2);
>
> Then the following in order:
> 1. In session A: create index concurrently on items((i::text || slow_expr()));
> 2. In session B (at the same time): create index concurrently on others(i);
>
> You'll notice that the 2nd command, which should be practically
> instantaneous, waits on the first ~30s scan phase of (1) before it
> returns. The same is true if after (2) completes you immediately run
> it again -- it waits on the second ~30s scan phase of (1).
>
> That does reveal a bit of complexity though that that the current
> patch doesn't address, which is that this can be phase dependent (and
> that complexity gets a lot more non-obvious when there's real live
> activity (particularly long-running transactions) in the system as
> well.
>
> I've attached a new patch series with two items:
> 1. A simpler (and I believe more correct) doc changes for "cic blocks
> cic on other tables".
> 2. A patch to document that all index builds can prevent tuples from
> being vacuumed away on other tables.
>
> If it's preferable we could commit the first and discuss the second
> separately, but since that limitation was also discussed up-thread, I
> decided to include them both here for now.

Álvaro's patch confused the current state of this thread, so I'm
reattaching (rebased) v2 as v3.

James


v3-0002-Document-vacuum-on-one-table-depending-on-concurr.patch
Description: Binary data


v3-0001-Document-concurrent-indexes-waiting-on-each-other.patch
Description: Binary data

Re: Nicer error when connecting to standby with hot_standby=off

On Tue, Aug 18, 2020 at 12:25 PM Fujii Masao
 wrote:
> Thanks for updating the patch! But it failed to be applied to the master 
> branch
> cleanly because of the recent commit 0038f94387. So could you update the patch
> again?

Updated patch attached.

James


v3-0001-Improve-standby-connection-denied-error-message.patch
Description: Binary data

Re: Logical Replication - detail message with names of missing columns

2020-09-08 Thread Bharath Rupireddy

Thanks for the comments. Attaching the v3 patch.

On Tue, Sep 8, 2020 at 8:19 PM Alvaro Herrera  wrote:
>
> This looks a bit fiddly.  Would it be less cumbersome to use
> quote_identifier here instead?
>

Changed. quote_identifier() adds quotes wherever necessary.

>
> Please do use errmsg_plural -- have the new function return the number
> of missing columns.  Should be pretty straightforward.
>

Changed. Now the error message looks as below:

CREATE TABLE test_tbl1(a1 int, b1 text, c1 int, d1 real, e1 int);
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated column:c1
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:b1,c1
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:b1,c1,e1

CREATE TABLE test_tbl1("!A1" int, "%b1" text, "@C1" int, d1 real, E1 int);
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated column:"@C1"
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"@C1",d1
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"!A1","@C1",d1
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"!A1","@C1",d1,e1
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"!A1","%b1","@C1",d1,e1

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v3-0001-Detail-message-with-names-of-missing-columns-in-l.patch
Description: Binary data

Re: Global snapshots

2020-09-08 Thread Alexey Kondratov

On 2020-09-08 14:48, Fujii Masao wrote:

On 2020/09/08 19:36, Alexey Kondratov wrote:

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:

Attached is a patch, which implements a plain 2PC in the
postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it
solves these errors handling issues above and tries to add proper
comments everywhere. I think, that 0003 should be rebased on the top
of it, or it could be a first patch in the set, since it may be used
independently. What do you think?

Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any
thoughts

about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com

Thank you for the link!

After a quick look on the Sawada-san's patch set I think that there
are two major differences:

Thanks for sharing your thought! As far as I read your patch quickly,
I basically agree with your this view.

1. There is a built-in foreign xacts resolver in the [1], which should
be much more convenient from the end-user perspective. It involves
huge in-core changes and additional complexity that is of course worth
of.

However, it's still not clear for me that it is possible to resolve
all foreign prepared xacts on the Postgres' own side with a 100%
guarantee. Imagine a situation when the coordinator node is actually a
HA cluster group (primary + sync + async replica) and it failed just
after PREPARE stage of after local COMMIT. In that case all foreign
xacts will be left in the prepared state. After failover process
complete synchronous replica will become a new primary. Would it have
all required info to properly resolve orphan prepared xacts?

IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
the special function for manual resolution, so there may be some cases
where manual resolution is necessary.

I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change
to
+rollback, therefore at this time some participants might be prepared
whereas
+some are not prepared. The former foreign transactions need to be
resolved
+using pg_resolve_foreign_xact() manually and the latter ends
transaction

+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear for me.

According to the Sawada-san's v25 0002 the logic is pretty much the same
there:

+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction()
in the CommitTransaction(). Thus, I don't see many difference between
these approach and CallXactCallbacks() usage regarding this point.

Is this safe? This issue happens
because COMMIT PREPARED is issued via
CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
is called after ProcArrayEndTransaction().

Once the transaction is committed locally any ERROR (or higher level
message) will be escalated to PANIC. And I do see possible ERROR level
messages in the postgresCommitForeignTransaction() for example:

+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ ereport(ERROR, (errmsg("could not commit transaction on server
%s",
+
frstate->server->servername)));

I don't think that it's very convenient to get a PANIC every time we
failed to commit one of the prepared foreign xacts, since it could be
not so rare in the distributed system. That's why I tried to get rid of
possible ERRORs as far as possible in my proposed patch.

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Re: Optimising compactify_tuples()

2020-09-08 Thread David Rowley

On Mon, 7 Sep 2020 at 19:47, David Rowley wrote:
> I wondered if we could get around that just by having another buffer
> somewhere and memcpy the tuples into that first then copy the tuples
> out that buffer back into the page. No need to worry about the order
> we do that in as there's no chance to overwrite memory belonging to
> other tuples.
>
> Doing that gives me 79.166 seconds in the same recovery test. Or about
> 46% faster, instead of 22% (I mistakenly wrote 28% yesterday)

I did some more thinking about this and considered if there's a way to
just get rid of the sorting version of compactify_tuples() completely.
In the version from yesterday, I fell back on the sort version for
when more than half the tuples from the page were being pruned. I'd
thought that in this case copying out *all* of the page from pd_upper
up to the pd_special (the tuple portion of the page) would maybe be
more costly since that would include (needlessly) copying all the
pruned tuples too. The sort also becomes cheaper in that case since
the number of items to sort is less, hence I thought it was a good
idea to keep the old version for some cases. However, I now think we
can just fix this by conditionally copying all tuples when in 1 big
memcpy when not many tuples have been pruned and when more tuples are
pruned we can just do individual memcpys into the separate buffer.

I wrote a little .c program to try to figure out of there's some good
cut off point to where one method becomes better than the other and I
find that generally if we're pruning away about 75% of tuples then
doing a memcpy() per non-pruned tuple is faster, otherwise, it seems
better just to copy the entire tuple area of the page. See attached
compact_test.c

I ran this and charted the cut off at (nitems < totaltups / 4) and
(nitems < totaltups / 2), and nitems < 16)
./compact_test 32 192
./compact_test 64 96
./compact_test 128 48
./compact_test 256 24
./compact_test 512 12

The / 4 one gives me the graphs with the smallest step when the method
switches. See attached 48_tuples_per_page.png for comparison.

I've so far come up with the attached
compactify_tuples_dgr_v2.patch.txt. Thomas pointed out to me off-list
that using PGAlignedBlock is the general way to allocate a page-sized
data on the stack. I'm still classing this patch as PoC grade. I'll
need to look a bit harder at the correctness of it all.

I did spend quite a bit of time trying to find a case where this is
slower than master's version. I can't find a case where there's any
noticeable slowdown. Using the same script from [1] I tried a few
variations of the t1 table by adding an additional column to pad out
the tuple to make it wider. Obviously a wider tuple means fewer
tuples on the page so less tuples for master's qsort to sort during
compactify_tuples(). I did manage to squeeze a bit more performance
out of the test cases. Yesterday I got 79.166 seconds. This version
gets me 76.623 seconds.

Here are the results of the various tuple widths:

narrow width row test: insert into t1 select x,0 from
generate_series(1,1000) x; (32 byte tuples)

patched: 76.623
master: 137.593

medium width row test: insert into t1 select x,0,md5(x::text) ||
md5((x+1)::Text) from generate_series(1,1000) x; (about 64 byte
tuples)

patched: 64.411
master: 95.576

wide row test: insert into t1 select x,0,(select
string_agg(md5(y::text),'') from generate_Series(x,x+30) y) from
generate_series(1,100)x; (1024 byte tuples)

patched: 88.653
master: 90.077

Changing the test so instead of having 10 million rows in the table
and updating a random row 12 million times. I put just 10 rows in the
table and updated them 12 million times. This results in
compactify_tuples() pruning all but 1 row (since autovac can't keep up
with this, each row does end up on a page by itself). I wanted to
ensure I didn't regress a workload that master's qsort() version would
have done very well at. qsorting 1 element is pretty fast.

10-row narrow test:

patched: 10.633 <--- small regression
master: 10.366

I could special case this and do a memmove without copying the tuple
to another buffer, but I don't think the slowdown is enough to warrant
having such a special case.

Another thing I tried was to instead of compacting the page in
compactify_tuples(), I just get rid of that function and did the
compacting in the existing loop in PageRepairFragmentation(). This
does require changing the ERROR check to a PANIC since we may have
already started shuffling tuples around when we find the corrupted
line pointer. However, I was just unable to make this faster than the
attached version. I'm still surprised at this as I can completely get
rid of the itemidbase array. The best run-time I got with this method
out the original test was 86 seconds, so 10 seconds slower than what
the attached can do. So I threw that idea away.

David

[1]
https://www.postgresql.org/message-id/caaphdvokwqazhiuxet8jsqupjkdph8dnuzdfusx9p7dxrjd...@ma

Re: Optimising compactify_tuples()

2020-09-08 Thread David Rowley

On Tue, 8 Sep 2020 at 12:08, Thomas Munro  wrote:
>
> One thought is that if we're going to copy everything out and back in
> again, we might want to consider doing it in a
> memory-prefetcher-friendly order.  Would it be a good idea to
> rearrange the tuples to match line pointer order, so that the copying
> work and also later sequential scans are in a forward direction?

That's an interesting idea but wouldn't that require both the copy to
the separate buffer *and* a qsort? That's the worst of both
implementations. We'd need some other data structure too in order to
get the index of the sorted array by reverse lineitem point, which
might require an additional array and an additional sort.

> The
> copying could also perhaps be done with single memcpy() for ranges of
> adjacent tuples.

I wonder if the additional code required to check for that would be
cheaper than the additional function call. If it was then it might be
worth trying, but since the tuples can be in any random order then
it's perhaps not likely to pay off that often.  I'm not really sure
how often adjacent line items will also be neighbouring tuples for
pages we call compactify_tuples() for.  It's certainly going to be
common with INSERT only tables, but if we're calling
compactify_tuples() then it's not read-only.

> Another thought is that it might be possible to
> identify some easy cases that it can handle with an alternative
> in-place shifting algorithm without having to go to the
> copy-out-and-back-in path.  For example, when the offset order already
> matches line pointer order but some dead tuples just need to be
> squeezed out by shifting ranges of adjacent tuples, and maybe some
> slightly more complicated cases, but nothing requiring hard work like
> sorting.

It's likely worth experimenting.  The only thing is that the workload
I'm using seems to end up with the tuples with line items not in the
same order as the tuple offset.  So adding a precheck to check the
ordering will regress the test I'm doing. We'd need to see if there is
any other workload that would keep the tuples more in order then
determine how likely that is to occur in the real world.

> > (I don't want to derail the sort improvements here. I happen to think
> > those are quite important improvements, so I'll continue to review
> > that patch still.  Longer term, we might just end up with something
> > slightly different for compactify_tuples)
>
> Yeah.  Perhaps qsort specialisation needs to come back in a new thread
> with a new use case.

hmm, yeah, perhaps that's a better way given the subject here is about
compactify_tuples()

David

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-08 Thread Alexey Kondratov

On 2020-09-08 17:00, Amit Langote wrote:

Hi Alexey,

On Tue, Sep 8, 2020 at 6:29 PM Alexey Kondratov
 wrote:

On 2020-09-08 10:34, Amit Langote wrote:
> Any thoughts on the taking out the refactoring changes out of the main
> patch as I suggested?
>

+1 for splitting the patch. It was rather difficult for me to
distinguish changes required by COPY via postgres_fdw from this
refactoring.

Another ambiguous part of the refactoring was in changing
InitResultRelInfo() arguments:

@@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
  Relation resultRelationDesc,
  Index resultRelationIndex,
  Relation partition_root,
+ bool use_multi_insert,
  int instrument_options)

Why do we need to pass this use_multi_insert flag here? Would it be
better to set resultRelInfo->ri_usesMultiInsert in the
InitResultRelInfo() unconditionally like it is done for
ri_usesFdwDirectModify? And after that it will be up to the caller
whether to use multi-insert or not based on their own circumstances.
Otherwise now we have a flag to indicate that we want to check for
another flag, while this check doesn't look costly.

Hmm, I think having two flags seems confusing and bug prone,
especially if you consider partitions.  For example, if a partition's
ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, then
execPartition.c: ExecInitPartitionInfo() would wrongly perform
BeginForeignCopy() based on only ri_usesMultiInsert, because it
wouldn't know CopyFrom()'s local flag.  Am I missing something?

No, you're right. If someone want to share a state and use ResultRelInfo 
(RRI) for that purpose, then it's fine, but CopyFrom() may simply 
override RRI->ri_usesMultiInsert if needed and pass this RRI further.

This is how it's done for RRI->ri_usesFdwDirectModify. 
InitResultRelInfo() initializes it to false and then 
ExecInitModifyTable() changes the flag if needed.

Probably this is just a matter of personal choice, but for me the 
current implementation with additional argument in InitResultRelInfo() 
doesn't look completely right. Maybe because a caller now should pass an 
additional argument (as false) even if it doesn't care about 
ri_usesMultiInsert at all. It also adds additional complexity and feels 
like abstractions leaking.

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Re: Use for name of unnamed portal's memory context

2020-09-08 Thread Peter Eisentraut


On 2020-09-05 17:10, Tom Lane wrote:

Julien Rouhaud  writes:

+1 too, and obviously patch looks good.


The adjacent comment needs updating.


committed with that change

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: proposal - reference to plpgsql_check from plpgsql doc

2020-09-08 Thread Pavel Stehule

po 17. 8. 2020 v 11:03 odesílatel Pavel Stehule 
napsal:

>
>
> po 17. 8. 2020 v 10:37 odesílatel Magnus Hagander 
> napsal:
>
>>
>>
>> On Mon, Aug 17, 2020 at 8:47 AM Pavel Stehule 
>> wrote:
>>
>>> Hi
>>>
>>> plpgsql_check extension is almost complete now. This extension is
>>> available on all environments and for all supported Postgres releases. It
>>> is probably too big to be part of contrib, but I think so it can be
>>> referenced in
>>> https://www.postgresql.org/docs/current/plpgsql-development-tips.html
>>> chapter.
>>>
>>> What do you think about it?
>>>
>>>
>> Without making any valuation on this particular tool, I think we should
>> be very very careful and restrictive about putting such links in the main
>> documentation.
>>
>> The appropriate location for such references are in the product catalog
>> on the website and on the wiki. (I'd be happy to have a link from the docs
>> to a generic "pl/pgsql tips" page on the wiki, though, if people would
>> think that helpful -- because that would be linking to a destination that
>> we can easily update/fix should it go stale)
>>
>
> good idea
>

I created this page

https://wiki.postgresql.org/wiki/Tools_and_tips_for_develpment_in_PL/pgSQL_language

Now, there is just a list of available tools. Please, can somebody check it
and clean and fix my Czechisms in text?

Regards

Pavel


> Pavel
>
>
>> --
>>  Magnus Hagander
>>  Me: https://www.hagander.net/ 
>>  Work: https://www.redpill-linpro.com/ 
>>
>

Re: Change a constraint's index - ALTER TABLE ... ALTER CONSTRAINT ... USING INDEX ...

On 2020-Sep-08, Laurenz Albe wrote:

> We should at least have
> 
>   ALTER TABLE ... ADD PRIMARY KEY (id) INCLUDE (val);
> 
> or something before we consider this patch.

Agreed.

Now the trick in this new command is to let the user change the included
columns afterwards, which remains useful (since it's clearly reasonable
to change minds after applications using the constraint start to take
shape).

> As to the information_schema, that could pretend that the INCLUDE
> columns just don't exist.

Yeah, that's what I was thinking too, since for all intents and
purposes, from the standard's POV the constraint works the same
regardless of included columns.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Change a constraint's index - ALTER TABLE ... ALTER CONSTRAINT ... USING INDEX ...

2020-09-08 Thread Laurenz Albe

On Mon, 2020-09-07 at 11:42 -0300, Alvaro Herrera wrote:
> > This patch would provide a more convenient way to do that.
> > Again, I am not sure if that justifies the effort.
> 
> I have to admit I've seen cases where it'd be useful to have included
> columns in primary keys.
> 
> TBH I think if we really wanted the feature of primary keys with
> included columns, we'd have to add it to the PRIMARY KEY syntax rather
> than having an ad-hoc ALTER TABLE ALTER CONSTRAINT USING INDEX command
> to replace the index underneath.  Then things like pg_dump would work
> normally.
> 
> (I have an answer for the information_schema question Tom posed; I'd
> like to know what's yours.)

Gah, now I see my mistake.  I was under the impression that a
primary key can have an INCLUDE clause today, which is not true.

So this would introduce that feature in a weird way.
I agree that that is undesirable.

We should at least have

  ALTER TABLE ... ADD PRIMARY KEY (id) INCLUDE (val);

or something before we consider this patch.

As to the information_schema, that could pretend that the INCLUDE
columns just don't exist.

Yours,
Laurenz Albe

Re: Logical Replication - detail message with names of missing columns

On 2020-Sep-08, Bharath Rupireddy wrote:

> + /* Find the remote attributes that are missing in the local relation. */
> + for (i = 0; i < remoterel->natts; i++)
> + {
> + if (!bms_is_member(i, localattnums))
> + {
> + if (missingatts->len == 0)
> + {
> + appendStringInfoChar(missingatts, '"');
> + appendStringInfoString(missingatts, 
> remoterel->attnames[i]);
> + }
> + else if (missingatts->len > 0)
> + {
> + appendStringInfoChar(missingatts, ',');
> + appendStringInfoChar(missingatts, '"');
> + appendStringInfo(missingatts, "%s", 
> remoterel->attnames[i]);
> + }
> +
> + appendStringInfoChar(missingatts, '"');
> + }

This looks a bit fiddly.  Would it be less cumbersome to use
quote_identifier here instead?


>   ereport(ERROR,
>   
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>errmsg("logical replication target 
> relation \"%s.%s\" is missing "
> - "some replicated 
> columns",
> - remoterel->nspname, 
> remoterel->relname)));
> + "replicated columns:%s",
> + remoterel->nspname, 
> remoterel->relname,
> + missingatts.data)));

Please do use errmsg_plural -- have the new function return the number
of missing columns.  Should be pretty straightforward.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical Replication - detail message with names of missing columns

2020-09-08 Thread Bharath Rupireddy

Thanks for the comments, v2 patch is attached.

>
> On Tue, Sep 8, 2020 at 6:50 AM Kyotaro Horiguchi
>  wrote:
> >
> > +1 for objective. However, that can be done simpler way that doesn't
> > need additional loops by using bitmapset to hold missing remote
> > attribute numbers. This also make the variable "found" useless.
> >
>
> Thanks. I will look into it and post a v2 patch soon.
>

Changed.

> >
> > FWIW, I would prefer that the message be like
> >
> >  logical replication target relation "public.test_1" is missing
> >  replicated columns: "a1", "c1"
> >
>
> This looks fine, I will change that.
>

Changed. Now the error looks like as shown below:

ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"d1"
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"c1","d1"
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"a1","c1","d1"
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"a1","b1","c1","d1"
ERROR:  logical replication target relation "public.test_tbl1" is
missing replicated columns:"a1","b1","c1","d1","e1"



With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 9fbda09a015895a447bdce6683e8138350b133bd Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 8 Sep 2020 02:45:04 -0700
Subject: [PATCH v2] Detail message with names of missing columns in logical
 replication

In logical replication when a subscriber is missing some columns,
it currently emits an error message that says "some" columns are
missing(see logicalrep_rel_open()), but it doesn't specify what
the missing column names are. The comment there also says that
finding the missing column names is a todo item(/* TODO, detail
message with names of missing columns */).

This patch finds the missing columns on the subscriber relation
using the publisher relation columns and show them in the error
message which makes error to be more informative to the user.
---
 src/backend/replication/logical/relation.c | 60 --
 1 file changed, 57 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index a60c73d74d..2ecb4766a4 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -227,6 +227,53 @@ logicalrep_rel_att_by_name(LogicalRepRelation *remoterel, const char *attname)
 	return -1;
 }
 
+/*
+ * Finds the attributes that are missing in the local relation
+ * compared to remote relation.
+ */
+static void
+logicalrep_find_missing_atts(AttrMap*attrmap,
+			 LogicalRepRelation *remoterel,
+			 StringInfo missingatts)
+{
+	int			i;
+	Bitmapset  *localattnums = NULL;
+
+	/* Add local relation attnums to a bitmapset. */
+	for (i = 0; i < attrmap->maplen; i++)
+	{
+		/*
+		 * Do not add attnums with values -1 to bitmapset
+		 * as it doesn't take negative values. Attnums value
+		 * is -1 for a dropped or a generated local attribute.
+		 */
+		if (attrmap->attnums[i] >= 0)
+			localattnums = bms_add_member(localattnums, attrmap->attnums[i]);
+	}
+
+	/* Find the remote attributes that are missing in the local relation. */
+	for (i = 0; i < remoterel->natts; i++)
+	{
+		if (!bms_is_member(i, localattnums))
+		{
+			if (missingatts->len == 0)
+			{
+appendStringInfoChar(missingatts, '"');
+appendStringInfoString(missingatts, remoterel->attnames[i]);
+			}
+			else if (missingatts->len > 0)
+			{
+appendStringInfoChar(missingatts, ',');
+appendStringInfoChar(missingatts, '"');
+appendStringInfo(missingatts, "%s", remoterel->attnames[i]);
+			}
+
+			appendStringInfoChar(missingatts, '"');
+		}
+	}
+	bms_free(localattnums);
+}
+
 /*
  * Open the local relation associated with the remote one.
  *
@@ -322,13 +369,20 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 found++;
 		}
 
-		/* TODO, detail message with names of missing columns */
+		/* Report error with names of the missing localrel column(s). */
 		if (found < remoterel->natts)
+		{
+			StringInfoData missingatts;
+
+			initStringInfo(&missingatts);
+			logicalrep_find_missing_atts(entry->attrmap, remoterel, &missingatts);
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 errmsg("logical replication target relation \"%s.%s\" is missing "
-			"some replicated columns",
-			remoterel->nspname, remoterel->relname)));
+			"replicated columns:%s",
+			remoterel->nspname, remoterel->relname,
+			missingatts.data)));
+		}
 
 		/*
 		 * Check that replica identity matches. We allow for stricter replica
-- 
2.25.1

Re: default partition and concurrent attach partition

Hello Amit,

On 2020-Sep-08, Amit Langote wrote:

> On Tue, Sep 8, 2020 at 8:44 AM Alvaro Herrera  
> wrote:
> > On 2020-Sep-07, Alvaro Herrera wrote:
> >
> > > Ah, it looks like we can get away with initializing the RRI to 0, and
> > > then explicitly handle that case in ExecPartitionCheckEmitError, as in
> > > the attached (which means reindenting, but I left it alone to make it
> > > easy to read).
> 
> At this point, I think it may be clear that ri_RangeTableIndex being
> set to a dummy value isn't problematic.

Yep ... I misled myself.

> Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> TupleDesc matches the partition's.  Actually we do have such dedicated
> slots for partitions around (for both sub-partitioned and leaf
> partitions),

Yeah, that's what I was looking for.

> so we can simply use them instead of creating one from
> scratch for every use.  It did take some code rearrangement to
> actually make that work though.

Thanks.  It doesn't look too bad ... I'd say it even looks easier to
read now in terms of code structure.

> Attached is the latest patch including those changes.  Also, I have
> merged your earlier "fixes" patch and updated the isolation test to
> exercise a case with sub-partitioned default partition, as well as
> varying attribute numbers.  Patch applies as-is to both HEAD and v13
> branches, but needs slight changes to apply to v12 branch, so also
> attaching one for that branch.

Thanks, will dispatch it shortly.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Disk-based hash aggregate's cost model

2020-09-08 Thread Tomas Vondra


On Mon, Sep 07, 2020 at 01:55:28PM -0700, Jeff Davis wrote:

On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote:

I've tested the costing changes on the simplified TPC-H query, on two
different machines, and it seems like a clear improvement.


Thank you. Committed.


So yeah, the patched costing is much closer to sort (from the point
of
this cost/duration metric), although for higher work_mem values
there's
still a clear gap where the hashing seems to be under-costed by a
factor
of ~2 or more.


There seems to be a cliff right after 4MB. Perhaps lookup costs on a
larger hash table?



I assume you mean higher costs due to hash table outgrowing some sort of
CPU cache (L2/L3), right? Good guess - the CPU has ~6MB cache, but no.
This seems to be merely due to costing, because the raw cost/duration
looks like this:

 work_mem   costduration
-
  1MB   20627403  216861
  2MB   15939722  178237
  4MB   15939722  176296
  8MB   11252041  160479
 16MB   11252041  168304
 32MB   11252041  179567
 64MB   11252041  189410
256MB   11252041  204206

This is unpatched master, with the costing patch it looks similar except
that the cost is about 2x higher. On the SATA RAID machine, it looks
like this:

 work_mem costduration
---
  1MB108358461 1147269
  2MB 77381688 1004895
  4MB 77381688  994853
  8MB 77381688  980071
 16MB 46404915  930511
 32MB 46404915  902167
 64MB 46404915  908757
256MB 46404915  926862

So roughly the same - the cost drops to less than 50%, but the duration
really does not. This is what I referred to when I said "Not sure if we
need/should tweak the costing to reduce the effect of work_mem (on
hashagg)."

For sort this seems to behave a bit more nicely - the cost and duration
(with increasing work_mem) are correlated quite well, I think.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Automatic HASH and LIST partition creation

2020-09-08 Thread Pavel Borisov

> The patch lacks documentation, because I expect some details may change
> during discussion. Other than that, the feature is ready for review.
>
Hi, hackers!

>From what I've read I see there is much interest in automatic partitions
creation. (Overall discussion on the topic is partitioned into two threads:
(1)
https://www.postgresql.org/message-id/alpine.DEB.2.21.1907150711080.22273%40lancre
and
(2)
https://www.postgresql.org/message-id/flat/7fec3abb-c663-c0d2-8452-a46141be6...@postgrespro.ru
(current thread) )

There were many syntax proposals and finally, there is a patch realizing
one of them. So I'd like to review it.

The syntax proposed in the patch seems good enough for me and is in
accordance with one of the proposals in the discussion. Maybe I'd prefer
using the word AUTOMATICALLY/AUTO instead of CONFIGURATION with explicit
meaning that using this syntax we'd get already (automatically) created
partitions and don't need to create them manually, as in the existing state
of postgresql declarative partitioning.

CREATE TABLE tbl (i int) PARTITION BY HASH (i) AUTOMATICALLY (MODULUS
3); (partitions are created automatically)

vs

CREATE TABLE tbl (i int) PARTITION BY HASH (i); (partitions should be
created manually by use of PARTITION OF)


CREATE TABLE tbl (i char) PARTITION BY LIST (i) AUTOMATICALLY (VALUES
IN ('a', 'b'), ('c', 'd'), ('e','f') DEFAULT PARTITION tbl_default);

vs

CREATE TABLE tbl (i char) PARTITION BY LIST (i); (partitions should be
created manually by use of PARTITION OF)


I think this syntax can also be extended later with adding automatic
creation of RANGE partitions, with IMMEDIATE/DEFERRED for dynamic/on-demand
automatic partition creation, and with SUBPARTITION possibility.

But I don't have a strong preference for the word AUTOMATICALLY, moreover I
saw opposition to using AUTO at the top of the discussion. I suppose we can
go with the existing CONFIGURATION word.

If compare with existing declarative partitions, I think automatic creation
simplifies the process for the end-user and  I'd vote for its committing
into Postgres. The patch is short and clean in code style. It has enough
comments Tests covering the new functionality are included. Yet it doesn't
have documentation and I'd suppose it's worth adding it. Even if there will
be syntax changes, I hope they will not be more than the replacement of
several words. Current syntax is described in the text of a patch.

The patch applies cleanly and installcheck-world is passed.

Some minor things:

I've got a compiler warning:
parse_utilcmd.c:4280:15: warning: unused variable 'lc' [-Wunused-variable]

When the number of partitions is over the maximum value of int32 the output
shows a generic syntax error. I don't think it is very important as it is
not the case someone will make deliberately, but maybe it's better to
output something like "Partitions number is more than the maximum supported
value"
create table test (i int, t text) partition by hash (i) configuration
(modulus );
ERROR:  syntax error at or near ""

One more piece of nitpicking. Probably we can go just with a mention in
documentation.
create table test (i int, t text) partition by hash (i) configuration
(modulus );
ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.

Typo:
+ /* Add statemets to create each partition after we create parent table */

Overall I see the patch almost ready for commit and I'd like to meet this
functionality in v14.

Tested it and see this feature very cool and much simpler to use compared
to declarative partitioning to date.

Thanks!
-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-08 Thread Amit Langote

Hi Alexey,

On Tue, Sep 8, 2020 at 6:29 PM Alexey Kondratov
 wrote:
> On 2020-09-08 10:34, Amit Langote wrote:
> > Any thoughts on the taking out the refactoring changes out of the main
> > patch as I suggested?
> >
>
> +1 for splitting the patch. It was rather difficult for me to
> distinguish changes required by COPY via postgres_fdw from this
> refactoring.
>
> Another ambiguous part of the refactoring was in changing
> InitResultRelInfo() arguments:
>
> @@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
>   Relation resultRelationDesc,
>   Index resultRelationIndex,
>   Relation partition_root,
> + bool use_multi_insert,
>   int instrument_options)
>
> Why do we need to pass this use_multi_insert flag here? Would it be
> better to set resultRelInfo->ri_usesMultiInsert in the
> InitResultRelInfo() unconditionally like it is done for
> ri_usesFdwDirectModify? And after that it will be up to the caller
> whether to use multi-insert or not based on their own circumstances.
> Otherwise now we have a flag to indicate that we want to check for
> another flag, while this check doesn't look costly.

Hmm, I think having two flags seems confusing and bug prone,
especially if you consider partitions.  For example, if a partition's
ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, then
execPartition.c: ExecInitPartitionInfo() would wrongly perform
BeginForeignCopy() based on only ri_usesMultiInsert, because it
wouldn't know CopyFrom()'s local flag.  Am I missing something?

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: Auto-vectorization speeds up multiplication of large-precision numerics

Amit Khandekar  writes:
> Thanks. I must admit it did not occur to me that I could have very
> well installed clang on my linux machine and tried compiling this
> file, or tested with some older gcc versions. I think I was using gcc
> 8. Do you know what was the gcc compiler version that gave these
> warnings ?

Per the buildfarm's configure logs, prairiedog is using

configure: using compiler=powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple 
Computer, Inc. build 5341)

IIRC, locust has a newer build number but it's the same underlying gcc
version.

regards, tom lane

Re: PATCH: Attempt to make dbsize a bit more consistent

2020-09-08 Thread John Naylor

On Thu, Aug 27, 2020 at 9:39 AM  wrote:
>
> Hi all,
>
> this minor patch is attempting to force the use of the tableam api in dbsize 
> where ever it is required.
>
> Apparently something similar was introduced for toast relations only. 
> Intuitively it seems that the distinction between a table and a toast table 
> is not needed.

I suspect the reason is found in the comment for table_block_relation_size():

 * If a table AM uses the various relation forks as the sole place where data
 * is stored, and if it uses them in the expected manner (e.g. the actual data
 * is in the main fork rather than some other), it can use this implementation
 * of the relation_size callback rather than implementing its own.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Resetting spilled txn statistics in pg_stat_replication

On Tue, Sep 8, 2020 at 7:53 AM Masahiko Sawada
 wrote:
>
> On Mon, 7 Sep 2020 at 15:24, Amit Kapila  wrote:
> >
> > This patch needs a rebase. I don't see this patch in the CF app. I
> > hope you are still interested in working on this.
>
> Thank you for reviewing this patch!
>
> I'm still going to work on this patch although I might be slow
> response this month.
>

Comments on the latest patch:
=
1.
+CREATE VIEW pg_stat_replication_slots AS
+SELECT
+s.name,
+s.spill_txns,
+s.spill_count,
+s.spill_bytes,
+ s.stats_reset
+FROM pg_stat_get_replication_slots() AS s;

You forgot to update the docs for the new parameter.

2.
@@ -5187,6 +5305,12 @@ pgstat_read_statsfiles(Oid onlydb, bool
permanent, bool deep)
  for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
  slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;

+ /*
+ * Set the same reset timestamp for all replication slots too.
+ */
+ for (i = 0; i < max_replication_slots; i++)
+ replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+

I don't understand why you have removed the above code from the new
version of the patch?

3.
pgstat_recv_resetreplslotcounter()
{
..
+ ts = GetCurrentTimestamp();
+ for (i = 0; i < nReplSlotStats; i++)
+ {
+ /* reset entry with the given index, or all entries */
+ if (msg->clearall || idx == i)
+ {
+ /* reset only counters. Don't clear slot name */
+ replSlotStats[i].spill_txns = 0;
+ replSlotStats[i].spill_count = 0;
+ replSlotStats[i].spill_bytes = 0;
+ replSlotStats[i].stat_reset_timestamp = ts;
+ }
+ }
..

I don't like this coding pattern as in the worst case we need to
traverse all the slots to reset a particular slot. This could be okay
for a fixed number of elements as we have in SLRU but here it appears
quite inefficient. We can move the reset of stats part to a separate
function and then invoke it from the place where we need to reset a
particular slot and the above place.

4.
+pgstat_replslot_index(const char *name, bool create_it)
{
..
+ replSlotStats[nReplSlotStats].stat_reset_timestamp = GetCurrentTimestamp();
..
}

Why do we need to set the reset timestamp on the creation of slot entry?

5.
@@ -3170,6 +3175,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
  spilled++;
  }

+ /* update the statistics */
+ rb->spillCount += 1;
+ rb->spillBytes += size;
+
+ /* Don't consider already serialized transactions. */
+ rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;

We can't increment the spillTxns in the above way because now
sometimes we do serialize before streaming and in that case we clear
the serialized flag after streaming, see ReorderBufferTruncateTXN. So,
the count can go wrong. Another problem is currently the patch call
UpdateSpillStats only from begin_cb_wrapper which means it won't
consider streaming transactions (streaming transactions that might
have spilled). If we consider the streamed case along with it, we can
probably keep this counter up-to-date because in the above place we
can check if the txn is once serialized or streamed, we shouldn't
increment the counter. I think we need to merge Ajin's patch for
streaming stats [1] and fix the issue. I have not checked his patch so
it might need a rebase and or some changes.

6.
@@ -322,6 +321,9 @@ ReplicationSlotCreate(const char *name, bool db_specific,

  /* Let everybody know we've modified this slot */
  ConditionVariableBroadcast(&slot->active_cv);
+
+ /* Create statistics entry for the new slot */
+ pgstat_report_replslot(NameStr(slot->data.name), 0, 0, 0);
 }
..
..
@@ -683,6 +685,18 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
  ereport(WARNING,
  (errmsg("could not remove directory \"%s\"", tmppath)));

+ /*
+ * Report to drop the replication slot to stats collector.  Since there
+ * is no guarantee the order of message arrival on an UDP connection,
+ * it's possible that a message for creating a new slot arrives before a
+ * message for removing the old slot.  We send the drop message while
+ * holding ReplicationSlotAllocationLock to reduce that possibility.
+ * If the messages arrived in reverse, we would lose one statistics update
+ * message.  But the next update message will create the statistics for
+ * the replication slot.
+ */
+ pgstat_report_replslot_drop(NameStr(slot->data.name));
+

Similar to drop message, why don't we send the create message while
holding the ReplicationSlotAllocationLock?

[1] - 
https://www.postgresql.org/message-id/CAFPTHDZ8RnOovefzB%2BOMoRxLSD404WRLqWBUHe6bWqM5ew1bNA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.

Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-08 Thread Magnus Hagander

On Tue, Sep 8, 2020 at 3:11 PM Fujii Masao 
wrote:

>
>
> On 2020/09/08 19:28, Magnus Hagander wrote:
> >
> >
> > On Tue, Sep 8, 2020 at 8:10 AM Amit Kapila  > wrote:
> >
> > We use the timestamp of the global statfile if we are not able to
> > determine it for a particular database either because the entry for
> > that database doesn't exist or there is an error while reading the
> > specific database entry. This was not taken care of while reading
> > other entries like ArchiverStats or SLRUStats. See
> > pgstat_read_db_statsfile_timestamp. I have observed this while
> > reviewing Sawada-San's patch related to logical replication stats
> [1].
> >
> > I think this can only happen if due to some reason the statfile got
> > corrupt or we
> > have some bug in stats writing code, the chances of both are rare
> and even
> > if that happens we will use stale statistics.
> >
> > The attached patch by Sawada-San will fix this issue. As the chances
> of this
> > the problem is rare and nobody has reported any issue related to
> this,
> > so it might be okay not to backpatch this.
> >
> > Thoughts?
> >
> >
> > Why are we reading the archiver statis and and slru stats in
> "pgstat_read_db_statsfile_timestamp" in the first place?
>
> Maybe because they are written before database stats entries? That is,
> to read the target database stats entry and get the timestamp from it,
> we need to read (or lseek) all the global stats entries written before
> them.
>
>
Oh meh. Yeah, I'm reading this thing completely wrong :/ Ignore my comments
:)



> That seems well out of scope for that function.
> >
> > If nothing else the comment at the top of that function is out of touch
> with reality. We do seem to read it into local buffers and then ignore the
> contents. I guess we're doing it just to verify that it's not corrupt -- so
> perhaps the function should actually have a different name than it does
> now, since it certainly seems to do more than actually read the statsfile
> timestamp.
> >
> > Specifically in this patch it looks wrong -- in the case of say the SLRU
> stats being corrupt, we will now return the timestamp of the global stats
> file even if there is one existing for the database requested, which
> definitely breaks the contract of the function.
>
> Yes.
> We should return false when fread() for database entry fails, instead?
> That is,
>
> 1. If corrupted stats file is found, the function always returns false.
> 2. If the file is not currupted and the target database entry is found,
> its timestamp is returned.
> 3. If the file is not corrupted and the target is NOT found, the timestamp
> of global entry is returned
>

Yeah, with more coffee and re-reading it, I'm not sure how we could have
the database stats being OK if the archiver or slru stats are not.

That said, I think it still makes sense to return false if the stats file
is corrupt. How much can we trust the first block, if the block right after
it is corrupt? So yeah, I think your described order seems correct. But
that's also what the code already did before this patch, isn't it?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: Transactions involving multiple postgres foreign servers, take 2

On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao  wrote:
>
> #2
> Originally when there was the FDW access in the transaction,
> PREPARE TRANSACTION on that transaction failed with an error. The patch
> allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED
> even when FDW access occurs in the transaction. IMO this change can be
> applied without *automatic* 2PC feature (i.e., PREPARE TRANSACTION and
> COMMIT/ROLLBACK PREPARED are automatically executed for each FDW
> inside "top" COMMIT command). Thought?
>
> I'm not sure yet whether automatic resolution of "unresolved" prepared
> transactions by the resolver process is necessary for this change or not.
> If it's not necessary, it's better to exclude the resolver process from this
> change, at this stage, to make the patch simpler.

I agree with this. However, in case of explicit prepare, if we are not
going to try automatic resolution, it might be better to provide a way
to pass the information about transactions prepared on the foreign
servers if they can not be resolved at the time of commit so that the
user can take it up to resolve those him/herself. This was an idea
that Tom had suggested at the very beginning of the first take.

--
Best Wishes,
Ashutosh Bapat

Re: Inconsistency in determining the timestamp of the db statfile.





On 2020/09/08 19:28, Magnus Hagander wrote:



On Tue, Sep 8, 2020 at 8:10 AM Amit Kapila mailto:amit.kapil...@gmail.com>> wrote:

We use the timestamp of the global statfile if we are not able to
determine it for a particular database either because the entry for
that database doesn't exist or there is an error while reading the
specific database entry. This was not taken care of while reading
other entries like ArchiverStats or SLRUStats. See
pgstat_read_db_statsfile_timestamp. I have observed this while
reviewing Sawada-San's patch related to logical replication stats [1].

I think this can only happen if due to some reason the statfile got
corrupt or we
have some bug in stats writing code, the chances of both are rare and even
if that happens we will use stale statistics.

The attached patch by Sawada-San will fix this issue. As the chances of this
the problem is rare and nobody has reported any issue related to this,
so it might be okay not to backpatch this.

Thoughts?


Why are we reading the archiver statis and and slru stats in 
"pgstat_read_db_statsfile_timestamp" in the first place?


Maybe because they are written before database stats entries? That is,
to read the target database stats entry and get the timestamp from it,
we need to read (or lseek) all the global stats entries written before them.



That seems well out of scope for that function.

If nothing else the comment at the top of that function is out of touch with 
reality. We do seem to read it into local buffers and then ignore the contents. 
I guess we're doing it just to verify that it's not corrupt -- so perhaps the 
function should actually have a different name than it does now, since it 
certainly seems to do more than actually read the statsfile timestamp.

Specifically in this patch it looks wrong -- in the case of say the SLRU stats 
being corrupt, we will now return the timestamp of the global stats file even 
if there is one existing for the database requested, which definitely breaks 
the contract of the function.


Yes.
We should return false when fread() for database entry fails, instead? That is,

1. If corrupted stats file is found, the function always returns false.
2. If the file is not currupted and the target database entry is found, its 
timestamp is returned.
3. If the file is not corrupted and the target is NOT found, the timestamp of 
global entry is returned.

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-09-08 Thread Bharath Rupireddy

On Mon, Sep 7, 2020 at 8:50 PM Tom Lane  wrote:
>
> I think there is agreement that we're not going to change
> cancel_before_shmem_exit's restriction to only allow LIFO popping.
> So we should improve its comment to explain why.  The other thing
> that seems legitimately on-the-table for this CF entry is whether
> we should change cancel_before_shmem_exit to complain, rather than
> silently do nothing, if it fails to pop the stack.  Bharath's
> last patchset proposed to add an elog(DEBUG3) complaint, which
> seems to me to be just about entirely useless.  I'd make it an
> ERROR, or maybe an Assert.
>

Attaching a patch with both the comments modification and changing
DEBUG3 to ERROR. make check and make world-check passes on this patch.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 45cb0ada52eba25ce83985cc23edab0d34be9e4a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 7 Sep 2020 23:46:26 -0700
Subject: [PATCH v4] Modify cancel_before_shmem_exit() comments

Currently, there is no mention of safe usage of cancel_before_shmem_exit()
in the function comment and also it has a mis-directing point that, there
is still scope for improving the way cancel_before_shmem_exit() looks
for callback to be removed from before_shmem_exit_list. So, this patch
modifies the comment to reflect how the caller needs to use
before_shmem_exit_list mechanism.

This patch also adds an error in case the before_shmem_exit
function is not found at the latest entry.
---
 src/backend/storage/ipc/ipc.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index bdbc2c3ac4..3c2c30c189 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -381,9 +381,9 @@ on_shmem_exit(pg_on_exit_callback function, Datum arg)
  *		cancel_before_shmem_exit
  *
  *		this function removes a previously-registered before_shmem_exit
- *		callback.  For simplicity, only the latest entry can be
- *		removed.  (We could work harder but there is no need for
- *		current uses.)
+ *		callback.  We only look at the latest entry for removal, as we
+ * 		expect callers to add and remove temporary before_shmem_exit
+ * 		callbacks in strict LIFO order.
  * 
  */
 void
@@ -393,7 +393,18 @@ cancel_before_shmem_exit(pg_on_exit_callback function, Datum arg)
 		before_shmem_exit_list[before_shmem_exit_index - 1].function
 		== function &&
 		before_shmem_exit_list[before_shmem_exit_index - 1].arg == arg)
+	{
 		--before_shmem_exit_index;
+	}
+	else if (before_shmem_exit_index > 0)
+	{
+		ereport(ERROR,
+(errmsg("before_shmem_exit callback %p cannot be removed from the callback list as it doesn't match with the latest entry %p at index %d",
+		function,
+		before_shmem_exit_list[before_shmem_exit_index - 1].function,
+		before_shmem_exit_index),
+ errhint("Make sure the temporary before_shmem_exit callbacks are added and removed in strict Last-In-First-Out(LIFO) order.")));
+	}
 }
 
 /* 
-- 
2.25.1

Case for Improved Costing for Index Only Scans?

2020-09-08 Thread Hamid Akhtar

While running a simple SELECT statement on a table where I expected
indexonly scan to be preferred over sequential scan, the planner kept on
selecting the sequential scan. Looking at the EXPLAIN output, it seemed
obvious that the cost of indexonly was exceeding that of sequential scan.
So the planner was right, however, my expectation was that indexonly should
have been selected instead.

Following is an example table that I’ll be referring to.

-- Creating a very basic table, index and running vacuum analyze
CREATE TABLE index_test AS (SELECT GENERATE_SERIES(1, 1000)::int id,
'hello world' AS name);
CREATE INDEX on index_test;
VACUUM ANALYZE index_test;

-- SELECT statement
SELECT id FROM index_test WHERE id < [SOME VALUE];

So as a starting point, I wanted to identify when the cost for indexonly
exceeds that of a sequential scan. In my case, the number turned out to be
6,243,354 with the table containing 10,000,000 tuples.

When the cost exceeds, the planner should select the more optimum path.
However, my concern was why is the indexonly scan cost greater than that of
sequential one. Turning on the timing, the expectation was that at the
tipping point, indexonly execution time should be greater than that of
sequential one. However, I see that indexonly scan’s execution was much
much faster than the sequential scan. In terms of timing, this is what I
expected. So I turned on the timing in psql and ran EXPLAIN ANALYZE.
Following are the outputs.

-- SEQUENTIAL SCAN
EXPLAIN ANALYZE SELECT id FROM index_test WHERE id < 6243354;

  QUERY PLAN

---
 Seq Scan on index_test  (cost=0.00..179053.25 rows=6227030 width=4)
(actual time=0.175..993.291 rows=6243353 loops=1)
   Filter: (id < 6243354)
   Rows Removed by Filter: 3756647
 Planning Time: 0.235 ms
 Execution Time: 1149.590 ms
(5 rows)

Time: 1150.798 ms (00:01.151)
postgres=# explain analyze select id from index_test where id < 6243353;


-- INDEXONLY SCAN
EXPLAIN ANALYZE SELECT id FROM index_test WHERE id < 6243353;

  QUERY
PLAN
--
 Index Only Scan using index_test_id_idx on index_test
 (cost=0.43..177277.44 rows=6227029 width=4) (actual time=0.063..718.390
rows=6243352 loops=1)
   Index Cond: (id < 6243353)
   Heap Fetches: 0
 Planning Time: 0.174 ms
 Execution Time: 873.070 ms
(5 rows)

Time: 874.079 ms

Given that this is a very well packed table, still you can see that the
execution time increases from 718ms for indexonly scan to 993ms only with
an increase of a single tuple just because the cost of indexonly scan goes
beyond that of sequential scan.

I’ve tried this in a docker, my laptop and on a server without
virtualization. All have shown similar gains.

Reviewing the costestimate function in cost_size.c, I see that the cost of
indexonly differs for min_IO_cost and max_IO_cost. The indexTotalCost cost
remains the same whether it’s an indexonly scan or not. I believe that
total index cost in case of indexonly scan should also be updated.

I don't think there is a need to modify all amcostestimate functions for
indexes, however, perhaps we can pull in another factor that helps make
cost for indexonly more realistic.

ISTM, it should be reduced by a factor somewhere around (index
size)/(relation size), even perhaps putting together expected index size,
actual index size and relation size in some equation.

-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL：+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus

Re: Ideas about a better API for postgres_fdw remote estimates

On Fri, 4 Sep 2020 at 20:27, Tomas Vondra 
wrote

>
>
> 4) I wonder if we actually want/need to simply output pg_statistic data
> verbatim like this. Is postgres_fdw actually going to benefit from it? I
> kinda doubt that, and my assumption was that we'd return only a small
> subset of the data, needed by get_remote_estimate.
>
> This has a couple of issues. Firstly, it requires the knowledge of what
> the stakind constants in pg_statistic mean and how to interpret it - but
> OK, it's true that does not change very often (or at all). Secondly, it
> entirely ignores extended statistics - OK, we might extract those too,
> but it's going to be much more complex. And finally it entirely ignores
> costing on the remote node. Surely we can't just apply local
> random_page_cost or whatever, because those may be entirely different.
> And we don't know if the remote is going to use index etc.
>
> So is extracting data from pg_statistic the right approach?
>
>
There are two different problems, which ultimately might converge.
1. If use_remote_estimates = false, more generally if querying costs from
foreign server for costing paths is impractical, we want to use local
estimates and try to come up with costs. For that purpose we keep some
statistics locally and user is expected to refresh it periodically by
running ANALYZE on the foreign table. This patch is about a. doing this
efficiently without requiring to fetch every row from the foreign server b.
through autovacuum automatically without user firing ANALYZE. I think this
also answers your question about vacuum_rel() above.

2. How to efficiently extract costs from an EXPLAIN plan when
use_remote_eestimates is true. That's the subject of some nearby thread. I
think you are referring to that problem here. Hence your next point.

Using EXPLAIN to get costs from the foreign server isn't efficient. It
increases planning time a lot; sometimes planning time exceeds execution
time. If usage of foreign tables becomes more and more common, this isn't
ideal. I think we should move towards a model in which the optimizer can
decide whether a subtree involving a foreign server should be evaluated
locally or on the foreign server without the help of foreign server. One
way to do it (I am not saying that this is the only or the best way) is to
estimate the cost of foreign query locally based on the information
available locally about the foreign server and foreign table. This might
mean that we have to get that information from the foreign server and cache
it locally and use it several times, including the indexes on foreign
table, values of various costs etc. Though this approach doesn't solve all
of those problems it's one step forward + it makes the current scenario
also efficient.

I agree that the patch needs some work though, esp the code dealing with
serialization and deserialization of statistics.
-- 
Best Wishes,
Ashutosh

Re: Evaluate expression at planning time for two more cases

On Tue, 8 Sep 2020 at 07:16, Tom Lane  wrote:

>
>
> I'm not sure what I think about Ashutosh's ideas about doing this
> somewhere else than eval_const_expressions.  I do not buy the argument
> that it's interesting to do this separately for each child partition.
> Child partitions that have attnotnull constraints different from their
> parent's are at best a tiny minority use-case, if indeed we allow them
> at all (I tend to think we shouldn't).


I agree about partitions. But, IMO, a child having constraints different
from that of a parent is more common in inheritance trees.

Another point I raised in my mail was about constraint exclusion. Why
aren't these clauses constant-folded by constraint exclusion? Sorry, I
haven't looked at the constraint exclusion code myself for this.

As a not incidental example, consider
>
> select ... from t1 left join t2 on (...) where t2.x is not null;
>
> reduce_outer_joins will realize that the left join can be reduced
> to a plain join, whereupon (if t2.x is attnotnull) the WHERE clause
> really is constant-true --- and this seems like a poster-child case
> for it being useful to optimize away the WHERE clause.  But
> we won't be able to detect that if we apply the optimization during
> eval_const_expressions.  So maybe that's a good reason to do it
> somewhere later.
>

+1

-- 
Best Wishes,
Ashutosh

Re: Parallel worker hangs while handling errors.

2020-09-08 Thread Robert Haas

On Thu, Sep 3, 2020 at 5:29 PM Tom Lane  wrote:
> Concretely, something about like this (I just did the bgwriter, but
> we'd want the same in all the background processes).  I tried to
> respond to Robert's complaint about the inaccurate comment just above
> sigsetjmp, too.
>
> This passes check-world, for what little that's worth.

Seems totally reasonable from here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Global snapshots