Re: Dynamic gathering the values for seq_page_cost/xxx_cost

2020-09-25 Thread Julien Rouhaud
On Sat, Sep 26, 2020 at 8:17 AM Andy Fan  wrote:
>
> As for testing with the cache considered, I found it hard to estimate the
> cache hit ratio, and hard to control a hit ratio for testing. Recently I have
> been thinking about a method where we get page_reads and shared_buffer_hit
> from pg_kernel, and the real IO (without file system cache hits) at the OS
> level (just as iotop/pidstat do). Then we would know the shared_buffer hit
> ratio and the file system cache hit ratio (assuming they are stable after a
> long run), and could then do the testing. However, this would be another
> branch of manual work, and I have not got it done yet.

FWIW, the pg_stat_kcache [1] extension accumulates physical reads and writes
per (database, user, queryid), so you can easily compute a shared_buffers /
OS cache / disk hit ratio.

[1] https://github.com/powa-team/pg_stat_kcache
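
A rough sketch of such a computation, assuming pg_stat_statements is installed
alongside it, that pg_stat_kcache() exposes a per-query "reads" counter in
bytes (as its README describes), and an 8kB block size:

    -- Rough sketch: per-query breakdown into shared_buffers hits, OS cache
    -- hits and actual disk reads (all figures in 8kB blocks).
    SELECT s.queryid,
           s.shared_blks_hit                    AS shared_buffers_hit,
           s.shared_blks_read - k.reads / 8192  AS os_cache_hit,
           k.reads / 8192                       AS disk_read
    FROM pg_stat_statements s
    JOIN pg_stat_kcache() k USING (queryid, userid, dbid);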




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Amit Kapila
On Sat, Sep 26, 2020 at 11:00 AM Bharath Rupireddy
 wrote:
>
> On Fri, Sep 25, 2020 at 9:23 PM Greg Nancarrow  wrote:
> >
> > On Fri, Sep 25, 2020 at 10:17 PM Amit Kapila  
> > wrote:
> > >
> >
> > Again, there's a fundamental difference in the Parallel Insert case.
> > Right at the top of ExecutePlan it calls EnterParallelMode().
> > For ParallelCopy(), there is no such problem. EnterParallelMode() is
> > only called just before ParallelCopyMain() is called. So it can easily
> > acquire the xid before this, because then parallel mode is not set.
> >
> > As it turns out, I think I have solved the commandId issue (and almost
> > the xid issue) by realising that both the xid and cid are ALREADY
> > being included as part of the serialized transaction state in the
> > Parallel DSM. So actually I don't believe that there is any need for
> > separately passing them in the DSM, and having to use those
> > AssignForWorker() functions in the worker code - not even in the
> > Parallel Copy case (? - need to check).
> >
>
> Thanks Greg for the detailed points.
>
> I further checked on full txn id and command id. Yes, these are
> getting passed to workers via InitializeParallelDSM() ->
> SerializeTransactionState(). I tried to summarize what we need to do
> in case of parallel inserts in general, i.e. parallel COPY, parallel
> inserts in INSERT INTO, and parallel inserts in CTAS.
>
> In the leader:
> GetCurrentFullTransactionId()
> GetCurrentCommandId(true)
> EnterParallelMode();
> InitializeParallelDSM() --> calls SerializeTransactionState()
> (both full txn id and command id are serialized into parallel DSM)
>

This won't be true for the Parallel Insert patch, as explained by Greg as
well, because we enter parallel mode well before we assign the xid.


-- 
With Regards,
Amit Kapila.




Re: VACUUM PARALLEL option vs. max_parallel_maintenance_workers

2020-09-25 Thread Amit Kapila
On Tue, Sep 22, 2020 at 12:50 PM David Rowley  wrote:
>
> On Mon, 21 Sep 2020 at 19:15, Peter Eisentraut
>  wrote:
> >
> > On 2020-09-21 05:48, Amit Kapila wrote:
> > > What according to you should be the behavior here and how will it be
> > > better than current?
> >
> > I think if I write VACUUM (PARALLEL 5), it should use up to 5 workers
> > (up to the number of indexes), even if max_parallel_maintenance_workers
> > is 2.
>
> It would be good if we were consistent with these parallel options.
> Right now max_parallel_workers_per_gather will restrict the
> parallel_workers reloption.  I'd say this
> max_parallel_workers_per_gather is similar to
> max_parallel_maintenance_workers here and the PARALLEL vacuum option
> is like the parallel_workers reloption.
>
> If we want VACUUM's parallel option to work the same way as that then
> max_parallel_maintenance_workers should restrict whatever is mentioned
> in VACUUM PARALLEL.
>
> Or perhaps this is slightly different as the user is explicitly asking
> for this in the command, but you could likely say the same about ALTER
> TABLE  SET (parallel_workers = N); too.
>

This is exactly my feeling too. But how about changing the documentation a
bit, as proposed above [1], to make it precise?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1LQWXS_4RwLo%2BWT7jusGnBkUvXO73xQOCsydWLYBpLBEg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Bharath Rupireddy
On Fri, Sep 25, 2020 at 9:23 PM Greg Nancarrow  wrote:
>
> On Fri, Sep 25, 2020 at 10:17 PM Amit Kapila  wrote:
> >
>
> Again, there's a fundamental difference in the Parallel Insert case.
> Right at the top of ExecutePlan it calls EnterParallelMode().
> For ParallelCopy(), there is no such problem. EnterParallelMode() is
> only called just before ParallelCopyMain() is called. So it can easily
> acquire the xid before this, because then parallel mode is not set.
>
> As it turns out, I think I have solved the commandId issue (and almost
> the xid issue) by realising that both the xid and cid are ALREADY
> being included as part of the serialized transaction state in the
> Parallel DSM. So actually I don't believe that there is any need for
> separately passing them in the DSM, and having to use those
> AssignForWorker() functions in the worker code - not even in the
> Parallel Copy case (? - need to check).
>

Thanks Greg for the detailed points.

I further checked on full txn id and command id. Yes, these are
getting passed to workers via InitializeParallelDSM() ->
SerializeTransactionState(). I tried to summarize what we need to do
in case of parallel inserts in general, i.e. parallel COPY, parallel
inserts in INSERT INTO, and parallel inserts in CTAS.

In the leader:
GetCurrentFullTransactionId()
GetCurrentCommandId(true)
EnterParallelMode();
InitializeParallelDSM() --> calls SerializeTransactionState()
(both full txn id and command id are serialized into parallel DSM)

In the workers:
ParallelWorkerMain() -->  calls StartParallelWorkerTransaction() (both
full txn id and command id are restored into workers'
CurrentTransactionState->fullTransactionId and currentCommandId)
If the parallel workers are meant for insertions, then we need to set
currentCommandIdUsed = true. Maybe we can lift the assert in
GetCurrentCommandId(); if we don't want to touch that function, then
we can have a new function GetCurrentCommandIdInWorker() whose
functionality would be the same as GetCurrentCommandId() but without the
Assert(!IsParallelWorker());.
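
A minimal sketch of what such a worker-side variant could look like
(illustrative only; it simply mirrors GetCurrentCommandId() in
src/backend/access/transam/xact.c, where currentCommandId and
currentCommandIdUsed live):

    /*
     * Illustrative sketch: same as GetCurrentCommandId(true), but intended
     * for a parallel worker that is known to perform inserts, so the
     * Assert(!IsParallelWorker()) check is dropped.
     */
    CommandId
    GetCurrentCommandIdInWorker(void)
    {
        Assert(IsParallelWorker());
        currentCommandIdUsed = true;
        return currentCommandId;
    }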

Am I missing something?

If the above points are true, we might have to update the parallel
copy patch set, test the use cases and post separately in the parallel
copy thread in coming days.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Amit Kapila
On Fri, Sep 25, 2020 at 2:31 PM Bharath Rupireddy
 wrote:
>
> On Tue, Sep 22, 2020 at 10:26 AM Greg Nancarrow  wrote:
> >
> > For cases where it can't be allowed (e.g. INSERT into a table with
> > foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE
> > ...") it at least allows parallelism of the SELECT part.
> >
>
> Thanks Greg for the patch.
>
> 2. What happens if the target table has triggers(before statement,
> after statement, before row, after row) that are parallel unsafe?
>

In such a case, the parallel insert shouldn't be selected. However, we
should still be able to execute the Select part in parallel.

> 3. Will each worker be doing single row insertions or multi inserts?
> If single row insertions, will the buffer lock contentions be more?
>

I don't think the purpose of this patch is to change the basic flow of
how Insert works, and I am not sure it would be worth the effort anyway.
I have answered this earlier in a bit more detail [1].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1Ks8Sqs29VHPS6koNj5E9YQdkGCzgGsSrQMeUbQfe28yg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Amit Kapila
On Fri, Sep 25, 2020 at 9:23 PM Greg Nancarrow  wrote:
>
> On Fri, Sep 25, 2020 at 10:17 PM Amit Kapila 
> As it turns out, I think I have solved the commandId issue (and almost
> the xid issue) by realising that both the xid and cid are ALREADY
> being included as part of the serialized transaction state in the
> Parallel DSM. So actually I don't believe that there is any need for
> separately passing them in the DSM, and having to use those
> AssignForWorker() functions in the worker code - not even in the
> Parallel Copy case (? - need to check). GetCurrentCommandId(true) and
> GetFullTransactionId() need to be called prior to Parallel DSM
> initialization, so they are included in the serialized transaction
> state.
> I just needed to add a function to set currentCommandIdUsed=true in
> the worker initialization (for INSERT case) and make a small tweak to
> the Assert in GetCurrentCommandId() to ensure that
> currentCommandIdUsed, in a parallel worker, never gets set to true
> when it is false. This is in line with the comment in that function,
> because we know that "currentCommandId was already true at the start
> of the parallel operation". With this in place, I don't need to change
> any of the original calls to GetCurrentCommandId(), so this addresses
> that issue raised by Andres.
>
> I am not sure yet how to get past the issue of the parallel mode being
> set at the top of ExecutePlan(). With that in place, it doesn't allow
> a xid to be acquired for the leader, without removing/changing that
> parallel-mode check in GetNewTransactionId().
>

I think there is now no fundamental problem in allocating the xid in the
leader and then sharing it with the workers, who can use it to perform the
insert. So we can probably tweak that check so that it triggers only for
parallel workers.
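
A sketch of the kind of tweak being suggested, assuming the check in question
is the parallel-mode guard in GetNewTransactionId()/AssignTransactionId() (the
final patch may of course word this differently):

    /*
     * Illustrative sketch: reject XID assignment only in parallel workers,
     * so the leader can still acquire an xid before launching workers for a
     * parallel insert.  The current code rejects it whenever
     * IsInParallelMode() is true.
     */
    if (IsParallelWorker())
        elog(ERROR, "cannot assign TransactionIds in a parallel worker");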

-- 
With Regards,
Amit Kapila.




Re: a potential size overflow issue

2020-09-25 Thread Tom Lane
David Zhang  writes:
> "InitBufTable" is the function used to initialize the buffer lookup 
> table for buffer manager. With the memory size increasing nowadays, 
> there is a potential overflow issue for the parameter "int size" used by 
> "InitBufTable". This function is invoked in freelist.c as below:
>      InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

> The number of buffer block “NBuffers” is also defined as "int", and 
> "NUM_BUFFER_PARTITIONS" has a default value 128. In theory, it may get 
> the chance to overflow the "size" parameter in "InitBufTable".

No, because guc.c limits NBuffers to INT_MAX/2.

> Therefore, it would be better to change "InitBufTable(int size)" to 
> "InitBufTable(long size)".

If I was worried about this, that wouldn't be much of a fix, since
on many platforms "long" is not any wider than "int".  (We really
oughta try to move away from relying on "long", because its size
is so poorly standardized.)

regards, tom lane




Re: Optimize memory allocation code

2020-09-25 Thread Merlin Moncure
On Fri, Sep 25, 2020 at 7:32 PM Li Japin  wrote:
>
>
>
> > On Sep 26, 2020, at 8:09 AM, Julien Rouhaud  wrote:
> >
> > Hi,
> >
> > On Sat, Sep 26, 2020 at 12:14 AM Li Japin  wrote:
> >>
> >> Hi, hackers!
> >>
> >> I find that palloc0() is similar to palloc(); we can use palloc() inside
> >> palloc0() to allocate space, thereby reducing code duplication.
> >
> > The code is duplicated on purpose.  There's a comment at the beginning
> > that mentions it:
> >
> >  /* duplicates MemoryContextAllocZero to avoid increased overhead */
> >
> > Same for MemoryContextAllocZero() itself.
>
> Thanks! How big is this overhead? Is there any way I can test it?

Profiler.  For example, oprofile. In hot areas of the code (memory
allocation is very hot), profiling is the first step.

merlin




Re: Optimize memory allocation code

2020-09-25 Thread Li Japin


> On Sep 26, 2020, at 8:09 AM, Julien Rouhaud  wrote:
> 
> Hi,
> 
> On Sat, Sep 26, 2020 at 12:14 AM Li Japin  wrote:
>> 
>> Hi, hackers!
>> 
>> I find that palloc0() is similar to palloc(); we can use palloc() inside
>> palloc0() to allocate space, thereby reducing code duplication.
> 
> The code is duplicated on purpose.  There's a comment at the beginning
> that mentions it:
> 
>  /* duplicates MemoryContextAllocZero to avoid increased overhead */
> 
> Same for MemoryContextAllocZero() itself.

Thanks! How big is this overhead? Is there any way I can test it?

Best regards!

--
Japin Li

Re: Dynamic gathering the values for seq_page_cost/xxx_cost

2020-09-25 Thread Andy Fan
On Fri, Sep 25, 2020 at 5:15 PM Ashutosh Bapat 
wrote:

> On Tue, Sep 22, 2020 at 10:57 AM Andy Fan 
> wrote:
> >
> >
> > My tools set the random_page_cost to 8.6, but based on the fio data, it
> > should be set to 12.3 on the same hardware, and I do see the better plan
> > with 12.3. It looks too smooth to believe it is true..
> >
> > The attached result_fio_mytool.tar.gz is my test result.  git show HEAD^^
> > shows the original plan with 8.6, git show HEAD^ shows the plan changes
> > after we changed the random_page_cost, and git show HEAD shows the
> > run-time statistics changes for these queries.  I also uploaded the test
> > tool [1] for your double-checking.
>
> The scripts seem to start and stop the server, drop caches for every
> query. That's where you are seeing that setting random_page_cost to
> fio based ratio provides better plans. But in practice, these costs
> need to be set on a server where the queries are run concurrently and
> repeatedly. That's where the caching behaviour plays an important
> role. Can we write a tool which can recommend costs for that scenario?


I totally agree with this. Actually, the first thing I did was to define a
proper IO workload. At the very beginning I used DIRECT_IO for both sequential
and random reads on my SSD, and found the result was pretty bad in testing
(random_page_cost = ~1.6). Then I realized PostgreSQL relies on prefetching,
which is disabled by DIRECT_IO. After I fixed this, I tested again with the
above scenario (cache hit ratio = 0) to verify my IO model; per testing, it
looks good. I am also wondering whether random_page_cost = 1.1 gives a poor
result on my SSD because it ignores the prefetching of sequential reads.

Once I was satisfied with my IO model, I tested in the way you describe above,
but I also measure the latency of a file system cache hit, which is handled by
get_fs_cache_latency_us in my code (I ignored shared buffer hits for now). The
tool also lets the user provide a cache_hit_ratio, and the final
random_page_cost = real_random_lat / real_seq_lat, where
real_xxx_lat = cache_hit_ratio * fs_cache_lat + (1 - cache_hit_ratio) * xxx_lat.
See the functions cal_real_lat and cal_random_page_cost.
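
Spelled out, that is just a weighted blend of the cache-hit latency and the
device latency. A minimal sketch, assuming both functions take latencies as
plain numbers in the same unit (the real cal_real_lat/cal_random_page_cost
live in the test tool and may differ in detail):

    /* Blend cached and uncached latency according to the assumed hit ratio. */
    static double
    cal_real_lat(double cache_hit_ratio, double fs_cache_lat, double device_lat)
    {
        return cache_hit_ratio * fs_cache_lat +
               (1.0 - cache_hit_ratio) * device_lat;
    }

    /* random_page_cost = blended random latency / blended sequential latency */
    static double
    cal_random_page_cost(double cache_hit_ratio, double fs_cache_lat,
                         double seq_lat, double random_lat)
    {
        return cal_real_lat(cache_hit_ratio, fs_cache_lat, random_lat) /
               cal_real_lat(cache_hit_ratio, fs_cache_lat, seq_lat);
    }

For example, with fs_cache_lat = 4us, seq_lat = 80us and random_lat = 700us, a
50% cache hit ratio gives a random_page_cost of roughly 8.4, while a 0% hit
ratio gives 700/80 = 8.75.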

As for testing with the cache considered, I found it hard to estimate the
cache hit ratio, and hard to control a hit ratio for testing. Recently I have
been thinking about a method where we get page_reads and shared_buffer_hit
from pg_kernel, and the real IO (without file system cache hits) at the OS
level (just as iotop/pidstat do). Then we would know the shared_buffer hit
ratio and the file system cache hit ratio (assuming they are stable after a
long run), and could then do the testing. However, this would be another
branch of manual work, and I have not got it done yet.

I'd rather not share too many details, but "luckily" many of the cases I have
don't use the file system cache, which makes things a bit easier. What I am
doing right now is to calculate random_page_cost with the above algorithm,
considering only shared_buffers, and to test the real benefits with real
workloads to see how it works. If it works well, I think the only thing left
is to handle the file system cache.

The testing is time-consuming since I have to cooperate with many site
engineers, so any improvement to the design would be very helpful.


> How do the fio based cost perform when the queries are run repeatedly?
>
>
That probably would not work well, since I have 280GB+ of file system cache
and would have to prepare a data set much larger than 280GB for testing.


-- 
Best Regards
Andy Fan


Re: Optimize memory allocation code

2020-09-25 Thread Julien Rouhaud
Hi,

On Sat, Sep 26, 2020 at 12:14 AM Li Japin  wrote:
>
> Hi, hackers!
>
> I find that palloc0() is similar to palloc(); we can use palloc() inside
> palloc0() to allocate space, thereby reducing code duplication.

The code is duplicated on purpose.  There's a comment at the beginning
that mentions it:

  /* duplicates MemoryContextAllocZero to avoid increased overhead */

Same for MemoryContextAllocZero() itself.




a potential size overflow issue

2020-09-25 Thread David Zhang

Hi hackers,

"InitBufTable" is the function used to initialize the buffer lookup 
table for buffer manager. With the memory size increasing nowadays, 
there is a potential overflow issue for the parameter "int size" used by 
"InitBufTable". This function is invoked in freelist.c as below:

    InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

The number of buffer block “NBuffers” is also defined as "int", and 
"NUM_BUFFER_PARTITIONS" has a default value 128. In theory, it may get 
the chance to overflow the "size" parameter in "InitBufTable". The 
"size" parameter is later used by "ShmemInitHash" as "init_size" and 
"max_size", which are all defined as "long".


    SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
                                  size, size,
                                  &info,
                                  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);

Therefore, it would be better to change "InitBufTable(int size)" to 
"InitBufTable(long size)".


A simple patch is attached and it passed the “make installcheck-world” test.

--

David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
From 7df445121d3cf2aa7c0c22c831a8a920bc818d6e Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Fri, 25 Sep 2020 15:39:24 -0700
Subject: [PATCH] fix a potential overflow issue for InitBufTable

---
 src/backend/storage/buffer/buf_table.c | 2 +-
 src/include/storage/buf_internals.h| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c 
b/src/backend/storage/buffer/buf_table.c
index 4953ae9f82..d4f58c1ce0 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -49,7 +49,7 @@ BufTableShmemSize(int size)
  * size is the desired hash table size (possibly more than 
NBuffers)
  */
 void
-InitBufTable(int size)
+InitBufTable(long size)
 {
HASHCTL info;
 
diff --git a/src/include/storage/buf_internals.h 
b/src/include/storage/buf_internals.h
index 3377fa5676..25e1db6854 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -320,7 +320,7 @@ extern bool have_free_buffer(void);
 
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
-extern void InitBufTable(int size);
+extern void InitBufTable(long size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-- 
2.21.1 (Apple Git-122.3)



What does pg_stat_get_xact_function_self_time count exactly?

2020-09-25 Thread Chapman Flack
Hi,

Is the self-time of a tracked function (as reported by
pg_stat_get_xact_function_self_time) its total time minus

- the time of other tracked functions it calls?

- the time of other tracked or untracked functions it calls?

- the time of other tracked/untracked functions or SPI functions it calls?

- something else?

Regards,
-Chap




Re: Libpq support to connect to standby server as priority

2020-09-25 Thread Tom Lane
Greg Nancarrow  writes:
> [ v19-0001-Enhance-libpq-target_session_attrs-and-add-target_se.patch ]

I started to look through this, and I find that I'm really pretty
disappointed in the direction the patch has gone of late.  I think
there is no defensible reason for the choices that have been made
to have different behavior for v14-and-up servers than for older
servers.  It's not necessary and it complicates life for users.
We can use pg_is_in_recovery() on every server version that has
hot standby at all, so there is no reason not to treat the
GUC_REPORT indicator as an optimization that lets us skip a
separate inquiry transaction, rather than something we have to
have to make the feature work correctly.

So I think what we ought to have is the existing read-write
vs read-only distinction, implemented as it is now by checking
"SHOW transaction_read_only" if the server fails to send that
as a GUC_REPORT value; and orthogonally to that, a primary/standby
distinction implemented by checking pg_is_in_recovery(), again
with a fast path if we got a ParameterStatus report.
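
Concretely, when no ParameterStatus report is available the client would just
fall back to two ordinary probe queries (shown here only to illustrate the
two orthogonal distinctions):

    SHOW transaction_read_only;     -- read-write vs read-only session
    SELECT pg_is_in_recovery();     -- primary vs hot standby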

I do not like the addition of target_server_type.  It seems
unnecessary and confusing, particularly because you've had to
make a completely arbitrary decision about how it interacts with
target_session_attrs when both are specified.  I think the
justification that "it's more like JDBC" is risible.  Any user of
this will be writing C not Java.
 
A couple of other thoughts:

* Could we call the GUC "in_hot_standby" rather than "in_recovery"?
I think "recovery" is a poorly chosen legacy term that we ought to
avoid exposing to users more than we already have.  We're stuck
with pg_is_in_recovery() I suppose, but let's not double down on
bad decisions.

* I don't think you really need a hard-wired test on v14-or-not
in the libpq code.  The internal state about read-only and
hot-standby ought to be "yes", "no", or "unknown", starting in
the latter state.  Receipt of ParameterStatus changes it from
"unknown" to one of the other two states.  If we need to know
the value, and it's still "unknown", then we send a probe query.
We still need hard-coded version checks to know if the probe
query is safe, but those version breaks are far enough back to
be pretty well set in stone.  (In the back of my mind here is
that people might well choose to back-port the GUC_REPORT marking
of transaction_read_only, and maybe even the other GUC if they
were feeling ambitious.  So not having a hard-coded version
assumption where we don't particularly need it seems a good thing.)

* This can't be right, can it?  Too many commas.

@@ -1618,7 +1619,7 @@ static struct config_bool ConfigureNamesBool[] =
 {"transaction_read_only", PGC_USERSET, CLIENT_CONN_STATEMENT,
 gettext_noop("Sets the current transaction's read-only status."),
 NULL,
-GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+GUC_REPORT, GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | 
GUC_DISALLOW_IN_FILE
 },
 ,
 false,

(The compiler will fail to bitch about that unfortunately, since
there are more struct fields that we leave uninitialized normally.)

BTW, I think it would be worth splitting this into separate server-side
and libpq patches.  It looked to me like the server side is pretty
nearly committable, modulo bikeshedding about the new GUC name.  We could
get that out of the way and then have a much smaller libpq patch to argue
about.

regards, tom lane




Re: pg_upgrade: fail early if a tablespace dir already exists for new cluster version

2020-09-25 Thread Justin Pryzby
Added at
https://commitfest.postgresql.org/30/2739/
From 831246aa6d6837b2b0da7c96ad2f44ffd6cd3951 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 24 Sep 2020 19:49:40 -0500
Subject: [PATCH v2] pg_upgrade --check to avoid tablespace failure mode

---
 src/bin/pg_upgrade/check.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 2f7aa632c5..154e3d249e 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -13,6 +13,7 @@
 #include "fe_utils/string_utils.h"
 #include "mb/pg_wchar.h"
 #include "pg_upgrade.h"
+#include "common/relpath.h"
 
 static void check_new_cluster_is_empty(void);
 static void check_databases_are_compatible(void);
@@ -27,6 +28,7 @@ static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
+static void check_for_existing_tablespace_dirs(ClusterInfo *new_cluster);
 static char *get_canonical_locale_name(int category, const char *locale);
 
 
@@ -187,6 +189,8 @@ check_new_cluster(void)
 	check_is_install_user(&new_cluster);
 
 	check_for_prepared_transactions(&new_cluster);
+
+	check_for_existing_tablespace_dirs(&new_cluster);
 }
 
 
@@ -542,6 +546,39 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 }
 
 
+/*
+ * Check that tablespace dirs do not already exist for new cluster version.
+ * If they did, it'd cause an error while restoring global objects.
+ * This allows failing earlier rather than only after dumping schema.
+ *
+ * Note, v8.4 has no tablespace_suffix, which is fine so long as the version
+ * we are upgrading *to* has a suffix.
+ */
+static void
+check_for_existing_tablespace_dirs(ClusterInfo *new_cluster)
+{
+	char	old_tablespace_dir[MAXPGPATH];
+
+	prep_status("Checking for pre-existing tablespace directories");
+
+	for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+	{
+		struct stat statbuf;
+		snprintf(old_tablespace_dir, MAXPGPATH, "%s%s",
+os_info.old_tablespaces[tblnum],
+new_cluster->tablespace_suffix);
+
+		canonicalize_path(old_tablespace_dir);
+
+		// XXX: lstat ?
+		if (stat(old_tablespace_dir, &statbuf) == 0 || errno != ENOENT)
+			pg_fatal("tablespace directory already exists for new cluster version: \"%s\"\n",
+	 old_tablespace_dir);
+	}
+
+	check_ok();
+}
+
 /*
  * create_script_for_old_cluster_deletion()
  *
-- 
2.17.0



Re: history file on replica and double switchover

2020-09-25 Thread David Zhang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

"make installcheck-world" test was performed on branch "REL_13_STABLE" with tag 
"REL_13_0". The regression failed was not caused by the "history file" patch, 
since it has the same error on my environment even without any patch. So the 
failure is not related, in other words, the patch "history_replica_v4.patch" is 
good for me. 

Below is the failed message when tested with and without 
"history_replica_v4.patch".
t/004_timeline_switch.pl . ok   
t/005_replay_delay.pl  ok   
t/006_logical_decoding.pl  # Looks like your test exited with 29 
before it could output anything.
t/006_logical_decoding.pl  Dubious, test returned 29 (wstat 7424, 
0x1d00)
Failed 14/14 subtests 
t/007_sync_rep.pl  ok 
t/008_fsm_truncation.pl .. ok   
t/009_twophase.pl  ok 
t/010_logical_decoding_timelines.pl .. # Looks like your test exited with 29 
before it could output anything.
t/010_logical_decoding_timelines.pl .. Dubious, test returned 29 (wstat 7424, 
0x1d00)
Failed 13/13 subtests 
t/011_crash_recovery.pl .. ok   
t/012_subtransactions.pl . ok 
t/013_crash_restart.pl ... ok 
t/014_unlogged_reinit.pl . ok 
t/015_promotion_pages.pl . ok   
t/016_min_consistency.pl . ok   
t/017_shm.pl . ok   
t/018_wal_optimize.pl  ok 
t/019_replslot_limit.pl .. ok 
t/020_archive_status.pl .. ok 

Test Summary Report
---
t/006_logical_decoding.pl  (Wstat: 7424 Tests: 0 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 14 tests but ran 0.
t/010_logical_decoding_timelines.pl (Wstat: 7424 Tests: 0 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 13 tests but ran 0.
Files=20, Tests=202, 103 wallclock secs ( 0.18 usr  0.04 sys + 21.20 cusr 23.52 
csys = 44.94 CPU)
Result: FAIL
make[2]: *** [installcheck] Error 1
make[2]: Leaving directory `/home/david/git/postgres/src/test/recovery'
make[1]: *** [installcheck-recovery-recurse] Error 2
make[1]: Leaving directory `/home/david/git/postgres/src/test'
make: *** [installcheck-world-src/test-recurse] Error 2

Re: extension patch of CREATE OR REPLACE TRIGGER

2020-09-25 Thread Tom Lane
"osumi.takami...@fujitsu.com"  writes:
> [ CREATE_OR_REPLACE_TRIGGER_v11.patch ]

I took a quick look through this.  I think there's still work left to do.

* I'm concerned by the fact that there doesn't seem to be any defense
against somebody replacing a foreign-key trigger with something that
does something else entirely, and thereby silently breaking their
foreign key constraint.  I think it might be a good idea to forbid
replacing triggers for which tgisinternal is true; but I've not done
the legwork to see if that's exactly the condition we want.

* In the same vein, I'm not sure that the right things happen when fooling
with triggers attached to partitioned tables.  We presumably don't want to
allow mucking directly with a child trigger.  Perhaps refusing an update
when tgisinternal might fix this too (although we'll have to be careful to
make the error message not too confusing).

* I don't think that you've fully thought through the implications
of replacing a trigger for a table that the current transaction has
already modified.  Is it really sufficient, or even useful, to do
this:

+/*
+ * If this trigger has pending events, throw an error.
+ */
+if (trigger_deferrable && 
AfterTriggerPendingOnRel(RelationGetRelid(rel)))

As an example, if we change a BEFORE trigger to an AFTER trigger,
that's not going to affect the fact that we *already* fired that
trigger.  Maybe this is okay and we just need to document it, but
I'm not convinced.

* BTW, I don't think a trigger necessarily has to be deferrable
in order to have pending AFTER events.  The existing use of
AfterTriggerPendingOnRel certainly doesn't assume that.  But really,
I think we probably ought to be applying CheckTableNotInUse which'd
include that test.  (Another way in which there's fuzzy thinking
here is that AfterTriggerPendingOnRel isn't specific to *this*
trigger.)

* A lesser point is that I think you're overcomplicating the
code by applying heap_modify_tuple.  You might as well just
build the new tuple normally in all cases, and then apply
either CatalogTupleInsert or CatalogTupleUpdate.

* Also, the search for an existing trigger tuple is being
done the hard way.  You're using an index on (tgrelid, tgname),
so you could include the name in the index key and expect that
there's at most one match.
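
For concreteness, a sketch of the two-key lookup being suggested, modelled on
the existing index scans over pg_trigger in src/backend/commands/trigger.c
(an illustrative fragment, not the patch itself):

    /*
     * Illustrative fragment: search pg_trigger using both columns of the
     * (tgrelid, tgname) index, so at most one matching tuple is expected.
     */
    ScanKeyData skeys[2];
    SysScanDesc tgscan;
    HeapTuple   tuple;

    ScanKeyInit(&skeys[0],
                Anum_pg_trigger_tgrelid,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(RelationGetRelid(rel)));
    ScanKeyInit(&skeys[1],
                Anum_pg_trigger_tgname,
                BTEqualStrategyNumber, F_NAMEEQ,
                CStringGetDatum(stmt->trigname));

    tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
                                NULL, 2, skeys);
    tuple = systable_getnext(tgscan);	/* at most one matching trigger */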

regards, tom lane




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-25 Thread John Scalia
FIPS only specifies which algorithms are approved for use on it. For instance, 
MD-5 is NOT approved at all under FIPS. I would say any algorithm should 
produce the same result regardless of where it is run. BTW, on Redhat servers, 
the first algorithm listed for use with SSH is MD-5. This causes the sshd 
daemon to abort when FIPS is enabled and that config file has not been edited. 
So, you can no longer connect with an SSH client as the daemon isn’t running. 
Ask me how I know this.

Sent from my iPad

> On Sep 25, 2020, at 3:39 PM, Bruce Momjian  wrote:
> 
> On Fri, Sep 25, 2020 at 03:38:22PM -0400, John Scalia wrote:
>> Bruce,
>> 
>> In my experience, any client is permitted to connect to FIPS140-2 compliant 
>> server. I set this up when I worked at SSA, at management’s request.
> 
> My question is whether the hash output would match if using different
> code.
> 
> -- 
>  Bruce Momjian  https://momjian.us
>  EnterpriseDB https://enterprisedb.com
> 
>  The usefulness of a cup is in its emptiness, Bruce Lee
> 




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-25 Thread Bruce Momjian
On Fri, Sep 25, 2020 at 03:38:22PM -0400, John Scalia wrote:
> Bruce,
> 
>  In my experience, any client is permitted to connect to FIPS140-2 compliant 
> server. I set this up when I worked at SSA, at management’s request.

My question is whether the hash output would match if using different
code.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-25 Thread John Scalia
Bruce,

 In my experience, any client is permitted to connect to FIPS140-2 compliant 
server. I set this up when I worked at SSA, at management’s request.
—
Jay

Sent from my iPad

> On Sep 25, 2020, at 3:13 PM, Bruce Momjian  wrote:
> 
> On Fri, Sep 25, 2020 at 03:56:53PM +0900, Michael Paquier wrote:
>>> On Fri, Sep 25, 2020 at 01:36:44AM -0400, Tom Lane wrote:
>>> Peter Eisentraut  writes:
>>>> However, again, the SCRAM
>>>> implementation would already appear to fail that requirement because it
>>>> uses a custom HMAC implementation, and HMAC is listed in FIPS 140-2 as a
>>>> covered algorithm.
>>> 
>>> Ugh.  But is there any available FIPS-approved library code that could be
>>> used instead?
>> 
>> That's a good point, and I think that this falls down to use OpenSSL's
>> HMAC_* interface for this job when building with OpenSSL:
>> https://www.openssl.org/docs/man1.1.1/man3/HMAC.html
>> 
>> Worth noting that these have been deprecated in 3.0.0 as per the
>> rather-recent commit dbde472, where they recommend the use of
>> EVP_MAC_*() instead.
> 
> Would a FIPS server only be able to talk to a FIPS client, or would our
> internal code produce the same output?
> 
> -- 
>  Bruce Momjian  https://momjian.us
>  EnterpriseDB https://enterprisedb.com
> 
>  The usefulness of a cup is in its emptiness, Bruce Lee
> 
> 
> 




Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Andres Freund
Hi,

On 2020-09-25 14:11:46 -0400, Tom Lane wrote:
> Christoph Berg  writes:
> > Ok, but given that Debian is currently targeting 22 architectures, I doubt 
> > the PostgreSQL buildfarm covers all of them with the extra JIT option, so I 
> > should probably make sure to do that here when running tests.
> 
> +1.  I rather doubt our farm is running this type of test on anything
> but x86_64.

There's quite a few arm animals and at least one mips animal that do
some minimal coverage of JITing (i.e. the queries that are actually
somewhat expensive). I pinged two owners asking whether one of the arm
animals could be changed to force JITing.

Greetings,

Andres Freund




Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Andres Freund
Hi,

On 2020-09-25 19:05:52 +0200, Christoph Berg wrote:
> On September 25, 2020 at 18:42:04 CEST, Andres Freund wrote:
> >> * jit is not exercised enough by "make installcheck"
> >
> >So far we've exercised more widely it by setting up machines that use
> >it
> >for all queries (by setting the config option). I'm doubtful it's worth
> >doing differently.
> 
> Ok, but given that Debian is currently targeting 22 architectures, I
> doubt the PostgreSQL buildfarm covers all of them with the extra JIT
> option, so I should probably make sure to do that here when running
> tests.

Forcing to JIT a lot of queries that are otherwise really fast
unfortunately has a significant time cost. Doing that on slow
architectures might be prohibitively slow.  Kinda wonder if we shouldn't
just restrict JIT to a few architectures that we have a bit more regular
access to (x86, arm, maybe also ppc?). It's not like anybody would run
large analytics queries on mips.

Greetings,

Andres Freund




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-25 Thread Bruce Momjian
On Fri, Sep 25, 2020 at 03:56:53PM +0900, Michael Paquier wrote:
> On Fri, Sep 25, 2020 at 01:36:44AM -0400, Tom Lane wrote:
> > Peter Eisentraut  writes:
> >> However, again, the SCRAM 
> >> implementation would already appear to fail that requirement because it 
> >> uses a custom HMAC implementation, and HMAC is listed in FIPS 140-2 as a 
> >> covered algorithm.
> > 
> > Ugh.  But is there any available FIPS-approved library code that could be
> > used instead?
> 
> That's a good point, and I think that this falls down to use OpenSSL's
> HMAC_* interface for this job when building with OpenSSL:
> https://www.openssl.org/docs/man1.1.1/man3/HMAC.html
> 
> Worth noting that these have been deprecated in 3.0.0 as per the
> rather-recent commit dbde472, where they recommend the use of
> EVP_MAC_*() instead.

Would a FIPS server only be able to talk to a FIPS client, or would our
internal code produce the same output?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: history file on replica and double switchover

2020-09-25 Thread Fujii Masao



On 2020/09/26 2:58, Grigory Smolkin wrote:

Fujii Masao, David Zhang, Anastasia Lubennikova, many thanks to you for your
efforts with this patch!
Can I mark it as ready for committer?


Ok, but I attached an updated version of the patch. It would be helpful if you
could review that.

In the latest patch, I changed walreceiver so that it creates a .done file for
the streamed timeline history file when archive_mode is NOT "always".

Walreceiver does the same thing for the streamed WAL files, to prevent them
from being archived later. Without this, the streamed WAL files can exist in
pg_wal without any archive status files, and then they will be archived later
accidentally because of the lack of archive status.

OTOH, timeline history files will not be archived later even without archive
status files. So there is no strong reason to make walreceiver create a .done
file for the timeline history files. But at least to me it seems strange to
keep the file in pg_wal without an archive status. So for now I'm inclined to
create .done files. Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/high-availability.sgml 
b/doc/src/sgml/high-availability.sgml
index beb309e668..42f01c515f 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1395,7 +1395,8 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
  If archive_mode is set to on, the
  archiver is not enabled during recovery or standby mode. If the standby
  server is promoted, it will start archiving after the promotion, but
- will not archive any WAL it did not generate itself. To get a complete
+ will not archive any WAL or timeline history files that
+ it did not generate itself. To get a complete
  series of WAL files in the archive, you must ensure that all WAL is
  archived, before it reaches the standby. This is inherently true with
  file-based log shipping, as the standby can only restore files that
diff --git a/src/backend/replication/walreceiver.c 
b/src/backend/replication/walreceiver.c
index 17f1a49f87..bb1d44ccb7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -758,6 +758,15 @@ WalRcvFetchTimeLineHistoryFiles(TimeLineID first, 
TimeLineID last)
 */
writeTimeLineHistoryFile(tli, content, len);
 
+   /*
+* Mark the streamed history file as ready for archiving
+* if archive_mode is always.
+*/
+   if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+   XLogArchiveForceDone(fname);
+   else
+   XLogArchiveNotify(fname);
+
pfree(fname);
pfree(content);
}


Re: Asynchronous Append on postgres_fdw nodes.

2020-09-25 Thread Andrey Lepikhov
Your AsyncAppend doesn't switch to another source if data from the current
leader is available:


/*
 * The request for the next node cannot be sent before the leader
 * responds. Finish the current leader if possible.
 */
if (PQisBusy(leader_state->s.conn))
{
  int rc = WaitLatchOrSocket(NULL, WL_SOCKET_READABLE | WL_TIMEOUT | 
WL_EXIT_ON_PM_DEATH, PQsocket(leader_state->s.conn), 0, 
WAIT_EVENT_ASYNC_WAIT);

  if (!(rc & WL_SOCKET_READABLE))
available = false;
}

/* fetch the leader's data and enqueue it for the next request */
if (available)
{
  fetch_received_data(leader);
  add_async_waiter(leader);
}

I don't understand why this is needed. If we have fdw connections with
different latencies, then we will read data from the fast connection
first. I think this may be a source of skew and decrease the efficiency of
asynchronous Append.


For example, see below synthetic query:
CREATE TABLE l (a integer) PARTITION BY LIST (a);
CREATE FOREIGN TABLE f1 PARTITION OF l FOR VALUES IN (1) SERVER lb 
OPTIONS (table_name 'l1');
CREATE FOREIGN TABLE f2 PARTITION OF l FOR VALUES IN (2) SERVER lb 
OPTIONS (table_name 'l2');


INSERT INTO l (a) SELECT 2 FROM generate_series(1,200) as gs;
INSERT INTO l (a) SELECT 1 FROM generate_series(1,1000) as gs;

EXPLAIN ANALYZE (SELECT * FROM f1) UNION ALL (SELECT * FROM f2) LIMIT 400;

Result:
Limit  (cost=100.00..122.21 rows=400 width=4) (actual time=0.483..1.183 
rows=400 loops=1)
   ->  Append  (cost=100.00..424.75 rows=5850 width=4) (actual 
time=0.482..1.149 rows=400 loops=1)
 ->  Foreign Scan on f1  (cost=100.00..197.75 rows=2925 
width=4) (actual time=0.481..1.115 rows=400 loops=1)
 ->  Foreign Scan on f2  (cost=100.00..197.75 rows=2925 
width=4) (never executed)


As you can see, the executor scans one input and doesn't try to scan the other.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: Problem of ko.po on branch REL9_5_STABLE

2020-09-25 Thread Alvaro Herrera
Hi,

On 2020-Sep-25, Yang, Rong wrote:

>   When I checked the encoding of the Po files, I noticed that 
> src/bin/pg_config/po/ko.po on REL9_5_STABLE branch seemed a little different.
>   The ‘Content-Type’ of this file and file’s encoding are different from 
> those under other modules, as follows:
>  
>   src/bin/pg_config/po/ko.po:
>    "Content-Type: text/plain; charset=euc-kr\n"
>   src/bin/initdb/po/ko.po:
>    "Content-Type: text/plain; charset=UTF-8\n"

Yeah, each file declares in its header the encoding it's in, and of
course the entire file must be in that encoding.  In the case of
pg_config's ko.po it was changed from euc-kr to UTF-8 sometime in
October 2016 (between 9.5 and 9.6).  It is not supposed to cause any
problems.  If it does, please do let us know.

If the Korean translation interests you, perhaps we could persuade you
to update it for the 13 branch:
https://babel.postgresql.org/

Thanks,

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Tom Lane
Christoph Berg  writes:
> Ok, but given that Debian is currently targeting 22 architectures, I doubt 
> the PostgreSQL buildfarm covers all of them with the extra JIT option, so I 
> should probably make sure to do that here when running tests.

+1.  I rather doubt our farm is running this type of test on anything
but x86_64.

Of course, we can't actually *fix* any LLVM bugs, but it'd be nice
to know whether they're there.

regards, tom lane




Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Ranier Vilela
On Fri, Sep 25, 2020 at 14:36, Ranier Vilela
wrote:

> On Fri, Sep 25, 2020 at 11:30, Christoph Berg
> wrote:
>
>> Re: Tom Lane
>> > >  Tom> It's hardly surprising that datumCopy would segfault when given
>> a
>> > >  Tom> null "value" and told it is pass-by-reference. However, to get
>> to
>> > >  Tom> the datumCopy call, we must have passed the
>> MemoryContextContains
>> > >  Tom> check on that very same pointer value, and that would surely
>> have
>> > >  Tom> segfaulted as well, one would think.
>> >
>> > > Nope, because MemoryContextContains just returns "false" if passed a
>> > > NULL pointer.
>> >
>> > Ah, right.  So you could imagine getting here if the finalfn had
>> returned
>> > PointerGetDatum(NULL) with isnull = false.  We have some aggregate
>> > transfns that are capable of doing that for internal-type transvalues,
>> > I think, but the finalfn never should do it.
>>
>> So I had another stab at this. As expected, the 13.0 upload to
>> Debian/unstable crashed again on the buildd, while a manual
>> everything-should-be-the-same build succeeded. I don't know why I
>> didn't try this before, but this time I took this manual build and
>> started a PG instance from it. Pasting the gs_group_1 queries made it
>> segfault instantly.
>>
>> So here we are:
>>
>> #0  datumCopy (value=0, typLen=-1, typByVal=false) at
>> ./build/../src/backend/utils/adt/datum.c:142
>> #1  0x02aa3bf6322e in datumCopy (value=,
>> typByVal=, typLen=)
>> at ./build/../src/backend/utils/adt/datum.c:178
>> #2  0x02aa3bda6dd6 in finalize_aggregate 
>> (aggstate=aggstate@entry=0x2aa3defbfd0,
>> peragg=peragg@entry=0x2aa3e0671f0,
>> pergroupstate=pergroupstate@entry=0x2aa3e026b78,
>> resultVal=resultVal@entry=0x2aa3e067108, resultIsNull=0x2aa3e06712a)
>> at ./build/../src/backend/executor/nodeAgg.c:1153
>>
>> (gdb) p *resultVal
>> $3 = 0
>> (gdb) p *resultIsNull
>> $6 = false
>>
>> (gdb) p *peragg
>> $7 = {aggref = 0x2aa3deef218, transno = 2, finalfn_oid = 0, finalfn =
>> {fn_addr = 0x0, fn_oid = 0, fn_nargs = 0, fn_strict = false,
>> fn_retset = false, fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt =
>> 0x0, fn_expr = 0x0}, numFinalArgs = 1, aggdirectargs = 0x0,
>>   resulttypeLen = -1, resulttypeByVal = false, shareable = true}
>>
>> Since finalfn_oid is 0, resultVal/resultIsNull were set by the `else`
>> branch of the if (OidIsValid) in finalize_aggregate():
>>
>> else
>> {
>> /* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it
>> */
>> *resultVal = pergroupstate->transValue;
>> *resultIsNull = pergroupstate->transValueIsNull;
>> }
>>
>> (gdb) p *pergroupstate
>> $12 = {transValue = 0, transValueIsNull = false, noTransValue = false}
>>
> Shouldn't transValueIsNull be "true" here?
> Then datumCopy would be protected by the "!*resultIsNull" test.
>
Observe this excerpt (line 1129):
/* don't call a strict function with NULL inputs */
*resultVal = (Datum) 0;
*resultIsNull = true;

Now, doesn't this contradict that principle?
If all the values are null, shouldn't they be filled with true (1),
and not 0 (false)?

Line 4711, in function ExecReScanAgg:
MemSet(econtext->ecxt_aggvalues, 0, sizeof(Datum) * node->numaggs);
MemSet(econtext->ecxt_aggnulls, true, sizeof(bool) * node->numaggs);

Does zero here mean false, i.e. that aggvalues is null? No.

regards,
Ranier Vilela


Re: history file on replica and double switchover

2020-09-25 Thread Grigory Smolkin
Fujii Masao, David Zhang, Anastasia Lubennikova, many thanks to you for your
efforts with this patch!

Can I mark it as ready for committer?

On 9/25/20 10:07 AM, Fujii Masao wrote:



On 2020/09/25 13:00, David Zhang wrote:

On 2020-09-24 5:00 p.m., Fujii Masao wrote:



On 2020/09/25 8:15, David Zhang wrote:

Hi,

My understanding is that the "archiver" won't even start if 
"archive_mode = on" has been set on a "replica". Therefore, either 
(XLogArchiveMode == ARCHIVE_MODE_ALWAYS) or (XLogArchiveMode != 
ARCHIVE_MODE_OFF) will produce the same result.


Yes, the archiver isn't invoked in the standby when archive_mode=on.
But, with the original patch, walreceiver creates .ready archive 
status file

even when archive_mode=on and no archiver is running. This causes
the history file to be archived after the standby is promoted and
the archiver is invoked.

With my patch, walreceiver creates .ready archive status for the 
history file

only when archive_mode=always, like it does for the streamed WAL files.
This is the difference between those two patches. To prevent the 
streamed
timeline history file from being archived, IMO we should use the 
condition

archive_mode==always in the walreceiver.

+1





Please see how the "archiver" is started in 
src/backend/postmaster/postmaster.c


5227 /*
5228  * Start the archiver if we're responsible for 
(re-)archiving received

5229  * files.
5230  */
5231 Assert(PgArchPID == 0);
5232 if (XLogArchivingAlways())
5233 PgArchPID = pgarch_start();

I did run the nice script "double_switchover.sh" using either 
"always" or "on" on patch v1 and v2. They all generate the same 
results below. In other words, whether history file 
(0003.history) is archived or not depends on "archive_mode" 
settings.


echo "archive_mode = always" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:40 
00010002

... ...
-rw--- 1 david david 16777216 Sep 24 14:40 
0002000A

-rw--- 1 david david   41 Sep 24 14:40 0002.history
-rw--- 1 david david   83 Sep 24 14:40 0003.history

echo "archive_mode = on" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:47 
00010002

... ...
-rw--- 1 david david 16777216 Sep 24 14:47 
0002000A

-rw--- 1 david david   41 Sep 24 14:47 0002.history


Personally, I prefer patch v2 since it aligns with the statement
here:
https://www.postgresql.org/docs/13/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY


"If archive_mode is set to on, the archiver is not enabled during 
recovery or standby mode. If the standby server is promoted, it 
will start archiving after the promotion, but will not archive any 
WAL it did not generate itself."


By the way, I think the last part of the sentence should be changed 
to something like below:


"but will not archive any WAL, which was not generated by itself."


I'm not sure if this change is an improvement or not. But if we apply
the patch I proposed, maybe we should mention also history file here.
For example, "but will not archive any WAL or timeline history files
that it did not generate itself".

makes sense to me


So I included this change in the patch. Patch attached.

Regards,


--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Ranier Vilela
On Fri, Sep 25, 2020 at 11:30, Christoph Berg
wrote:

> Re: Tom Lane
> > >  Tom> It's hardly surprising that datumCopy would segfault when given a
> > >  Tom> null "value" and told it is pass-by-reference. However, to get to
> > >  Tom> the datumCopy call, we must have passed the MemoryContextContains
> > >  Tom> check on that very same pointer value, and that would surely have
> > >  Tom> segfaulted as well, one would think.
> >
> > > Nope, because MemoryContextContains just returns "false" if passed a
> > > NULL pointer.
> >
> > Ah, right.  So you could imagine getting here if the finalfn had returned
> > PointerGetDatum(NULL) with isnull = false.  We have some aggregate
> > transfns that are capable of doing that for internal-type transvalues,
> > I think, but the finalfn never should do it.
>
> So I had another stab at this. As expected, the 13.0 upload to
> Debian/unstable crashed again on the buildd, while a manual
> everything-should-be-the-same build succeeded. I don't know why I
> didn't try this before, but this time I took this manual build and
> started a PG instance from it. Pasting the gs_group_1 queries made it
> segfault instantly.
>
> So here we are:
>
> #0  datumCopy (value=0, typLen=-1, typByVal=false) at
> ./build/../src/backend/utils/adt/datum.c:142
> #1  0x02aa3bf6322e in datumCopy (value=,
> typByVal=, typLen=)
> at ./build/../src/backend/utils/adt/datum.c:178
> #2  0x02aa3bda6dd6 in finalize_aggregate 
> (aggstate=aggstate@entry=0x2aa3defbfd0,
> peragg=peragg@entry=0x2aa3e0671f0,
> pergroupstate=pergroupstate@entry=0x2aa3e026b78,
> resultVal=resultVal@entry=0x2aa3e067108, resultIsNull=0x2aa3e06712a)
> at ./build/../src/backend/executor/nodeAgg.c:1153
>
> (gdb) p *resultVal
> $3 = 0
> (gdb) p *resultIsNull
> $6 = false
>
> (gdb) p *peragg
> $7 = {aggref = 0x2aa3deef218, transno = 2, finalfn_oid = 0, finalfn =
> {fn_addr = 0x0, fn_oid = 0, fn_nargs = 0, fn_strict = false,
> fn_retset = false, fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0,
> fn_expr = 0x0}, numFinalArgs = 1, aggdirectargs = 0x0,
>   resulttypeLen = -1, resulttypeByVal = false, shareable = true}
>
> Since finalfn_oid is 0, resultVal/resultIsNull were set by the `else`
> branch of the if (OidIsValid) in finalize_aggregate():
>
> else
> {
> /* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
> *resultVal = pergroupstate->transValue;
> *resultIsNull = pergroupstate->transValueIsNull;
> }
>
> (gdb) p *pergroupstate
> $12 = {transValue = 0, transValueIsNull = false, noTransValue = false}
>
Shouldn't transValueIsNull be "true" here?
Then datumCopy would be protected by the "!*resultIsNull" test.

regards,
Ranier Vilela


Re: New statistics for tuning WAL buffer size

2020-09-25 Thread Fujii Masao




On 2020/09/25 12:06, Masahiro Ikeda wrote:

On 2020-09-18 11:11, Kyotaro Horiguchi wrote:

At Fri, 18 Sep 2020 09:40:11 +0900, Masahiro Ikeda
 wrote in

Thanks. I confirmed that it causes HOT pruning or killing of
dead index tuple if DecodeCommit() is called.

As you said, DecodeCommit() may access the system table.

...

The wals are generated only when logical replication is performed.
So, I added pgstat_send_wal() in XLogSendLogical().

But, I concerned that it causes poor performance
since pgstat_send_wal() is called per wal record,


I think that's too frequent.  If we want to send any stats to the
collector, it is usually done at commit time using
pgstat_report_stat(), and the function avoids sending stats too
frequently. For logrep-worker, apply_handle_commit() is calling it. It
seems to be the place if we want to send the wal stats.  Or it may be
better to call pgstat_send_wal() via pgstat_report_stat(), like
pg_stat_slru().


Thanks for your comments.
Since I changed to use pgstat_report_stat() and DecodeCommit() is calling it,
the frequency to send statistics is not so high.


On second thought, it's strange to include this change in the pg_stat_wal
patch, because pgstat_report_stat() sends various stats and that change would
affect not only pg_stat_wal but also other stats views. That is, if we really
want to make some processes newly call pgstat_report_stat(), that should be
implemented as a separate patch. But I'm not sure how useful this change is,
because the stats are probably almost negligibly small in those processes.

This thought seems valid for pgstat_send_wal() as well. I have changed my mind
and am inclined to be OK with not calling pgstat_send_wal() in some background
processes that are very unlikely to generate WAL, for example the logical-rep
launcher, logical-rep walsender, and autovacuum launcher. Thoughts?





Currently logrep-laucher, logrep-worker and autovac-launcher (and some
other processes?) don't seem (AFAICS) sending scan stats at all but
according to the discussion here, we should let such processes send
stats.


I added pgstat_report_stat() to logrep-laucher and autovac-launcher.
As you said, logrep-worker already calls apply_handle_commit() and 
pgstat_report_stat().


Right.



The checkpointer doesn't seem to call pgstat_report_stat() currently,
but since there is a possibility to send wal statistics, I added 
pgstat_report_stat().


IMO it's better to call pgstat_send_wal() in the checkpointer, instead,
because of the above reason.

Thanks for updating the patch! I'd like to share my review comments.

+for details.

Like the description for pg_stat_bgwriter,  tag should be used
instead of .


+  
+   Number of WAL writes when the wal_buffers are full
+  

I prefer the following description. Thought?

"Number of times WAL data was written to the disk because wal_buffers got full"


+the pg_stat_archiver view ,or 
wal

A comma should be just after "view" (not just before "or").


+/*
+ * WAL global statistics counter.
+ * This counter is incremented by both each backend and background.
+ * And then, sent to the stat collector process.
+ */
+PgStat_MsgWal WalStats;

What about merging the comments for BgWriterStats and WalStats into one because 
they are almost the same? For example,

---
/*
 * BgWriter and WAL global statistics counters.
 * Stored directly in a stats message structure so they can be sent
 * without needing to copy things around.  We assume these init to zeroes.
 */
PgStat_MsgBgWriter BgWriterStats;
PgStat_MsgWal WalStats;
---

BTW, originally there was the comment "(unused in other processes)"
for BgWriterStats. But it seems not true, so I removed it from
the above example.


+   rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+   (void) rc;  /* we'll check for 
error with ferror */

Since the patch changes the pgstat file format,
PGSTAT_FILE_FORMAT_ID should also be changed?


-* Clear out global and archiver statistics so they start from zero in
+* Clear out global, archiver and wal statistics so they start from 
zero in

This is not the issue of this patch, but isn't it better to mention
also SLRU stats here? That is, what about "Clear out global, archiver,
WAL and SLRU statistics so they start from zero in"?


I found "wal statistics" and "wal stats" in some comments in the patch,
but isn't it better to use "WAL statistics" and "WAL stats", instead,
if there is no special reason to use lowercase?


+   /*
+* Read wal stats struct
+*/
+   if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))

In pgstat_read_db_statsfile_timestamp(), the local variable myWalStats
should be declared and be used to store the WAL stats read via fread(),
instead.


+{ oid => '1136', descr => 'statistics: number of WAL writes when the wal buffers are full',

If we change the description of wal_buffers_full column in the document

Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-25 Thread Bruce Momjian
On Thu, Sep 24, 2020 at 09:59:50PM -0400, Tom Lane wrote:
> Kyotaro Horiguchi  writes:
> > Thank you Bruce, Michael. This is a rebased version.
> 
> I really strongly object to all the encoded data in this patch.
> One cannot read it, one cannot even easily figure out how long
> it is until the tests break by virtue of the certificates expiring.
> 
> One can, however, be entirely certain that they *will* break at
> some point.  I don't like the idea of time bombs in our test suite.
> That being the case, it'd likely be better to drop all the pre-made
> certificates and have the test scripts create them on the fly.
> That'd remove both the documentation problem (i.e., having readable
> info as to how the certificates were made) and the expiration problem.

I am not planning to apply the test parts of this patch.  I think
having the committer test it is sufficient.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Christoph Berg
Am 25. September 2020 18:42:04 MESZ schrieb Andres Freund 
>> * jit is not exercised enough by "make installcheck"
>
>So far we've exercised more widely it by setting up machines that use
>it
>for all queries (by setting the config option). I'm doubtful it's worth
>doing differently.

Ok, but given that Debian is currently targeting 22 architectures, I doubt the 
PostgreSQL buildfarm covers all of them with the extra JIT option, so I should 
probably make sure to do that here when running tests.




Re: [HACKERS] [PATCH] Generic type subscripting

2020-09-25 Thread Pavel Stehule
ne 20. 9. 2020 v 17:46 odesílatel Dmitry Dolgov <9erthali...@gmail.com>
napsal:

> > On Fri, Sep 18, 2020 at 07:23:11PM +0200, Pavel Stehule wrote:
> >
> > > In this way (returning an error on a negative indices bigger than the
> > > number of elements) functionality for assigning via subscripting will
> be
> > > already significantly differ from the original one via jsonb_set. Which
> > > in turn could cause a new wave of something similar to "why assigning
> an
> > > SQL NULL as a value returns NULL instead of jsonb?". Taking into
> account
> > > that this is not absolutely new interface, but rather a convenient
> > > shortcut for the existing one it probably makes sense to try to find a
> > > balance between both consistency with regular array and similarity with
> > > already existing jsonb modification functions.
> > >
> > > Having said that, my impression is that this balance should be not
> fully
> > > shifted towards consistensy with the regular array type, as jsonb array
> > > and regular array are fundamentally different in terms of
> > > implementation. If any differences are of concern, they should be
> > > addressed at different level. At the same time I've already sort of
> gave
> > > up on this patch in the form I wanted to see it anyway, so anything
> goes
> > > if it helps bring it to the finish point. In case if there would be no
> > > more arguments from other involved sides, I can post the next version
> > > with your suggestion included.
> > >
> >
> > This is a relatively new interface and at this moment we can decide if it
> > will be consistent or not.  I have not a problem if I have different
> > functions with different behaviors, but I don't like one interface with
> > slightly different behaviors for different types. I understand your
> > argument about implementing a lighter interface to some existing API.
> But I
> > think so more important should be consistency in maximall possible rate
> > (where it has sense).
> >
> > For me "jsonb" can be a very fundamental type in PLpgSQL development - it
> > can bring a lot of dynamic to this environment (it can work perfectly
> like
> > PL/SQL collection or like Perl dictionary), but for this purpose the
> > behaviour should be well consistent without surprising elements.
>
> And here we are, the rebased version with the following changes:
>
> insert into test_jsonb_subscript values (1, '[]');
> update test_jsonb_subscript set test_json[5] = 1;
> select * from test_jsonb_subscript;
>  id | test_json
> +---
>   1 | [null, null, null, null, null, 1]
> (1 row)
>
> update test_jsonb_subscript set test_json[-8] = 1;
> ERROR:  path element at position 1 is out of range: -8
>
> Thanks for the suggestions!
>

Thank you for accepting my suggestions.

I checked this set of patches and it looks well.

I have only one minor comment. I understand the error message, but I am not
sure whether someone without deeper knowledge would understand it.

+update test_jsonb_subscript set test_json[-8] = 1;
+ERROR:  path element at position 1 is out of range: -8

Maybe 'value of subscript "-8" is out of range'. The current error message is
fully correct, but people probably have to stop and think "what is a path element
at position 1?", which doesn't look intuitive.

Do you have some idea?

My comment is minor, and I mark this patch with pleasure as ready for
committer.

patching and compiling - without problems
implemented functionality - I like it
Building doc - without problems
make check-world - passed

Regards

Pavel


Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Andres Freund
Hi,

On 2020-09-25 17:29:07 +0200, Christoph Berg wrote:
> I guess that suggests two things:
> * jit is not ready for prime time on s390x and I should disable it

I don't know how good LLVMs support for s390x JITing is, and given that
it's unrealistic for people to get access to s390x...


> * jit is not exercised enough by "make installcheck"

So far we've exercised it more widely by setting up machines that use it
for all queries (by setting the config option). I'm doubtful it's worth
doing it differently.

Greetings,

Andres Freund




Re: Load TIME fields - proposed performance improvement

2020-09-25 Thread Tom Lane
Peter Smith  writes:
> The patch has been re-implemented based on previous advice.
> Please see attached.

Hm, so:

1. The caching behavior really needs to account for the possibility of
the timezone setting being changed intra-transaction.  That's not very
likely, perhaps, but we can't deliver a wrong answer.  It seems best
to make the dependency on session_timezone explicit instead of
allowing it to be hidden inside timestamp2tm.

2. I'd had in mind just one cache, not several.  Admittedly it
doesn't gain that much for different code paths to share the result,
but it seems silly not to, especially given the relative complexity
of getting the caching right.

That led me to refactor the patch as attached.  (I'd first thought
that we could just leave the j2date calculation in GetSQLCurrentDate
out of the cache, but performance measurements showed that it is
worthwhile to cache that.  An advantage of the way it's done here
is we probably won't have to redo j2date even across transactions.)

> (represents 19% improvement for this worst case table/data)

I'm getting a hair under 30% improvement on this admittedly
artificial test case, for either your patch or mine.

> Note: I patched the function GetCurrentTimeUsec consistently with the
> others, but actually I was not able to discover any SQL syntax which
> could cause that function to be invoked multiple times. Perhaps the
> patch for that function should be removed?

The existing callers for that are concerned with interpreting "now"
in datetime input strings.  It's certainly possible to have to do that
more than once per transaction --- an example would be COPY IN where
a lot of rows contain "now" as the value of a datetime column.

regards, tom lane

diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index eaaffa7137..057051fa85 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -299,20 +299,31 @@ EncodeSpecialDate(DateADT dt, char *str)
 DateADT
 GetSQLCurrentDate(void)
 {
-	TimestampTz ts;
-	struct pg_tm tt,
-			   *tm = &tt;
-	fsec_t		fsec;
-	int			tz;
+	struct pg_tm tm;
 
-	ts = GetCurrentTransactionStartTimestamp();
+	static int	cache_year = 0;
+	static int	cache_mon = 0;
+	static int	cache_mday = 0;
+	static DateADT cache_date;
 
-	if (timestamp2tm(ts, &tz, tm, &fsec, NULL, NULL) != 0)
-		ereport(ERROR,
-(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
- errmsg("timestamp out of range")));
+	GetCurrentDateTime(&tm);
 
-	return date2j(tm->tm_year, tm->tm_mon, tm->tm_mday) - POSTGRES_EPOCH_JDATE;
+	/*
+	 * date2j involves several integer divisions; moreover, unless our session
+	 * lives across local midnight, we don't really have to do it more than
+	 * once.  So it seems worth having a separate cache here.
+	 */
+	if (tm.tm_year != cache_year ||
+		tm.tm_mon != cache_mon ||
+		tm.tm_mday != cache_mday)
+	{
+		cache_date = date2j(tm.tm_year, tm.tm_mon, tm.tm_mday) - POSTGRES_EPOCH_JDATE;
+		cache_year = tm.tm_year;
+		cache_mon = tm.tm_mon;
+		cache_mday = tm.tm_mday;
+	}
+
+	return cache_date;
 }
 
 /*
@@ -322,18 +333,12 @@ TimeTzADT *
 GetSQLCurrentTime(int32 typmod)
 {
 	TimeTzADT  *result;
-	TimestampTz ts;
 	struct pg_tm tt,
 			   *tm = &tt;
 	fsec_t		fsec;
 	int			tz;
 
-	ts = GetCurrentTransactionStartTimestamp();
-
-	if (timestamp2tm(ts, &tz, tm, &fsec, NULL, NULL) != 0)
-		ereport(ERROR,
-(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
- errmsg("timestamp out of range")));
+	GetCurrentTimeUsec(tm, &fsec, &tz);
 
 	result = (TimeTzADT *) palloc(sizeof(TimeTzADT));
 	tm2timetz(tm, fsec, tz, result);
@@ -348,18 +353,12 @@ TimeADT
 GetSQLLocalTime(int32 typmod)
 {
 	TimeADT		result;
-	TimestampTz ts;
 	struct pg_tm tt,
 			   *tm = &tt;
 	fsec_t		fsec;
 	int			tz;
 
-	ts = GetCurrentTransactionStartTimestamp();
-
-	if (timestamp2tm(ts, &tz, tm, &fsec, NULL, NULL) != 0)
-		ereport(ERROR,
-(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
- errmsg("timestamp out of range")));
+	GetCurrentTimeUsec(tm, &fsec, &tz);
 
 	tm2time(tm, fsec, &result);
 	AdjustTimeForTypmod(&result, typmod);
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index dec2fad82a..bc682584f0 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -339,35 +339,78 @@ j2day(int date)
 /*
  * GetCurrentDateTime()
  *
- * Get the transaction start time ("now()") broken down as a struct pg_tm.
+ * Get the transaction start time ("now()") broken down as a struct pg_tm,
+ * according to the session timezone setting.
  */
 void
 GetCurrentDateTime(struct pg_tm *tm)
 {
-	int			tz;
 	fsec_t		fsec;
 
-	timestamp2tm(GetCurrentTransactionStartTimestamp(), &tz, tm, &fsec,
- NULL, NULL);
-	/* Note: don't pass NULL tzp to timestamp2tm; affects behavior */
+	/* This is just a convenience wrapper for GetCurrentTimeUsec */
+	GetCurrentTimeUsec(tm, &fsec, NULL);
 }
 
 /*
  * GetCurrentTimeUsec()
  *
  * Get the transaction start time ("now()") broken down as a struct pg_tm,
- * including fractional seconds and timezone 

Optimize memory allocation code

2020-09-25 Thread Li Japin
Hi, hackers!

I find that palloc0() is very similar to palloc(); we could use palloc() inside palloc0()
to allocate the space, thereby reducing some code duplication.
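For reference, a minimal sketch of what that might look like (only an
illustration of the idea, not the attached patch; it assumes the existing
palloc() and MemSetAligned() helpers):

void *
palloc0(Size size)
{
	void	   *ret = palloc(size);		/* reuse the common allocation path */

	/* palloc0() differs from palloc() only in zero-filling the chunk */
	MemSetAligned(ret, 0, size);

	return ret;
}

One trade-off to keep in mind is that the current code duplicates the
allocation logic in both functions, presumably to avoid an extra function
call on a hot path, so the patch would need to show that the extra
indirection is negligible.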

Best regards!

--
Japin Li



0001-Optimize-memory-allocation-code.patch
Description: 0001-Optimize-memory-allocation-code.patch


Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Greg Nancarrow
On Fri, Sep 25, 2020 at 10:17 PM Amit Kapila  wrote:
> But we can tighten the condition in GetCurrentCommandId() such that it
> Asserts for parallel worker only when currentCommandIdUsed is not set
> before start of parallel operation. I also find these changes in the
> callers of GetCurrentCommandId() quite adhoc and ugly even if they are
> correct. Also, why we don't face a similar problems for parallel copy?
>

For Parallel Insert, as part of query plan execution,
GetCurrentCommandId(true) is being called as part of INSERT statement
execution.
Parallel Copy of course doesn't have to deal with this. That's why
there's a difference. And also, it has its own parallel entry point
(ParallelCopyMain), so it's in full control, it's not trying to fit in
with the infrastructure for plan execution.

> So are you facing problems in this area because we EnterParallelMode
> before even assigning the xid in the leader? Because I don't think we
> should ever reach this code in the worker. If so, there are two
> possibilities that come to my mind (a) assign xid in leader before
> entering parallel mode or (b) change the check so that we don't assign
> the new xid in workers. In this case, I am again wondering how does
> parallel copy dealing this?
>

Again, there's a fundamental difference in the Parallel Insert case.
Right at the top of ExecutePlan it calls EnterParallelMode().
For ParallelCopy(), there is no such problem. EnterParallelMode() is
only called just before ParallelCopyMain() is called. So it can easily
acquire the xid before this, because then parallel mode is not set.

As it turns out, I think I have solved the commandId issue (and almost
the xid issue) by realising that both the xid and cid are ALREADY
being included as part of the serialized transaction state in the
Parallel DSM. So actually I don't believe that there is any need for
separately passing them in the DSM, and having to use those
AssignForWorker() functions in the worker code - not even in the
Parallel Copy case (? - need to check). GetCurrentCommandId(true) and
GetFullTransactionId() need to be called prior to Parallel DSM
initialization, so they are included in the serialized transaction
state.
I just needed to add a function to set currentCommandIdUsed=true in
the worker initialization (for INSERT case) and make a small tweak to
the Assert in GetCurrentCommandId() to ensure that
currentCommandIdUsed, in a parallel worker, never gets set to true
when it is false. This is in line with the comment in that function,
because we know that "currentCommandId was already true at the start
of the parallel operation". With this in place, I don't need to change
any of the original calls to GetCurrentCommandId(), so this addresses
that issue raised by Andres.
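In other words, the tweaked check would look roughly like this (just a
sketch based on the existing shape of GetCurrentCommandId() in xact.c, not
the exact patch text):

CommandId
GetCurrentCommandId(bool used)
{
	/* this is global to a transaction, not subtransaction-local */
	if (used)
	{
		/*
		 * In a parallel worker, currentCommandIdUsed may already be true
		 * (set during worker initialization for parallel INSERT); what must
		 * never happen is flipping it from false to true inside a worker,
		 * since there is no way to report that back to the leader.
		 */
		Assert(!(IsParallelWorker() && !currentCommandIdUsed));
		currentCommandIdUsed = true;
	}
	return currentCommandId;
}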

I am not sure yet how to get past the issue of the parallel mode being
set at the top of ExecutePlan(). With that in place, it doesn't allow
a xid to be acquired for the leader, without removing/changing that
parallel-mode check in GetNewTransactionId().

Regards,
Greg Nancarrow
Fujitsu Australia




Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Christoph Berg
Re: To Tom Lane
> I poked around with the SET in the offending tests, and the crash is
> only present if `set jit_above_cost = 0;` is present. Removing that
> makes it pass. Removing work_mem or enable_hashagg does not make a
> difference. llvm version is 10.0.1.

I put jit_above_cost=0 into postgresql.conf and ran "make installcheck"
again. There are more crashes:

>From src/test/regress/sql/interval.sql:

2020-09-25 17:00:25.130 CEST [8135] LOG:  Serverprozess (PID 8369) wurde von 
Signal 11 beendet: Speicherzugriffsfehler
2020-09-25 17:00:25.130 CEST [8135] DETAIL:  Der fehlgeschlagene Prozess führte 
aus: select avg(f1) from interval_tbl;

>From src/test/regress/sql/tid.sql:

2020-09-25 17:01:20.593 CEST [8135] LOG:  Serverprozess (PID 9015) wurde von 
Signal 11 beendet: Speicherzugriffsfehler
2020-09-25 17:01:20.593 CEST [8135] DETAIL:  Der fehlgeschlagene Prozess führte 
aus: SELECT max(ctid) FROM tid_tab;

>From src/test/regress/sql/collate*.sql:

2020-09-25 17:07:17.852 CEST [8135] LOG:  Serverprozess (PID 12232) wurde von 
Signal 11 beendet: Speicherzugriffsfehler
2020-09-25 17:07:17.852 CEST [8135] DETAIL:  Der fehlgeschlagene Prozess führte 
aus: SELECT min(b), max(b) FROM collate_test1;

>From src/test/regress/sql/plpgsql.sql:

2020-09-25 17:11:56.495 CEST [8135] LOG:  Serverprozess (PID 15562) wurde von 
Signal 11 beendet: Speicherzugriffsfehler
2020-09-25 17:11:56.495 CEST [8135] DETAIL:  Der fehlgeschlagene Prozess führte 
aus: select * from returnqueryf();

Contrary to the others this one is not related to aggregates:

   -- test RETURN QUERY with dropped columns
   
   create table tabwithcols(a int, b int, c int, d int);
   insert into tabwithcols values(10,20,30,40),(50,60,70,80);
   
   create or replace function returnqueryf()
   returns setof tabwithcols as $$
   begin
 return query select * from tabwithcols;
 return query execute 'select * from tabwithcols';
   end;
   $$ language plpgsql;
   
   select * from returnqueryf();
   
   alter table tabwithcols drop column b;
   
   select * from returnqueryf();
   
   alter table tabwithcols drop column d;
   
   select * from returnqueryf();
   
   alter table tabwithcols add column d int;
   
   select * from returnqueryf();
   
   drop function returnqueryf();
   drop table tabwithcols;

src/test/regress/sql/rangefuncs.sql:

2020-09-25 17:16:04.209 CEST [8135] LOG:  Serverprozess (PID 17372) wurde von 
Signal 11 beendet: Speicherzugriffsfehler
2020-09-25 17:16:04.209 CEST [8135] DETAIL:  Der fehlgeschlagene Prozess führte 
aus: select * from usersview;

src/test/regress/sql/alter_table.sql:

2020-09-25 17:21:36.187 CEST [8135] LOG:  Serverprozess (PID 19217) wurde von 
Signal 11 beendet: Speicherzugriffsfehler
2020-09-25 17:21:36.187 CEST [8135] DETAIL:  Der fehlgeschlagene Prozess führte 
aus: update atacc3 set test2 = 4 where test2 is null;

src/test/regress/sql/polymorphism.sql:

2020-09-25 17:23:55.509 CEST [8135] LOG:  Serverprozess (PID 21010) wurde von 
Signal 11 beendet: Speicherzugriffsfehler
2020-09-25 17:23:55.509 CEST [8135] DETAIL:  Der fehlgeschlagene Prozess führte 
aus: select myleast(1.1, 0.22, 0.55);

2020-09-25 17:28:26.222 CEST [8135] LOG:  Serverprozess (PID 22325) wurde von 
Signal 11 beendet: Speicherzugriffsfehler
2020-09-25 17:28:26.222 CEST [8135] DETAIL:  Der fehlgeschlagene Prozess führte 
aus: select f.longname from fullname f;

(stopping here)


There are also a lot of these log lines (without prefix):

ORC error: No callback manager available for s390x-ibm-linux-gnu

Is that worrying? I'm not sure but I think I've seen these on other
architectures as well.


I guess that suggests two things:
* jit is not ready for prime time on s390x and I should disable it
* jit is not exercised enough by "make installcheck"

Christoph




Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Christoph Berg
I poked around with the SET in the offending tests, and the crash is
only present if `set jit_above_cost = 0;` is present. Removing that
makes it pass. Removing work_mem or enable_hashagg does not make a
difference. llvm version is 10.0.1.


Test file:

--
-- Compare results between plans using sorting and plans using hash
-- aggregation. Force spilling in both cases by setting work_mem low
-- and altering the statistics.
--

create table gs_data_1 as
select g%1000 as g1000, g%100 as g100, g%10 as g10, g
   from generate_series(0,1999) g;

analyze gs_data_1;
alter table gs_data_1 set (autovacuum_enabled = 'false');
update pg_class set reltuples = 10 where relname='gs_data_1';

SET work_mem='64kB';

-- Produce results with sorting.

set enable_hashagg = false;
set jit_above_cost = 0; -- remove this to remove crash

explain (costs off)
select g100, g10, sum(g::numeric), count(*), max(g::text)
from gs_data_1 group by cube (g1000, g100,g10);

create table gs_group_1 as
select g100, g10, sum(g::numeric), count(*), max(g::text)
from gs_data_1 group by cube (g1000, g100,g10);



Christoph




Re: gs_group_1 crashing on 13beta2/s390x

2020-09-25 Thread Christoph Berg
Re: Tom Lane
> >  Tom> It's hardly surprising that datumCopy would segfault when given a
> >  Tom> null "value" and told it is pass-by-reference. However, to get to
> >  Tom> the datumCopy call, we must have passed the MemoryContextContains
> >  Tom> check on that very same pointer value, and that would surely have
> >  Tom> segfaulted as well, one would think.
> 
> > Nope, because MemoryContextContains just returns "false" if passed a
> > NULL pointer.
> 
> Ah, right.  So you could imagine getting here if the finalfn had returned
> PointerGetDatum(NULL) with isnull = false.  We have some aggregate
> transfns that are capable of doing that for internal-type transvalues,
> I think, but the finalfn never should do it.

So I had another stab at this. As expected, the 13.0 upload to
Debian/unstable crashed again on the buildd, while a manual
everything-should-be-the-same build succeeded. I don't know why I
didn't try this before, but this time I took this manual build and
started a PG instance from it. Pasting the gs_group_1 queries made it
segfault instantly.

So here we are:

#0  datumCopy (value=0, typLen=-1, typByVal=false) at 
./build/../src/backend/utils/adt/datum.c:142
#1  0x02aa3bf6322e in datumCopy (value=, typByVal=, typLen=)
at ./build/../src/backend/utils/adt/datum.c:178
#2  0x02aa3bda6dd6 in finalize_aggregate 
(aggstate=aggstate@entry=0x2aa3defbfd0, peragg=peragg@entry=0x2aa3e0671f0,
pergroupstate=pergroupstate@entry=0x2aa3e026b78, 
resultVal=resultVal@entry=0x2aa3e067108, resultIsNull=0x2aa3e06712a)
at ./build/../src/backend/executor/nodeAgg.c:1153

(gdb) p *resultVal
$3 = 0
(gdb) p *resultIsNull
$6 = false

(gdb) p *peragg
$7 = {aggref = 0x2aa3deef218, transno = 2, finalfn_oid = 0, finalfn = {fn_addr 
= 0x0, fn_oid = 0, fn_nargs = 0, fn_strict = false,
fn_retset = false, fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, 
fn_expr = 0x0}, numFinalArgs = 1, aggdirectargs = 0x0,
  resulttypeLen = -1, resulttypeByVal = false, shareable = true}

Since finalfn_oid is 0, resultVal/resultIsNull were set by the `else`
branch of the if (OidIsValid) in finalize_aggregate():

else
{
/* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
*resultVal = pergroupstate->transValue;
*resultIsNull = pergroupstate->transValueIsNull;
}

(gdb) p *pergroupstate
$12 = {transValue = 0, transValueIsNull = false, noTransValue = false}

That comes from finalize_aggregates:

#3  0x02aa3bda7e10 in finalize_aggregates 
(aggstate=aggstate@entry=0x2aa3defbfd0, peraggs=peraggs@entry=0x2aa3e067140,
pergroup=0x2aa3e026b58) at ./build/../src/backend/executor/nodeAgg.c:1369

/*
 * Run the final functions.
 */
for (aggno = 0; aggno < aggstate->numaggs; aggno++)
{
AggStatePerAgg peragg = &peraggs[aggno];
int transno = peragg->transno;
AggStatePerGroup pergroupstate;

pergroupstate = &pergroup[transno];

if (DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit))
finalize_partialaggregate(aggstate, peragg, pergroupstate,
  &aggvalues[aggno], &aggnulls[aggno]);
else
finalize_aggregate(aggstate, peragg, pergroupstate,
   &aggvalues[aggno], &aggnulls[aggno]);
}

... but at that point I'm lost. Maybe "aggno" and "transno" got mixed
up here?

(I'll leave the gdb session open for further suggestions.)

Christoph




Re: Probable documentation errors or improvements

2020-09-25 Thread Justin Pryzby
Split one patch about text search, added another one (sequences), added some
info to commit messages, and added here.
https://commitfest.postgresql.org/30/2744/

-- 
Justin
>From adf050ac6cc7d0905fc1613dce1a04f76a892609 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 22 Sep 2020 21:40:17 -0500
Subject: [PATCH v2 01/11] extraneous comma

---
 doc/src/sgml/rules.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/rules.sgml b/doc/src/sgml/rules.sgml
index bcf860b68b..d6e3463ac2 100644
--- a/doc/src/sgml/rules.sgml
+++ b/doc/src/sgml/rules.sgml
@@ -769,7 +769,7 @@ SELECT t1.a, t2.b, t1.ctid FROM t1, t2 WHERE t1.a = t2.a;
 
 
 
-The benefit of implementing views with the rule system is,
+The benefit of implementing views with the rule system is
 that the planner has all
 the information about which tables have to be scanned plus the
 relationships between these tables plus the restrictive
@@ -781,7 +781,7 @@ SELECT t1.a, t2.b, t1.ctid FROM t1, t2 WHERE t1.a = t2.a;
 the best path to execute the query, and the more information
 the planner has, the better this decision can be. And
 the rule system as implemented in PostgreSQL
-ensures, that this is all information available about the query
+ensures that this is all information available about the query
 up to that point.
 
 
-- 
2.17.0

>From d6e81a48596edcd255015767095e8985464051e1 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 23 Sep 2020 13:09:11 -0500
Subject: [PATCH v2 02/11] semicolons after END in programlisting

---
 doc/src/sgml/func.sgml|  2 +-
 doc/src/sgml/plpgsql.sgml | 12 ++--
 doc/src/sgml/rules.sgml   |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 461b748d89..1e18f332a3 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26580,7 +26580,7 @@ BEGIN
  obj.object_name,
  obj.object_identity;
 END LOOP;
-END
+END;
 $$;
 CREATE EVENT TRIGGER test_event_trigger_for_drops
ON sql_drop
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 815912666d..a71a8e7e48 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -1143,7 +1143,7 @@ BEGIN
 SELECT users.userid INTO STRICT userid
 FROM users WHERE users.username = get_userid.username;
 RETURN userid;
-END
+END;
 $$ LANGUAGE plpgsql;
 
  On failure, this function might produce an error message such as
@@ -1816,7 +1816,7 @@ BEGIN
 RETURN NEXT r; -- return current row of SELECT
 END LOOP;
 RETURN;
-END
+END;
 $BODY$
 LANGUAGE plpgsql;
 
@@ -1844,7 +1844,7 @@ BEGIN
 END IF;
 
 RETURN;
- END
+ END;
 $BODY$
 LANGUAGE plpgsql;
 
@@ -1918,7 +1918,7 @@ DECLARE myvar int := 5;
 BEGIN
   CALL triple(myvar);
   RAISE NOTICE 'myvar = %', myvar;  -- prints 15
-END
+END;
 $$;
 
 
@@ -3521,7 +3521,7 @@ BEGIN
 ROLLBACK;
 END IF;
 END LOOP;
-END
+END;
 $$;
 
 CALL transaction_test1();
@@ -5175,7 +5175,7 @@ DECLARE
 f1 int;
 BEGIN
 RETURN f1;
-END
+END;
 $$ LANGUAGE plpgsql;
 WARNING:  variable "f1" shadows a previously defined variable
 LINE 3: f1 int;
diff --git a/doc/src/sgml/rules.sgml b/doc/src/sgml/rules.sgml
index d6e3463ac2..e81addcfa9 100644
--- a/doc/src/sgml/rules.sgml
+++ b/doc/src/sgml/rules.sgml
@@ -2087,7 +2087,7 @@ CREATE FUNCTION tricky(text, text) RETURNS bool AS $$
 BEGIN
 RAISE NOTICE '% = %', $1, $2;
 RETURN true;
-END
+END;
 $$ LANGUAGE plpgsql COST 0.01;
 
 SELECT * FROM phone_number WHERE tricky(person, phone);
-- 
2.17.0

>From db23e239d021e018c3b9d6cb4d983f5bbcae8b85 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 22 Sep 2020 22:51:47 -0500
Subject: [PATCH v2 03/11] punctuation

---
 doc/src/sgml/catalogs.sgml | 2 +-
 doc/src/sgml/config.sgml   | 4 ++--
 doc/src/sgml/parallel.sgml | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index de9bacd34f..d1b8fc8010 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1525,7 +1525,7 @@
   
   
Role can log in. That is, this role can be given as the initial
-   session authorization identifier
+   session authorization identifier.
   
  
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8eabf93834..1c753ccb7e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10084,8 +10084,8 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
   
   

-If set, do not trace locks for tables below this OID. (use to avoid
-output on system tables)
+If set, do not trace locks for tables below this OID (used to avoid
+output on system tables).


 This parameter is only available if the LOCK_DEBUG
diff --git 

Re: Dumping/restoring fails on inherited generated column

2020-09-25 Thread Tom Lane
Peter Eisentraut  writes:
> The proposed patches will cause the last statement to be omitted, but 
> that still won't recreate the original state.  The problem is that there 
> is no command to make a column generated afterwards, like the SET 
> DEFAULT command, so we can't dump it like this.

Right.  I'm not even sure what such a command should do ... would it run
through all existing rows and update them to be the GENERATED value?

> We could probably fix this by having ALTER TABLE ONLY / DROP EXPRESSION 
> update the attlocal column of direct children to true, to make the 
> catalog state look like something that can be restored.  However, that's 
> a fair amount of complicated code, so for now I propose to just prohibit 
> this command, meaning you can't use ONLY in this command if there are 
> children.  This is new in PG13, so this change would have very limited 
> impact in practice.

+1.  At this point we would want some fairly un-complicated fix for
the v13 branch anyhow, and this seems to fit the bill.  (Also, having
child columns suddenly grow an attislocal property doesn't seem to meet
the principle of least surprise.)

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-25 Thread Masahiko Sawada
On Fri, 25 Sep 2020 at 18:21, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > I don't think it's always possible to avoid raising errors in advance.
> > Considering how postgres_fdw can implement your idea, I think
> > postgres_fdw would need PG_TRY() and PG_CATCH() for its connection
> > management. It has a connection cache in the local memory using HTAB.
> > It needs to create an entry for the first time to connect (e.g., when
> > prepare and commit prepared a transaction are performed by different
> > processes) and it needs to re-connect the foreign server when the
> > entry is invalidated. In both cases, ERROR could happen. I guess the
> > same is true for other FDW implementations. Possibly other FDWs might
> > need more work for example cleanup or releasing resources. I think
>
> Why does the client backend have to create a new connection cache entry 
> during PREPARE or COMMIT PREPARE?  Doesn't the client backend naturally 
> continue to use connections that it has used in its current transaction?

I think there are two cases: a process executes PREPARE TRANSACTION
and another process executes COMMIT PREPARED later; and, if the
coordinator has cascaded foreign servers (i.e., a foreign server has
its own foreign server) and a temporary connection problem happens on the
intermediate node after PREPARE, then another process on the
intermediate node will execute COMMIT PREPARED on its foreign server.

>
>
> > that the pros of your idea are to make the transaction manager simple
> > since we don't need resolvers and launcher but the cons are to bring
> > the complexity to FDW implementation codes instead. Also, IMHO I don't
> > think it's safe way that FDW does neither re-throwing an error nor
> > abort transaction when an error occurs.
>
> No, I didn't say the resolver is unnecessary.  The resolver takes care of 
> terminating remote transactions when the client backend encountered an error 
> during COMMIT/ROLLBACK PREPARED.

Understood. With your idea, we can remove at least the code that makes
backends wait and the inter-process communication between backends and
resolvers.

I think we need to consider whether it's really safe and what is needed to
achieve your idea safely.

>
>
> > In terms of performance you're concerned, I wonder if we can somewhat
> > eliminate the bottleneck if multiple resolvers are able to run on one
> > database in the future. For example, if we could launch resolver
> > processes as many as connections on the database, individual backend
> > processes could have one resolver process. Since there would be
> > contention and inter-process communication it still brings some
> > overhead but it might be negligible comparing to network round trip.
>
> Do you mean that if concurrent 200 clients each update data on two foreign 
> servers, there are 400 resolvers?  ...That's overuse of resources.

I think we would have 200 resolvers in this case, since there is one resolver
process per backend process. Another idea is that all processes queue the
foreign transactions to resolve into a shared memory queue, and resolver
processes fetch and resolve them, instead of assigning one
distributed transaction to one resolver process. Using asynchronous
execution, a resolver process could then process a bunch of foreign
transactions across distributed transactions, grouped by
foreign server, at once. It might be more complex than the current
approach, but having multiple resolver processes on one database would
increase throughput well, especially when combined with asynchronous
execution.
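Just to sketch that idea (purely illustrative; none of these names exist in
the current patch set), each queue entry would only need to identify which
prepared foreign transaction to resolve and where:

typedef struct FdwXactQueueEntry
{
	Oid			serverid;	/* foreign server the prepared xact lives on */
	Oid			userid;		/* user mapping to use for the connection */
	Oid			dbid;		/* local database the entry belongs to */
	char		gid[200];	/* identifier of the prepared transaction */
} FdwXactQueueEntry;

Resolver processes could then drain the queue and group the entries by
serverid, so that each foreign server is contacted once per batch.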

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Dumping/restoring fails on inherited generated column

2020-09-25 Thread Daniel Gustafsson
> On 25 Sep 2020, at 15:07, Peter Eisentraut  
> wrote:

> We could probably fix this by having ALTER TABLE ONLY / DROP EXPRESSION 
> update the attlocal column of direct children to true, to make the catalog 
> state look like something that can be restored.  However, that's a fair 
> amount of complicated code, so for now I propose to just prohibit this 
> command, meaning you can't use ONLY in this command if there are children.  
> This is new in PG13, so this change would have very limited impact in 
> practice.

That sounds a bit dramatic.  Do you propose to do that in v13 as well or just
in HEAD?  If the latter, considering that the window until the 14 freeze is
quite wide shouldn't we try to fix it first?

cheers ./daniel



Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Bharath Rupireddy
On Fri, Sep 25, 2020 at 5:47 PM Amit Kapila  wrote:
>
> >
> > At least in the case of Parallel INSERT, the leader for the Parallel
> > INSERT gets a new xid (GetCurrentFullTransactionId) and it is passed
> > through and assigned to each of the workers during their
> > initialization (so they are assigned the same xid).
> >
>
> So are you facing problems in this area because we EnterParallelMode
> before even assigning the xid in the leader? Because I don't think we
> should ever reach this code in the worker. If so, there are two
> possibilities that come to my mind (a) assign xid in leader before
> entering parallel mode or (b) change the check so that we don't assign
> the new xid in workers. In this case, I am again wondering how does
> parallel copy dealing this?
>

In parallel copy, we are doing option (a) i.e. the leader gets the
full txn id before entering parallel mode and passes it to all
workers.
In the leader:
full_transaction_id = GetCurrentFullTransactionId();
EnterParallelMode();
shared_info_ptr->full_transaction_id = full_transaction_id;
In the workers:
AssignFullTransactionIdForWorker(pcshared_info->full_transaction_id);

Hence below part of the code doesn't get hit.
if (IsInParallelMode() || IsParallelWorker())
elog(ERROR, "cannot assign XIDs during a parallel operation");

We also deal with the commandid similarly i.e. the leader gets the
command id, and workers would use it while insertion.
In the leader:
shared_info_ptr->mycid = GetCurrentCommandId(true);
In the workers:
AssignCommandIdForWorker(pcshared_info->mycid, true);

[1]
void
AssignFullTransactionIdForWorker(FullTransactionId fullTransactionId)
{
TransactionState s = CurrentTransactionState;

Assert((IsInParallelMode() || IsParallelWorker()));
s->fullTransactionId = fullTransactionId;
}

void
AssignCommandIdForWorker(CommandId commandId, bool used)
{
Assert((IsInParallelMode() || IsParallelWorker()));

/* this is global to a transaction, not subtransaction-local */
if (used)
currentCommandIdUsed = true;

currentCommandId = commandId;
}

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Dumping/restoring fails on inherited generated column

2020-09-25 Thread Peter Eisentraut
I have been analyzing this issue again.  We have a few candidate patches 
that do very similar things for avoiding dumping the generation 
expression of table gtest1_1.  We can figure out later which one of 
these we like best.  But there is another issue lurking nearby.  The 
table hierarchy of gtest30, which is created in the regression tests 
like this:


CREATE TABLE gtest30 (
a int,
b int GENERATED ALWAYS AS (a * 2) STORED
);
CREATE TABLE gtest30_1 () INHERITS (gtest30);
ALTER TABLE ONLY gtest30 ALTER COLUMN b DROP EXPRESSION;

This drops the generation expression from the parent table but not the 
child table.  This is currently dumped like this:


CREATE TABLE public.gtest30 (
a integer,
b integer
);

CREATE TABLE public.gtest30_1 (
)
INHERITS (public.gtest30);

ALTER TABLE ONLY public.gtest30_1 ALTER COLUMN b SET DEFAULT (a * 2);

The proposed patches will cause the last statement to be omitted, but 
that still won't recreate the original state.  The problem is that there 
is no command to make a column generated afterwards, like the SET 
DEFAULT command, so we can't dump it like this.  We would have to produce


CREATE TABLE public.gtest30 (
a integer,
b integer
);

CREATE TABLE public.gtest30_1 (
b integer GENERATED ALWAYS AS (a * 2) STORED
)
INHERITS (public.gtest30);

but this will create the column "b" of gtest30_1 as attlocal, which the 
original command sequence does not.


We could probably fix this by having ALTER TABLE ONLY / DROP EXPRESSION 
update the attlocal column of direct children to true, to make the 
catalog state look like something that can be restored.  However, that's 
a fair amount of complicated code, so for now I propose to just prohibit 
this command, meaning you can't use ONLY in this command if there are 
children.  This is new in PG13, so this change would have very limited 
impact in practice.


Proposed patch attached.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 6c4209883dad527eb1140a61ed7ac6a610cf2a14 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 25 Sep 2020 14:51:59 +0200
Subject: [PATCH] Disallow ALTER TABLE ONLY / DROP EXPRESSION

The current implementation cannot handle this correctly, so just forbid
it for now.

GENERATED clauses must be attached to the column definition and cannot
be added later like DEFAULT, so if a child table has a generation
expression that the parent does not have, the child column will
necessarily be an attlocal column.  So to implement ALTER TABLE ONLY /
DROP EXPRESSION, we'd need extra code to update attislocal of the
direct child tables, somewhat similar to how DROP COLUMN does it, so
that the resulting state can be properly dumped and restored.
---
 src/backend/commands/tablecmds.c| 22 +++---
 src/test/regress/expected/generated.out | 11 ++-
 src/test/regress/sql/generated.sql  |  2 +-
 3 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 16285ad09f..b1b1b1e74a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -412,7 +412,7 @@ static ObjectAddress ATExecAddIdentity(Relation rel, const 
char *colName,
 static ObjectAddress ATExecSetIdentity(Relation rel, const char *colName,
   Node 
*def, LOCKMODE lockmode);
 static ObjectAddress ATExecDropIdentity(Relation rel, const char *colName, 
bool missing_ok, LOCKMODE lockmode);
-static void ATPrepDropExpression(Relation rel, AlterTableCmd *cmd, bool 
recursing);
+static void ATPrepDropExpression(Relation rel, AlterTableCmd *cmd, bool 
recurse, bool recursing, LOCKMODE lockmode);
 static ObjectAddress ATExecDropExpression(Relation rel, const char *colName, 
bool missing_ok, LOCKMODE lockmode);
 static ObjectAddress ATExecSetStatistics(Relation rel, const char *colName, 
int16 colNum,

 Node *newValue, LOCKMODE lockmode);
@@ -4142,7 +4142,7 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
case AT_DropExpression: /* ALTER COLUMN DROP EXPRESSION */
ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE);
ATSimpleRecursion(wqueue, rel, cmd, recurse, lockmode, 
context);
-   ATPrepDropExpression(rel, cmd, recursing);
+   ATPrepDropExpression(rel, cmd, recurse, recursing, 
lockmode);
pass = AT_PASS_DROP;
break;
case AT_SetStatistics:  /* ALTER COLUMN SET STATISTICS */
@@ -7240,8 +7240,24 @@ ATExecDropIdentity(Relation rel, const char *colName, 
bool missing_ok, LOCKMODE
  * ALTER TABLE ALTER COLUMN DROP EXPRESSION
  */
 static void
-ATPrepDropExpression(Relation rel, AlterTableCmd 

Re: Feature improvement for FETCH tab completion

2020-09-25 Thread Fujii Masao



On 2020/09/25 17:21, btnakamichin wrote:

On 2020-09-25 15:38, Fujii Masao wrote:

On 2020/09/25 14:24, btnakamichin wrote:

Hello!

I’d like to improve the FETCH tab completion.

The FETCH tab completion is currently incomplete; this patch fixes the problem.

Previously the function completed with one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE, but
now it completes with one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE, ALL, NEXT,
PRIOR, FIRST, LAST, and it also handles the subsequent IN and FROM clauses.


Thanks for the patch! Here are review comments.

+    /* Complete FETCH BACKWARD or FORWARD with one of ALL */
+    else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD"))
+    COMPLETE_WITH("ALL");

Not only "ALL" but also "FROM" and "IN" should be displayed here
because they also can follow "BACKWARD" and "FORWARD"?

else if (Matches("FETCH|MOVE", MatchAny, MatchAny))
+    else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE",
MatchAny))
+    COMPLETE_WITH("FROM", "IN");

This change seems to cause "FETCH FORWARD FROM " to display "FROM"
and "IN". To avoid this confusing tab-completion, we should use something like
MatchAnyExcept("FROM|IN") here, instead?

Regards,


I'm sorry, I forgot to include pgsql-hackers in the cc, so I resend it.

Thank you, I appreciate your comment.

I have attached patch with newline.


Thanks for updating the patch!

The patch should include only the tab-completion change for FETCH,
but it accidentally also includes the DEALLOCATE change that you proposed
in another thread. So I excluded the DEALLOCATE part from the patch.
Attached is the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9c6f5ecb6a..3cf7698c0b 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3076,19 +3076,27 @@ psql_completion(const char *text, int start, int end)
COMPLETE_WITH("SELECT", "INSERT", "DELETE", "UPDATE", 
"DECLARE");
 
 /* FETCH && MOVE */
-   /* Complete FETCH with one of FORWARD, BACKWARD, RELATIVE */
+
+   /*
+* Complete FETCH with one of ABSOLUTE, BACKWARD, FORWARD, RELATIVE, 
ALL,
+* NEXT, PRIOR, FIRST, LAST
+*/
else if (Matches("FETCH|MOVE"))
-   COMPLETE_WITH("ABSOLUTE", "BACKWARD", "FORWARD", "RELATIVE");
-	/* Complete FETCH <sth> with one of ALL, NEXT, PRIOR */
-   else if (Matches("FETCH|MOVE", MatchAny))
-   COMPLETE_WITH("ALL", "NEXT", "PRIOR");
+   COMPLETE_WITH("ABSOLUTE", "BACKWARD", "FORWARD", "RELATIVE",
+ "ALL", "NEXT", "PRIOR", "FIRST", 
"LAST");
+
+   /* Complete FETCH BACKWARD or FORWARD with one of ALL, FROM, IN */
+   else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD"))
+   COMPLETE_WITH("ALL", "FROM", "IN");
 
/*
-	 * Complete FETCH <sth1> <sth2> with "FROM" or "IN". These are equivalent,
+	 * Complete FETCH <direction> with "FROM" or "IN". These are equivalent,
 * but we may as well tab-complete both: perhaps some users prefer one
 * variant or the other.
 */
-   else if (Matches("FETCH|MOVE", MatchAny, MatchAny))
+   else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE",
+MatchAnyExcept("FROM|IN")) ||
+Matches("FETCH|MOVE", "ALL|NEXT|PRIOR|FIRST|LAST"))
COMPLETE_WITH("FROM", "IN");
 
 /* FOREIGN DATA WRAPPER */


Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-09-25 Thread Bharath Rupireddy
On Fri, Sep 25, 2020 at 3:21 PM Fujii Masao 
wrote:
>
> I think that we can simplify the code by merging the connection-retry
> code into them, like the attached very WIP patch (based on yours) does.
>

+1.

>
> +   else
> +   ereport(ERROR,
> +
(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
> +errmsg("could not connect to
server \"%s\"",
> +
server->servername),
> +errdetail_internal("%s",
pchomp(PQerrorMessage(entry->conn);
>
> The above is not necessary? If this error occurs, connect_pg_server()
> reports an error, before the above code is reached. Right?
>

Removed.

Thanks for the comments.

I think we need to have a volatile qualifier for need_reset_conn, because
of the sigsetjmp.
Instead of "need_reset_conn", "retry_conn" would be more meaningful, and
similarly "retry:" would be a better goto label name than "reset:".
I changed "closing connection %p to reestablish connection" to "closing
connection %p to reestablish a new one".
I also adjusted the comments to be under the 80-character limit.
I feel, when we are about to close an existing connection and reestablish a
new connection, it will be good to have a debug3 message saying that we
"could not connect to postgres_fdw connection %p"[1].

Attaching v5 patch that has the above changes. Both make check and make
check-world regression tests passes. Please review.

[1] This would tell the user that the existing cached connection could not
be used and is being re-established.
postgres=# SELECT * FROM foreign_tbl;
DEBUG:  starting remote transaction on connection 0x55ab0e416830
DEBUG:  could not connect to postgres_fdw connection 0x55ab0e416830
DEBUG:  closing connection 0x55ab0e416830 to reestablish a new one
DEBUG:  new postgres_fdw connection 0x55ab0e416830 for server
"foreign_server" (user mapping oid 16407, userid 10)
DEBUG:  starting remote transaction on connection 0x55ab0e416830
DEBUG:  closing remote transaction on connection 0x55ab0e416830
 a1  | b1
-+-
 100 | 200

Without the above message, it would look like we are starting a remote
transaction and then closing the connection without any reason.

postgres=# SELECT * FROM foreign_tbl;
DEBUG:  starting remote transaction on connection 0x55ab0e4c0d50
DEBUG:  closing connection 0x55ab0e4c0d50 to reestablish a new one
DEBUG:  new postgres_fdw connection 0x55ab0e4c0d50 for server
"foreign_server" (user mapping oid 16389, userid 10)
DEBUG:  starting remote transaction on connection 0x55ab0e4c0d50
DEBUG:  closing remote transaction on connection 0x55ab0e4c0d50
 a1  | b1
-+-
 100 | 200

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v5-Retry-Cached-Remote-Connections-For-postgres_fdw.patch
Description: Binary data


Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Amit Kapila
On Fri, Sep 25, 2020 at 10:02 AM Greg Nancarrow  wrote:
>
> Hi Andres,
>
> On Thu, Sep 24, 2020 at 12:21 PM Andres Freund  wrote:
> >
>
>
> >> @@ -116,7 +117,7 @@ toast_save_datum(Relation rel, Datum value,
> >>   TupleDesc   toasttupDesc;
> >>   Datum   t_values[3];
> >>   boolt_isnull[3];
> >> - CommandId   mycid = GetCurrentCommandId(true);
> >> + CommandId   mycid = GetCurrentCommandId(!IsParallelWorker());
> >>   struct varlena *result;
> >>   struct varatt_external toast_pointer;
> >>   union
> >
> >Hm? Why do we need this in the various places you have made this change?
>
> It's because for Parallel INSERT, we're assigning the same command-id
> to each worker up-front during worker initialization (the commandId
> has been retrieved by the leader and passed through to each worker)
> and "currentCommandIdUsed" has been set true. See the
> AssignCommandIdForWorker() function in the patch.
> If you see the code of GetCurrentCommandId(), you'll see it Assert
> that it's not being run by a parallel worker if the parameter is true.
> I didn't want to remove yet another check, without being able to know
> the context of the caller, because only for Parallel INSERT do I know
> that "currentCommandIdUsed was already true at the start of the
> parallel operation". See the comment in that function. Anyway, that's
> why I'm passing "false" to relevant GetCurrentCommandId() calls if
> they're being run by a parallel (INSERT) worker.
>

But we can tighten the condition in GetCurrentCommandId() such that it
Asserts for a parallel worker only when currentCommandIdUsed was not set
before the start of the parallel operation. I also find these changes in the
callers of GetCurrentCommandId() quite ad hoc and ugly even if they are
correct. Also, why don't we face a similar problem for parallel copy?

>
> >> diff --git a/src/backend/access/transam/varsup.c 
> >> b/src/backend/access/transam/varsup.c
> >> index a4944fa..9d3f100 100644
> >> --- a/src/backend/access/transam/varsup.c
> >> +++ b/src/backend/access/transam/varsup.c
> >> @@ -53,13 +53,6 @@ GetNewTransactionId(bool isSubXact)
> >>   TransactionId xid;
> >>
> >>   /*
> >> -  * Workers synchronize transaction state at the beginning of each 
> >> parallel
> >> -  * operation, so we can't account for new XIDs after that point.
> >> -  */
> >> - if (IsInParallelMode())
> >> - elog(ERROR, "cannot assign TransactionIds during a parallel 
> >> operation");
> >> -
> >> - /*
> >>* During bootstrap initialization, we return the special bootstrap
> >>* transaction id.
> >>*/
> >
> >Same thing, this code cannot just be allowed to be reachable. What
> >prevents you from assigning two different xids from different workers
> >etc?
>
> At least in the case of Parallel INSERT, the leader for the Parallel
> INSERT gets a new xid (GetCurrentFullTransactionId) and it is passed
> through and assigned to each of the workers during their
> initialization (so they are assigned the same xid).
>

So are you facing problems in this area because we EnterParallelMode
before even assigning the xid in the leader? Because I don't think we
should ever reach this code in the worker. If so, there are two
possibilities that come to my mind: (a) assign the xid in the leader before
entering parallel mode, or (b) change the check so that we don't assign
the new xid in workers. In this case, I am again wondering how
parallel copy is dealing with this?

-- 
With Regards,
Amit Kapila.




Re: enable pg_stat_statements to track rows processed by REFRESH MATERIALIZED VIEW

2020-09-25 Thread legrand legrand
oups, sorry
so +1 for this fix

Regards
PAscal



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Greg Nancarrow
On Fri, Sep 25, 2020 at 7:01 PM Bharath Rupireddy
 wrote:
> I have few points (inspired from parallel copy feature work) to mention:
>
> 1. What if the target table is a foreign table or partitioned table?
> 2. What happens if the target table has triggers(before statement,
> after statement, before row, after row) that are parallel unsafe?
> 3. Will each worker be doing single row insertions or multi inserts?
> If single row insertions, will the buffer lock contentions be more?
> 5. How does it behave with toast columns values?
> 6. How does it behave if we have a RETURNING clause with INSERT INTO SELECT?
>

Hi Bharath,

Thanks for pointing out more cases I need to exclude and things I need
to investigate further.
I have taken note of them, and will do more testing and improvement.
At least RETURNING clause with INSERT INTO SELECT is working!

Regards,
Greg Nancarrow
Fujitsu Australia




Re: BUG #16419: wrong parsing BC year in to_date() function

2020-09-25 Thread Asif Rehman
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

Patch looks good to me.

The new status of this patch is: Ready for Committer


Re: Resetting spilled txn statistics in pg_stat_replication

2020-09-25 Thread Amit Kapila
On Thu, Sep 24, 2020 at 5:44 PM Amit Kapila  wrote:
>
> On Sat, Sep 19, 2020 at 1:48 PM Amit Kapila  wrote:
> >
> > On Tue, Sep 8, 2020 at 7:02 PM Amit Kapila  wrote:
> > >
> > > On Tue, Sep 8, 2020 at 7:53 AM Masahiko Sawada
> > >  wrote:
> >
> > I have fixed these review comments in the attached patch.
> >
> >
> > Apart from the above,
> > (a) fixed one bug in ReorderBufferSerializeTXN() where we were
> > updating the stats even when we have not spilled anything.
> > (b) made changes in pgstat_read_db_statsfile_timestamp to return false
> > when the replication slot entry is corrupt.
> > (c) move the declaration and definitions in pgstat.c to make them
> > consistent with existing code
> > (d) made another couple of cosmetic fixes and changed a few comments
> > (e) Tested the patch by using a guc which allows spilling all the
> > changes. See v4-0001-guc-always-spill
> >
>
> I have found a way to write the test case for this patch. This is
> based on the idea we used in stats.sql. As of now, I have kept the
> test as a separate patch. We can decide to commit the test part
> separately as it is slightly timing dependent but OTOH as it is based
> on existing logic in stats.sql so there shouldn't be much problem. I
> have not changed anything apart from the test patch in this version.
> Note that the first patch is just a debugging kind of tool to test the
> patch.
>

I have done some more testing of this patch especially for the case
where we spill before streaming the transaction and found everything
is working as expected. Additionally, I have changed a few more
comments and ran pgindent. I am still not very sure whether we want to
display physical slots in this view as all the stats are for logical
slots but anyway we can add stats w.r.t physical slots in the future.
I am fine either way (don't show physical slots in this view or show
them but keep stats as 0). Let me know if you have any thoughts on
these points, other than that I am happy with the current state of the
patch.

-- 
With Regards,
Amit Kapila.


v6-0001-Track-statistics-for-spilling-of-changes-from-Reo.patch
Description: Binary data


v6-0002-Test-stats.patch
Description: Binary data


Re: enable pg_stat_statements to track rows processed by REFRESH MATERIALIZED VIEW

2020-09-25 Thread Fujii Masao




On 2020/09/25 19:04, legrand legrand wrote:

Hi,

isn't this already fixed in pg14
https://www.postgresql.org/message-id/e1k0mzg-0002vn...@gemulon.postgresql.org
?


IIUC that commit handled CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW
and FETCH commands, but not REFRESH MATERIALIZED VIEW. Katsuragi-san's patch is
for REFRESH MATERIALIZED VIEW.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Online checksums verification in the backend

2020-09-25 Thread Julien Rouhaud
On Wed, Sep 16, 2020 at 11:46 AM Michael Paquier  wrote:
>
> On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote:
> > Thanks!
>
> I got some numbers out of my pocket, using the following base
> configuration:
> [...]
>
> Within parenthesis is the amount of time taken by pg_relation_check()
> for a single check.  This is not completely exact and I saw some
> variations, just to give an impression:
> Conns                64             128                  256
> force_lock=true      60590 (7~8s)   55652 (10.2~10.5s)   46812 (9~10s)
> force_lock=false     58637 (5s)     54131 (6~7s)         37091 (1.1~1.2s)
>
> For connections higher than 128, I was kind of surprised to see
> pg_relation_check being more aggressive without the optimization, with
> much less contention on the buffer mapping LWLock actually, but that's
> an impression from looking at pg_stat_activity.
>
> Looking at the wait events for each run, I saw much more hiccups with
> the buffer mapping LWLock when forcing the lock rather than not, still
> I was able to see some contention when also not forcing the lock.  Not
> surprising as this rejects a bunch of pages from shared buffers.
>
> > I used all default settings, but by default checksum_cost_delay is 0
> > so there shouldn't be any throttling.
>
> Thanks, so did I.  From what I can see, there could be as well
> benefits in not using the optimization by default so as the relation
> check applies some natural throttling by making the checks actually
> slower (there is a link between the individual runtime of
> pg_relation_time and the TPS).

Thanks a lot for the tests!  I'm not surprised that forcing the lock
will slow down the pg_check_relation() execution, but I'm a bit
surprised that holding the buffer mapping lock longer in a workload
that has a lot of evictions actually makes things faster.  Do you have
any idea why that's the case?

>  So it also seems to me that the
> throttling parameters would be beneficial, but it looks to me that
> there is as well a point to not include any throttling in a first
> version if the optimization to go full speed is not around.  Using
> three new GUCs for those function calls is still too much IMO

I'm assuming that you prefer to remove both the optimization and the
throttling part?  I'll do that with the next version unless there are
objections.

>, so
> there is also the argument to move all this stuff into a new contrib/
> module, and have a bgworker implementation as part of it as it would
> need shared_preload_libraries anyway.
>
> Also, I have been putting some thoughts into the APIs able to fetch a
> buffer without going through the shared buffers.  And neither
> checksum.c, because it should be a place that those APIs depends on
> and include only the most-internal logic for checksum algorithm and
> computation, nor checksumfuncs.c, because we need to think about the
> case of a background worker as well (that could spawn a set of dynamic
> workers connecting to different databases able to do checksum
> verifications?).  It would be good to keep the handling of the buffer
> mapping lock as well as the calls to smgrread() into a single place.
> ReadBuffer_common() is a natural place for that, though it means for
> our use case the addition of three new options:
> - Being able to pass down directly a buffer pointer to save the page
> contents.
> - Being able to not verify directly a page, leaving the verification
> to the caller upthread.
> - Addition of a new mode, that I am calling here RBM_PRIVATE, where we
> actually read the page from disk if not yet in shared buffers, except
> that we fill in the contents of the page using the pointer given by
> the caller.  That's different than the use of local buffers, as we
> don't need this much amount of complications like temporary tables of
> course for per-page checks.
>
> Another idea would be to actually just let ReadBuffer_common just do
> the check by itself, with a different mode like a kind of
> RBM_VALIDATE, where we just return a verification state of the page
> that can be consumed by callers.
>
> This also comes with some more advantages:
> - Tracking of reads from disk with track_io_timing.
> - Addition of some private stats dedicated to this private mode, with
> new fields in pgBufferUsage, all in a single place

I agree that putting the code near ReadBuffer_common() would be a
good idea.  However, that means that I can't move all the code to
contrib/, so I'm wondering what you'd like to see going there.  I can
see some value in also having the SQL functions available in core
rather than contrib, e.g. if you need to quickly check a relation on a
standby without having to create the extension on the primary
node first.

Then, I'm a bit worried about adding this code in ReadBuffer_common.
What this code does is quite different, and I'm afraid that it'll make
ReadBuffer_common more complex than needed, which is maybe not a good
idea for something as critical as this

Re: enable pg_stat_statements to track rows processed by REFRESH MATERIALIZED VIEW

2020-09-25 Thread legrand legrand
Hi,

isn't this already fixed in pg14
https://www.postgresql.org/message-id/e1k0mzg-0002vn...@gemulon.postgresql.org
?

Regards
PAscal





--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-09-25 Thread Fujii Masao



On 2020/09/25 13:56, Bharath Rupireddy wrote:

On Wed, Sep 23, 2020 at 8:19 PM Fujii Masao  wrote:



Please let me know if you are okay with the above agreed points; I will work on the new
patch.


Yes, please work on the patch! Thanks! I may revisit the above points later, 
though ;)



Thanks, attaching v4 patch. Please consider this for further review.


Thanks for updating the patch!

In the original code, disconnect_pg_server() is called when invalidation
occurs and connect_pg_server() is called when no valid connection exists.
I think that we can simplify the code by merging the connection-retry
code into them, like the attached very WIP patch (based on yours) does.
Basically I'd like to avoid duplicating disconnect_pg_server(),
connect_pg_server() and begin_remote_xact() if possible. Thought?

Your patch adds quite a bit of code into the PG_CATCH() section, but it's better
to keep that section simple (as the source comment for
PG_CATCH() explains). So what about moving some of that code out of the PG_CATCH()
section?

+   else
+   ereport(ERROR,
+   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+errmsg("could not connect to server \"%s\"",
+   server->servername),
+errdetail_internal("%s", pchomp(PQerrorMessage(entry->conn);

The above is not necessary? If this error occurs, connect_pg_server()
reports an error, before the above code is reached. Right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf..9f76261d99 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -108,6 +108,7 @@ PGconn *
 GetConnection(UserMapping *user, bool will_prep_stmt)
 {
	bool		found;
+	bool		need_reset_conn = false;
ConnCacheEntry *entry;
ConnCacheKey key;
 
@@ -159,23 +160,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Reject further use of connections which failed abort cleanup. */
pgfdw_reject_incomplete_xact_state_change(entry);
 
+reset:
/*
 * If the connection needs to be remade due to invalidation, disconnect as
-* soon as we're out of all transactions.
+* soon as we're out of all transactions. Also if previous attempt to start
+* new remote transaction failed on the cached connection, disconnect
+* it to reestablish new connection.
 */
-   if (entry->conn != NULL && entry->invalidated && entry->xact_depth == 0)
+   if ((entry->conn != NULL && entry->invalidated && entry->xact_depth == 0) ||
+   need_reset_conn)
{
-   elog(DEBUG3, "closing connection %p for option changes to take effect",
-entry->conn);
+   if (need_reset_conn)
+   elog(DEBUG3, "closing connection %p to reestablish 
connection",
+entry->conn);
+   else
+   elog(DEBUG3, "closing connection %p for option changes 
to take effect",
+entry->conn);
disconnect_pg_server(entry);
}
 
-   /*
-* We don't check the health of cached connection here, because it would
-* require some overhead.  Broken connection will be detected when the
-* connection is actually used.
-*/
-
/*
 * If cache entry doesn't have a connection, we have to establish a new
 * connection.  (If connect_pg_server throws an error, the cache entry
@@ -206,9 +209,31 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
 
/*
-* Start a new transaction or subtransaction if needed.
+* We check the health of the cached connection here, while the remote
+* xact is going to begin.  Broken connection will be detected and the
+* connection is retried. We do this only once during each begin remote
+* xact call, if the connection is still broken after the retry attempt,
+* then the query will fail.
 */
-   begin_remote_xact(entry);
+   PG_TRY();
+   {
+   /* Start a new transaction or subtransaction if needed. */
+   begin_remote_xact(entry);
+   need_reset_conn = false;
+   }
+   PG_CATCH();
+   {
+   if (!entry->conn ||
+   PQstatus(entry->conn) != CONNECTION_BAD ||
+   entry->xact_depth > 0 ||
+   need_reset_conn)
+   PG_RE_THROW();
+   need_reset_conn = true;
+   }
+   PG_END_TRY();
+
+   if (need_reset_conn)
+   

Re: [patch] Concurrent table reindex per-index progress reporting

2020-09-25 Thread Matthias van de Meent
On Fri, 25 Sep 2020 at 08:44, Michael Paquier  wrote:
>
> On Thu, Sep 24, 2020 at 09:19:18PM +0200, Matthias van de Meent wrote:
> > While working on a PG12-instance I noticed that the progress reporting of
> > concurrent index creation for non-index relations fails to update the
> > index/relation OIDs that it's currently working on in the
> > pg_stat_progress_create_index view after creating the indexes. Please find
> > attached a patch which properly sets these values at the appropriate places.
> >
> > Any thoughts?
>
> I agree that this is a bug and that we had better report correctly the
> heap and index IDs worked on, as these could also be part of a toast
> relation from the parent table reindexed.  However, your patch is
> visibly incorrect in the two places you are updating.

Thanks for checking the patch, I should've checked it more than I did.

>
> > +  * Configure progress reporting to report for this index.
> > +  * While we're at it, reset the phase as it is cleared by 
> > start_command.
> > +  */
> > + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, 
> > heapId);
> > + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, 
> > newIndexId);
> > + pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
> > +  
> > PROGRESS_CREATEIDX_PHASE_WAIT_1);
>
> First, updating PROGRESS_CREATEIDX_PHASE here to WAIT_1 makes no
> sense, because this is not a wait phase, and index_build() called
> within index_concurrently_build() will set this state correctly a bit
> after so IMO it is fine to leave that unset here, and the build phase
> is by far the bulk of the operation that needs tracking.  I think that
> you are also missing to update PROGRESS_CREATEIDX_COMMAND to
> PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY, and
> PROGRESS_CREATEIDX_ACCESS_METHOD_OID to the correct index AM OID,
> similarly to reindex_index().

Updating to WAIT_1 was meant to set it back to whatever the state was before
we got into the loop and start_command was called. I apparently
didn't take enough care to set all the fields that could be
set, though, so thanks for noticing.
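
For reference, the fuller set of updates being asked for might look roughly
like this (heapId, newIndexId and indexAmOid stand for whatever OIDs the
surrounding code has at hand; this is only a sketch of the point, not the
committed fix):

	pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId);
	/* restore the fields that start_command just cleared */
	pgstat_progress_update_param(PROGRESS_CREATEIDX_COMMAND,
								 PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY);
	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId);
	pgstat_progress_update_param(PROGRESS_CREATEIDX_ACCESS_METHOD_OID, indexAmOid);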

> > + /*
> > +  * Configure progress reporting to report for this index.
> > +  * While we're at it, reset the phase as it is cleared by 
> > start_command.
> > +  */
> > + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, 
> > heapId);
> > + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, 
> > newIndexId);
> > + pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
> > +  
> > PROGRESS_CREATEIDX_PHASE_WAIT_2);
> > +
> >   validate_index(heapId, newIndexId, snapshot);
>
> Basically the same three comments here: PROGRESS_CREATEIDX_PHASE set
> to WAIT_2 makes no real sense, and validate_index() would update the
> phase as it should be.  This should set PROGRESS_CREATEIDX_COMMAND to
> PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY, and
> PROGRESS_CREATEIDX_ACCESS_METHOD_OID to the correct index AM OID.

Basically the same response: I didn't take enough care to correctly
reset the state, while being conservative in what I did put back.

>
> While reviewing this code, I also noticed that the wait state set just
> before dropping the indexes was wrong.  The code was using WAIT_4, but
> this has to be WAIT_5.
>
> And this gives me the attached.  This also means that we still set the
> table and relation OID to the last index worked on for WAIT_2, WAIT_4
> and WAIT_5, but we cannot set the command start to relationOid either
> as given in input of ReindexRelationConcurrently() as this could be a
> single index given for REINDEX INDEX CONCURRENTLY.  Matthias, does
> that look right to you?

It looks great, thanks!

-Matthias




Re: FETCH FIRST clause PERCENT option

2020-09-25 Thread Surafel Temesgen
Hi Michael
On Thu, Sep 24, 2020 at 6:58 AM Michael Paquier  wrote:

> On Mon, Aug 10, 2020 at 01:23:44PM +0300, Surafel Temesgen wrote:
> > I also Implement PERCENT WITH TIES option. patch is attached
> > i don't start a new tread because the patches share common code
>
> This fails to apply per the CF bot.  Could you send a rebase?
> --
>


Attached are the rebased patches.

regards
Surafel
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index a31abce7c9..135b98da5d 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -3096,7 +3096,8 @@ estimate_path_cost_size(PlannerInfo *root,
 		if (fpextra && fpextra->has_limit)
 		{
 			adjust_limit_rows_costs(, _cost, _cost,
-	fpextra->offset_est, fpextra->count_est);
+	LIMIT_OPTION_COUNT, fpextra->offset_est,
+	fpextra->count_est);
 			retrieved_rows = rows;
 		}
 	}
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index b93e4ca208..9811380ec6 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
 [ LIMIT { count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
-[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]
+[ FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } { ONLY | WITH TIES } ]
 [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 where from_item can be one of:
@@ -1434,7 +1434,7 @@ OFFSET start
 which PostgreSQL also supports.  It is:
 
 OFFSET start { ROW | ROWS }
-FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES }
+FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } { ONLY | WITH TIES }
 
 In this syntax, the start
 or count value is required by
@@ -1444,6 +1444,9 @@ FETCH { FIRST | NEXT } [ count ] {
 ambiguity.
 If count is
 omitted in a FETCH clause, it defaults to 1.
+With PERCENT, count specifies the maximum number of rows to return
+as a percentage of the total row count, rounded up to the nearest integer. Using PERCENT
+is best suited to returning single-digit percentages of the query's total row count.
 The WITH TIES option is used to return any additional
 rows that tie for the last place in the result set according to
 the ORDER BY clause; ORDER BY
diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index b6e58e8493..68e0c5e2e1 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -344,7 +344,7 @@ F862	 in subqueries			YES
 F863	Nested  in 			YES	
 F864	Top-level  in views			YES	
 F865	 in 			YES	
-F866	FETCH FIRST clause: PERCENT option			NO	
+F866	FETCH FIRST clause: PERCENT option			YES	
 F867	FETCH FIRST clause: WITH TIES option			YES	
 R010	Row pattern recognition: FROM clause			NO	
 R020	Row pattern recognition: WINDOW clause			NO	
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index d85cf7d93e..cfb28a9fd4 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -21,6 +21,8 @@
 
 #include "postgres.h"
 
+#include 
+
 #include "executor/executor.h"
 #include "executor/nodeLimit.h"
 #include "miscadmin.h"
@@ -29,6 +31,8 @@
 static void recompute_limits(LimitState *node);
 static int64 compute_tuples_needed(LimitState *node);
 
+#define IsPercentOption(opt) \
+	(opt == LIMIT_OPTION_PERCENT || opt == LIMIT_OPTION_PER_WITH_TIES)
 
 /* 
  *		ExecLimit
@@ -53,6 +57,7 @@ ExecLimit(PlanState *pstate)
 	 */
 	direction = node->ps.state->es_direction;
 	outerPlan = outerPlanState(node);
+	slot = node->subSlot;
 
 	/*
 	 * The main logic is a simple state machine.
@@ -82,7 +87,15 @@ ExecLimit(PlanState *pstate)
 			/*
 			 * Check for empty window; if so, treat like empty subplan.
 			 */
-			if (node->count <= 0 && !node->noCount)
+			if (IsPercentOption(node->limitOption))
+			{
+if (node->percent == 0.0)
+{
+	node->lstate = LIMIT_EMPTY;
+	return NULL;
+}
+			}
+			else if (node->count <= 0 && !node->noCount)
 			{
 node->lstate = LIMIT_EMPTY;
 return NULL;
@@ -118,6 +131,16 @@ ExecLimit(PlanState *pstate)
 	break;
 			}
 
+			/*
+			 * We may need this tuple in a subsequent scan, so put it into the
+			 * tuplestore.
+			 */
+			if (IsPercentOption(node->limitOption))
+			{
+tuplestore_puttupleslot(node->tupleStore, slot);
+tuplestore_advance(node->tupleStore, true);
+			}
+
 			/*
 			 * Okay, we have the first tuple of the window.
 			 */
@@ -135,6 +158,85 @@ ExecLimit(PlanState *pstate)
 		case LIMIT_INWINDOW:
 			if (ScanDirectionIsForward(direction))
 			{
+/*
+ * In case we are coming back from a backward scan, the tuple is
+ * already in the tuple store.
+ */
+

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-25 Thread k.jami...@fujitsu.com
On Friday, September 25, 2020 6:02 PM, Tsunakawa-san wrote:

> From: Jamison, Kirk/ジャミソン カーク 
> > [Results]
> > Recovery/Failover performance (in seconds). 3 trial runs.
> >
> > | shared_buffers | master | patch  | %reg|
> > ||||-|
> > | 128MB  | 32.406 | 33.785 | 4.08%   |
> > | 1GB| 36.188 | 32.747 | -10.51% |
> > | 2GB| 41.996 | 32.88  | -27.73% |
> 
> Thanks for sharing good results.  We want to know if we can get as
> significant results as you gained before with hundreds of GBs of shared
> buffers, don't we?

Yes. But I don't have a high-spec machine I could use at the moment.
I'll see if I can get one by next week. Or if someone would like to reproduce the
test with their available higher-spec machines, it would be much appreciated.
The test case is upthread [1].

> > I also did similar benchmark performance as what Tomas did [1], simple
> > "pgbench -S" tests (warmup and then 15 x 1-minute runs with 1, 8 and
> > 16 clients, but I'm not sure if my machine is reliable enough to
> > produce reliable results for 8 clients and more.
> 
> Let me confirm just in case.  Your patch should not affect pgbench
> performance, but you measured it.  Is there anything you're concerned
> about?
> 

Not really. In the previous emails, the argument was the BufferAlloc
overhead, but we don't have it in the latest patch. Just in case somebody
asks about benchmark performance, I also posted the results.

[1] 
https://www.postgresql.org/message-id/OSBPR01MB2341683DEDE0E7A8D045036FEF360%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison




RE: Transactions involving multiple postgres foreign servers, take 2

2020-09-25 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> I don't think it's always possible to avoid raising errors in advance.
> Considering how postgres_fdw can implement your idea, I think
> postgres_fdw would need PG_TRY() and PG_CATCH() for its connection
> management. It has a connection cache in the local memory using HTAB.
> It needs to create an entry for the first time to connect (e.g., when
> prepare and commit prepared a transaction are performed by different
> processes) and it needs to re-connect the foreign server when the
> entry is invalidated. In both cases, ERROR could happen. I guess the
> same is true for other FDW implementations. Possibly other FDWs might
> need more work for example cleanup or releasing resources. I think

Why does the client backend have to create a new connection cache entry during
PREPARE or COMMIT PREPARED?  Doesn't the client backend naturally continue to
use connections that it has used in its current transaction?


> that the pros of your idea are to make the transaction manager simple
> since we don't need resolvers and launcher but the cons are to bring
> the complexity to FDW implementation codes instead. Also, IMHO I don't
> think it's safe way that FDW does neither re-throwing an error nor
> abort transaction when an error occurs.

No, I didn't say the resolver is unnecessary.  The resolver takes care of
terminating remote transactions when the client backend encounters an error
during COMMIT/ROLLBACK PREPARED.


> In terms of the performance you're concerned about, I wonder if we can somewhat
> eliminate the bottleneck if multiple resolvers are able to run on one
> database in the future. For example, if we could launch resolver
> processes as many as connections on the database, individual backend
> processes could have one resolver process. Since there would be
> contention and inter-process communication it still brings some
> overhead but it might be negligible comparing to network round trip.

Do you mean that if 200 concurrent clients each update data on two foreign
servers, there are 400 resolvers?  ...That's an overuse of resources.


Regards
Takayuki Tsunakawa




Re: Dynamic gathering the values for seq_page_cost/xxx_cost

2020-09-25 Thread Ashutosh Bapat
On Tue, Sep 22, 2020 at 10:57 AM Andy Fan  wrote:
>
>
> My tools set the random_page_cost to 8.6, but based on the fio data, it 
> should be
> set to 12.3 on the same hardware.  and I do see the better plan as well with 
> 12.3.
> Looks too smooth to believe it is true..
>
> The attached result_fio_mytool.tar.gz is my test result.   You can use git 
> show HEAD^^
> is the original plan with 8.6.  git show HEAD^  show the plan changes after 
> we changed
> the random_page_cost. git show HEAD shows the run time statistics changes for 
> these queries.
> I also uploaded the test tool [1] for this for your double check.

The scripts seem to start and stop the server and drop caches for every
query. That's where you are seeing that setting random_page_cost to the
fio-based ratio provides better plans. But in practice, these costs
need to be set on a server where the queries are run concurrently and
repeatedly. That's where the caching behaviour plays an important
role. Can we write a tool which can recommend costs for that scenario?
How do the fio-based costs perform when the queries are run repeatedly?

-- 
Best Wishes,
Ashutosh Bapat




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-25 Thread Bharath Rupireddy
On Tue, Sep 22, 2020 at 10:26 AM Greg Nancarrow  wrote:
>
> For cases where it can't be allowed (e.g. INSERT into a table with
> foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE
> ...") it at least allows parallelism of the SELECT part.
>

Thanks Greg for the patch.

I have a few points (inspired by the parallel copy feature work) to mention:

1. What if the target table is a foreign table or partitioned table?
2. What happens if the target table has triggers (before statement,
after statement, before row, after row) that are parallel unsafe?
3. Will each worker be doing single row insertions or multi inserts?
If single row insertions, will there be more buffer lock contention?
5. How does it behave with toast column values?
6. How does it behave if we have a RETURNING clause with INSERT INTO SELECT?

I'm looking forward to seeing some initial numbers on execution times
with and without patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-25 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> [Results]
> Recovery/Failover performance (in seconds). 3 trial runs.
> 
> | shared_buffers | master | patch  | %reg|
> ||||-|
> | 128MB  | 32.406 | 33.785 | 4.08%   |
> | 1GB| 36.188 | 32.747 | -10.51% |
> | 2GB| 41.996 | 32.88  | -27.73% |

Thanks for sharing good results.  We want to know if we can get as significant 
results as you gained before with hundreds of GBs of shared buffers, don't we?


> I also did similar benchmark performance as what Tomas did [1], simple
> "pgbench -S" tests (warmup and then 15 x 1-minute runs with 1, 8 and 16
> clients, but I'm not sure if my machine is reliable enough to produce reliable
> results for 8 clients and more.

Let me confirm just in case.  Your patch should not affect pgbench performance, 
but you measured it.  Is there anything you're concerned about?


Regards
Takayuki Tsunakawa






enable pg_stat_statements to track rows processed by REFRESH MATERIALIZED VIEW

2020-09-25 Thread Katsuragi Yuta

Hi,

pg_stat_statements tracks the number of rows processed
by some utility commands.
But, currently, it does not track the number of rows
processed by REFRESH MATERIALIZED VIEW.

Attached patch enables pg_stat_statements to track
processed rows by REFRESH MATERIALIZED VIEW.

Regards,
Katsuragi Yuta
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index e0edb134f3..2a303a7f07 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -530,8 +530,8 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
 
 --
 -- Track the total number of rows retrieved or affected by the utility
--- commands of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED VIEW
--- and SELECT INTO
+-- commands of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED VIEW,
+-- REFRESH MATERIALIZED VIEW and SELECT INTO
 --
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
@@ -543,6 +543,7 @@ CREATE TABLE pgss_ctas AS SELECT a, 'ctas' b FROM generate_series(1, 10) a;
 SELECT generate_series(1, 10) c INTO pgss_select_into;
 COPY pgss_ctas (a, b) FROM STDIN;
 CREATE MATERIALIZED VIEW pgss_matv AS SELECT * FROM pgss_ctas;
+REFRESH MATERIALIZED VIEW pgss_matv;
 BEGIN;
 DECLARE pgss_cursor CURSOR FOR SELECT * FROM pgss_matv;
 FETCH NEXT pgss_cursor;
@@ -586,10 +587,11 @@ SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query COLLATE
  FETCH FORWARD 5 pgss_cursor | 0 | 1 |5
  FETCH FORWARD ALL pgss_cursor   | 0 | 1 |7
  FETCH NEXT pgss_cursor  | 0 | 1 |1
+ REFRESH MATERIALIZED VIEW pgss_matv | 0 | 1 |   13
  SELECT generate_series(1, 10) c INTO pgss_select_into   | 0 | 1 |   10
  SELECT pg_stat_statements_reset()   | 0 | 1 |1
  SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" | 1 | 0 |0
-(12 rows)
+(13 rows)
 
 --
 -- Track user activity and reset them
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..1c2ac24cf6 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1173,11 +1173,12 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		/*
 		 * Track the total number of rows retrieved or affected by
 		 * the utility statements of COPY, FETCH, CREATE TABLE AS,
-		 * CREATE MATERIALIZED VIEW and SELECT INTO.
+		 * CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW and SELECT INTO.
 		 */
 		rows = (qc && (qc->commandTag == CMDTAG_COPY ||
 	   qc->commandTag == CMDTAG_FETCH ||
-	   qc->commandTag == CMDTAG_SELECT)) ?
+	   qc->commandTag == CMDTAG_SELECT ||
+	   qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
 			qc->nprocessed : 0;
 
 		/* calc differences of buffer counters. */
diff --git a/contrib/pg_stat_statements/sql/pg_stat_statements.sql b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
index 996a24a293..e9f5bb84e3 100644
--- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql
+++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
@@ -252,8 +252,8 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
 
 --
 -- Track the total number of rows retrieved or affected by the utility
--- commands of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED VIEW
--- and SELECT INTO
+-- commands of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED VIEW,
+-- REFRESH MATERIALIZED VIEW and SELECT INTO
 --
 SELECT pg_stat_statements_reset();
 
@@ -265,6 +265,7 @@ COPY pgss_ctas (a, b) FROM STDIN;
 13	copy
 \.
 CREATE MATERIALIZED VIEW pgss_matv AS SELECT * FROM pgss_ctas;
+REFRESH MATERIALIZED VIEW pgss_matv;
 BEGIN;
 DECLARE pgss_cursor CURSOR FOR SELECT * FROM pgss_matv;
 FETCH NEXT pgss_cursor;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..cdc2e62ad4 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -136,7 +136,7 @@ SetMatViewPopulatedState(Relation relation, bool newstate)
  */
 ObjectAddress
 ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
-   ParamListInfo params, QueryCompletion *qc)
+   ParamListInfo params, QueryCompletion *qc, uint64 *processed)
 {
 	Oid			matviewOid;
 	Relation	matviewRel;
@@ -147,7 +147,6 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
 	Oid			relowner;
 	Oid			OIDNewHeap;
 	DestReceiver *dest;
-	uint64		processed = 0;
 	bool		concurrent;
 	LOCKMODE	lockmode;
 	char	

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-25 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> No, during recovery also we need to be careful. We need to ensure that
> we use cached value during recovery and cached value is always
> up-to-date. We can't rely on lseek and I have provided some scenario
> up thread [1] where such behavior can cause problem and then see the
> response from Tom Lane why the same can be true for recovery as well.
> 
> The basic approach we are trying to pursue here is to rely on the
> cached value of 'number of blocks' (as that always gives correct value
> and even if there is a problem that will be our bug, we don't need to
> rely on OS for correct value and it will be better w.r.t performance
> as well). It is currently only possible during recovery so we are
> using it in recovery path and later once Thomas's patch to cache it
> for non-recovery cases is also done, we can use it for non-recovery
> cases as well.

Although I may still be confused, I understand that Kirk-san's patch should:

* Still focus on speeding up the replay of TRUNCATE during recovery.

* During recovery, DropRelFileNodeBuffers() gets the cached size of the 
relation fork.  If it is cached, trust it and optimize the buffer invalidation. 
 If it's not cached, we can't trust the return value of smgrnblocks() because 
it's the lseek(END) return value, so we avoid the optimization.

* Then, add a new function, say, smgrnblocks_cached() that simply returns the 
cached block count, and DropRelFileNodeBuffers() uses it instead of 
smgrnblocks().
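
A minimal sketch of that last point, assuming the smgr entry carries a
cached-size field (the field name smgr_cached_nblocks below is my assumption,
not necessarily what the posted patch uses):

static BlockNumber
smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
{
	/*
	 * Return the cached size as-is; InvalidBlockNumber means "not cached",
	 * in which case DropRelFileNodeBuffers() falls back to scanning all of
	 * shared buffers instead of taking the optimized invalidation path.
	 */
	return reln->smgr_cached_nblocks[forknum];
}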



Regards
Takayuki Tsunakawa




AppendStringInfoChar instead of appendStringInfoString

2020-09-25 Thread Hou, Zhijie
Hi

In (/src/backend/replication/backup_manifest.c)

I found the following code:

appendStringInfoString(, "\n");
appendStringInfoString(, "\"");

Since only a single character is appended in each case,
I think it would be better to call appendStringInfoChar.
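
For illustration, with a hypothetical StringInfoData buffer named buf (the
actual variable name in backup_manifest.c may differ), the change is simply:

	/* current: appends a one-character string, scanning for its length */
	appendStringInfoString(&buf, "\n");

	/* suggested: appends exactly one character, no length computation */
	appendStringInfoChar(&buf, '\n');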

Best regards,
houzj




0001-appendStringInfoChar-instead-of-appendStringInfoString.patch
Description: 0001-appendStringInfoChar-instead-of-appendStringInfoString.patch


Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-25 Thread Masahiko Sawada
On Thu, 24 Sep 2020 at 17:23, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > So with your idea, I think we require FDW developers to not call
> > ereport(ERROR) as much as possible. If they need to use a function
> > including palloc, lappend etc that could call ereport(ERROR), they
> > need to use PG_TRY() and PG_CATCH() and return the control along with
> > the error message to the transaction manager rather than raising an
> > error. Then the transaction manager will emit the error message at an
> > error level lower than ERROR (e.g., WARNING), and call commit/rollback
> > API again. But normally we do some cleanup on error but in this case
> > the retrying commit/rollback is performed without any cleanup. Is that
> > right? I’m not sure it’s safe though.
>
>
> Yes.  It's legitimate to require the FDW commit routine to return control, 
> because the prepare of 2PC is a promise to commit successfully.  The 
> second-phase commit should avoid doing that could fail.  For example, if some 
> memory is needed for commit, it should be allocated in prepare or before.
>

I don't think it's always possible to avoid raising errors in advance.
Considering how postgres_fdw can implement your idea, I think
postgres_fdw would need PG_TRY() and PG_CATCH() for its connection
management. It has a connection cache in the local memory using HTAB.
It needs to create an entry for the first time to connect (e.g., when
prepare and commit prepared a transaction are performed by different
processes) and it needs to re-connect the foreign server when the
entry is invalidated. In both cases, ERROR could happen. I guess the
same is true for other FDW implementations. Possibly other FDWs might
need more work for example cleanup or releasing resources. I think
that the pros of your idea are to make the transaction manager simple
since we don't need resolvers and launcher but the cons are to bring
the complexity to FDW implementation code instead. Also, IMHO it is not
a safe way for the FDW to neither re-throw an error nor
abort the transaction when an error occurs.

In terms of the performance you're concerned about, I wonder if we can somewhat
eliminate the bottleneck if multiple resolvers are able to run on one
database in the future. For example, if we could launch resolver
processes as many as connections on the database, individual backend
processes could have one resolver process. Since there would be
contention and inter-process communication it still brings some
overhead but it might be negligible comparing to network round trip.

Perhaps we can hear more opinions on that from other hackers to decide
the FDW transaction API design.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-25 Thread Dilip Kumar
On Fri, Sep 25, 2020 at 8:12 AM Amit Kapila  wrote:
>
> On Thu, Sep 24, 2020 at 6:33 PM Ashutosh Sharma  wrote:
> >
> > Hi Amit,
> >
> > > Here, I think instead of using MySubscription->stream, we should use
> > > server/walrecv version number as we used at one place in tablesync.c.
> >
> > Should subscribers be setting the LR protocol value based on what is
> > the publisher version it is communicating with or should it be set
> > based on whether streaming was enabled or not while creating that
> > subscription? AFAIU if we set this value based on the publisher
> > version (which is lets say >= 14), then it's quite possible that the
> > subscriber will start streaming changes for the in-progress
> > transactions even if the streaming was disabled while creating the
> > subscription, won't it?
> >
>
> No that won't happen because we send this option to the server
> (publisher in this case) only when version is >=14 and user has
> specified this option. See the below check in function
> libpqrcv_startstreaming()
> {
> ..
> if (options->proto.logical.streaming &&
> PQserverVersion(conn->streamConn) >= 14)
> appendStringInfo(, ", streaming 'on'");
> ..
> }

Ok, I have modified as per your suggestion.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v4-0001-Bugfix-in-logical-protocol-version.patch
Description: Binary data


Re: Feature improvement for FETCH tab completion

2020-09-25 Thread btnakamichin

On 2020-09-25 15:38, Fujii Masao wrote:

On 2020/09/25 14:24, btnakamichin wrote:

Hello!

I’d like to improve the FETCH tab completion.

The FETCH tab completion is currently incomplete. Therefore, this patch fixes the problem.

Previously, the code completed with one of FORWARD, BACKWARD, RELATIVE,
ABSOLUTE, but now it completes with one of FORWARD, BACKWARD, RELATIVE,
ABSOLUTE, ALL, NEXT, PRIOR, FIRST, LAST, and it also handles the later IN
and FROM clauses.


Thanks for the patch! Here are review comments.

+   /* Complete FETCH BACKWARD or FORWARD with one of ALL */
+   else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD"))
+   COMPLETE_WITH("ALL");

Not only "ALL" but also "FROM" and "IN" should be displayed here
because they also can follow "BACKWARD" and "FORWARD"?

else if (Matches("FETCH|MOVE", MatchAny, MatchAny))
+   else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE",
MatchAny))
+   COMPLETE_WITH("FROM", "IN");

This change seems to cause "FETCH FORWARD FROM " to display "FROM"
and "IN". To avoid this confusing tab-completion, we should use 
something like

MatchAnyExcept("FROM|IN") here, instead?

Regards,


I'm sorry, I forgot to include pgsql-hackers in the cc, so I'm resending it.

Thank you, I appreciate your comment.

I have attached patch with newline.

Regards,

NaokiNakamichi
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d877cc86f0..dafbae0366 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2911,7 +2911,8 @@ psql_completion(const char *text, int start, int end)
 
 /* DEALLOCATE */
 	else if (Matches("DEALLOCATE"))
-		COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements " UNION SELECT 'ALL'");
+		COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements 
+		" UNION SELECT 'ALL'");
 
 /* DECLARE */
 	else if (Matches("DECLARE", MatchAny))
@@ -3076,19 +3077,27 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("SELECT", "INSERT", "DELETE", "UPDATE", "DECLARE");
 
 /* FETCH && MOVE */
-	/* Complete FETCH with one of FORWARD, BACKWARD, RELATIVE */
+	/*
+	 * Complete FETCH with one of FORWARD, BACKWARD, RELATIVE, ALL, NEXT,
+	 * PRIOR, FIRST, LAST
+	 */
 	else if (Matches("FETCH|MOVE"))
-		COMPLETE_WITH("ABSOLUTE", "BACKWARD", "FORWARD", "RELATIVE");
-	/* Complete FETCH  with one of ALL, NEXT, PRIOR */
-	else if (Matches("FETCH|MOVE", MatchAny))
-		COMPLETE_WITH("ALL", "NEXT", "PRIOR");
+		COMPLETE_WITH("ABSOLUTE", "BACKWARD", "FORWARD", "RELATIVE", "ALL", "NEXT",
+		  "PRIOR", "FIRST", "LAST");
+	/* Complete FETCH BACKWARD or FORWARD with one of ALL, IN, FROM */
+	else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD"))
+		COMPLETE_WITH("ALL", "FROM", "IN");
 
 	/*
-	 * Complete FETCH   with "FROM" or "IN". These are equivalent,
+	 * Complete FETCH  with "FROM" or "IN". These are equivalent,
 	 * but we may as well tab-complete both: perhaps some users prefer one
 	 * variant or the other.
 	 */
-	else if (Matches("FETCH|MOVE", MatchAny, MatchAny))
+	else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE",
+	 MatchAnyExcept("FROM|IN")))
+		COMPLETE_WITH("FROM", "IN");
+
+	else if (Matches("FETCH|MOVE", "ALL|NEXT|PRIOR|FIRST|LAST"))
 		COMPLETE_WITH("FROM", "IN");
 
 /* FOREIGN DATA WRAPPER */


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-25 Thread k.jami...@fujitsu.com
Hi. 

> I'll send performance measurement results in the next email. Thanks a lot for
> the reviews!

Below are the performance measurement results.
I was only able to use a low-spec machine:
CPU 4v, Memory 8GB, RHEL, xfs filesystem.

[Failover/Recovery Test]
1. (Master) Create table (ex. 10,000 tables). Insert data to tables.
2. (M) DELETE FROM TABLE (ex. all rows of 10,000 tables)
3. (Standby) To test with failover, pause the WAL replay on standby server.
(SELECT pg_wal_replay_pause();)
4. (M) psql -c "\timing on" (measures total execution of SQL queries)
5. (M) VACUUM (whole db)
6. (M) After vacuum finishes, stop primary server: pg_ctl stop -w -mi
7. (S) Resume wal replay and promote standby.
Because it's difficult to measure recovery time, I used the attached script (resume.sh)
that prints timestamps before and after promotion. It basically does the following:
- "SELECT pg_wal_replay_resume();" is executed and the WAL application is 
resumed.
- "pg_ctl promote" to promote standby.
- The time difference of "select pg_is_in_recovery();" from "t" to "f" is 
measured.

[Results]
Recovery/Failover performance (in seconds). 3 trial runs.

| shared_buffers | master | patch  | %reg| 
||||-| 
| 128MB  | 32.406 | 33.785 | 4.08%   | 
| 1GB| 36.188 | 32.747 | -10.51% | 
| 2GB| 41.996 | 32.88  | -27.73% |

There's a bit of a regression with the default shared_buffers (128MB),
but as for the recovery time when we have large NBuffers, it's now at least
almost constant, so performance is boosted. IOW, we enter the optimization most of the time
during recovery.

I also did similar benchmark performance as what Tomas did [1],
simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with
1, 8 and 16 clients, but I'm not sure if my machine is reliable enough to
produce reliable results for 8 clients and more.

| #  | master  | patch   | %reg   | 
||-|-|| 
| 1 client   | 1676.937825 | 1707.018029 | -1.79% | 
| 8 clients  | 7706.835401 | 7529.089044 | 2.31%  | 
| 16 clients | 9823.65254  | 9991.184206 | -1.71% |


If there's additional/necessary performance measurement, kindly advise me too.
Thank you in advance.

[1] 
https://www.postgresql.org/message-id/flat/20200806213334.3bzadeirly3mdtzl%40development#473168a61e229de40eaf36326232f86c

Best regards,
Kirk Jamison


resume.sh
Description: resume.sh


run.sh
Description: run.sh


[PATCH] Runtime control of CLOBBER_CACHE_ALWAYS

2020-09-25 Thread Craig Ringer
Hi all

While working on extensions I've often wanted to enable cache clobbering
for a targeted piece of code, without paying the price of running it for
all backends during postgres startup and any initial setup tasks that are
required.

So here's a patch that, when CLOBBER_CACHE_ALWAYS or
CLOBBER_CACHE_RECURSIVE are defined, adds a GUC named clobber_cache_depth.
It defaults to 1 for CLOBBER_CACHE_ALWAYS or 3 for CLOBBER_CACHE_RECURSIVE
to match the existing compiled-in behaviour. But with this change it's now
possible to run Pg with clobber_cache_depth=0 then set it to 1 only for
targeted tests.

clobber_cache_depth is treated as an unknown GUC if Pg was not built with
CLOBBER_CACHE_ALWAYS or CLOBBER_CACHE_RECURSIVE defined.
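
A rough sketch of how the guc.c entry for such a variable could look (the
context, limits and wording here are illustrative guesses, not necessarily
what the attached patch does):

#if defined(CLOBBER_CACHE_ALWAYS) || defined(CLOBBER_CACHE_RECURSIVE)
	{
		{"clobber_cache_depth", PGC_USERSET, DEVELOPER_OPTIONS,
			gettext_noop("Forces a cache flush at each opportunity, to the given recursion depth."),
			NULL,
			GUC_NOT_IN_SAMPLE
		},
		&clobber_cache_depth,
		1, 0, 5,				/* boot value 1, 0 disables clobbering */
		NULL, NULL, NULL
	},
#endif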

-

On a side note, to make things like this easier to use, I personally patch
all pg_regress tests to include the following at the start of each sql
input file:

\set ECHO none
-- Put per-test settings or overrides here
\set ECHO queries

then patch the expected files accordingly. That way it's easy for me to
make per-test adjustments while still running the whole suite. It's not
always practical to run just one targeted test with TESTS=foo.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
From 372defde443d178fb4a9c8cf4092dea7debf72ea Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 22 Sep 2020 09:51:00 +0800
Subject: [PATCH v1] Add runtime control over CLOBBER_CACHE_ALWAYS

When CLOBBER_CACHE_ALWAYS or CLOBBER_CACHE_RECURSIVE are defined,
a new GUC "clobber_cache_depth" becomes available. This allows runtime
control over the behaviour of cache clobber builds in order to allow
more targeted testing of new or changed features using aggressive
cache clobbering.

clobber_cache_depth defaults to 1 if CLOBBER_CACHE_ALWAYS is defined,
and to 3 if CLOBBER_CACHE_RECURSIVE is defined. That makes it match
the previous hardcoded behaviour of these macros, ensuring buildfarm
members won't get upset.

While we're at it, expose the macros in pg_config_manual.h
---
 src/backend/utils/cache/inval.c | 51 +++--
 src/backend/utils/misc/guc.c| 15 ++
 src/include/pg_config_manual.h  | 28 --
 src/include/utils/inval.h   |  8 ++
 4 files changed, 78 insertions(+), 24 deletions(-)

diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 628d6f5d0c..620c9558ac 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -109,6 +109,7 @@
 #include "storage/sinval.h"
 #include "storage/smgr.h"
 #include "utils/catcache.h"
+#include "utils/guc.h"
 #include "utils/inval.h"
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
@@ -179,6 +180,9 @@ static SharedInvalidationMessage *SharedInvalidMessagesArray;
 static int	numSharedInvalidMessagesArray;
 static int	maxSharedInvalidMessagesArray;
 
+#ifdef CLOBBER_CACHE_ALWAYS
+int clobber_cache_depth = 0;
+#endif
 
 /*
  * Dynamically-registered callback functions.  Current implementation
@@ -689,35 +693,38 @@ AcceptInvalidationMessages(void)
 	/*
 	 * Test code to force cache flushes anytime a flush could happen.
 	 *
-	 * If used with CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS provides a
+	 * CLOBBER_CACHE_ALWAYS (clobber_cache_depth = 1) provides a
 	 * fairly thorough test that the system contains no cache-flush hazards.
 	 * However, it also makes the system unbelievably slow --- the regression
-	 * tests take about 100 times longer than normal.
-	 *
-	 * If you're a glutton for punishment, try CLOBBER_CACHE_RECURSIVELY. This
-	 * slows things by at least a factor of 10000, so I wouldn't suggest
-	 * trying to run the entire regression tests that way.  It's useful to try
-	 * a few simple tests, to make sure that cache reload isn't subject to
-	 * internal cache-flush hazards, but after you've done a few thousand
-	 * recursive reloads it's unlikely you'll learn more.
+	 * tests take about 100 times longer than normal. To mitigate the
+	 * slowdown it can be turned on and off per-backend or globally using
+	 * the clobber_cache_depth GUC to allow targeted testing of specific
+	 * code paths.
+	 *
+	 * If you're a glutton for punishment, try CLOBBER_CACHE_RECURSIVELY
+	 * (set clobber_cache_depth to > 1). This slows things by at least a
+	 * factor of 10000, so I wouldn't suggest trying to run the entire
+	 * regression tests that way.  It's useful to try a few simple tests,
+	 * to make sure that cache reload isn't subject to internal cache-flush
+	 * hazards, but after you've done a few thousand recursive reloads it's
+	 * unlikely you'll learn more.
+	 *
+	 * You can use postgresql.conf or any other convenient means to disable
+	 * clobbering by default then re-enable for targeted sections of tests,
+	 * e.g you can edit a specific pg_regress test to SET
+	 * clobber_cache_depth=1, then run the suite with:
+	 *
+	 * PGOPTIONS="-c clobber_cache_depth=0" 

Re: Get memory contexts of an arbitrary backend process

2020-09-25 Thread torikoshia

Hi,

Thanks for all your comments, I updated the patch.


On Tue, Sep 1, 2020 at 12:03 AM Kasahara Tatsuhito 
 wrote:



- How about managing the status of signal send/receive and dump
operations on a shared hash or others ?
Sending and receiving signals, dumping memory information, and
referencing dump information all work asynchronously.
Therefore, it would be good to have management information to
check
the status of each process.
A simple idea is that ..
- send a signal to dump to a PID, it first record following
information into the shared hash.
pid (specified pid)
loc (dump location, currently might be ASAP)
recv (did the pid process receive a signal? first false)
dumped (did the pid process dump a mem information?  first
false)
- specified process receive the signal, update the status in the
shared hash, then dumped at specified location.
- specified process finish dump mem information,  update the
status
in the shared hash.


I added a shared hash table consisting of minimal members,
mainly for managing whether the file is dumped or not.
Some members like 'loc' seem useful for the future, but I
haven't added them since they're not essential at this point.
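
A minimal sketch of such an entry (member names here are illustrative only;
see the attached patch for the actual definition):

typedef struct McxtDumpEntry
{
	pid_t		pid;			/* hash key: backend asked to dump */
	bool		is_dumped;		/* set by the target once the dump file is written */
} McxtDumpEntry;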



On 2020-09-01 10:54, Andres Freund wrote:


 CREATE VIEW pg_backend_memory_contexts AS
-SELECT * FROM pg_get_backend_memory_contexts();
+SELECT * FROM pg_get_backend_memory_contexts(-1);


-1 is odd. Why not use NULL or even 0?


Changed it from -1 to NULL.


+ rc = fwrite(_len, sizeof(int), 1, fpout);
+ rc = fwrite(name, name_len, 1, fpout);
+ rc = fwrite(, sizeof(int), 1, fpout);
+ rc = fwrite(clipped_ident, idlen, 1, fpout);
+ rc = fwrite(, sizeof(int), 1, fpout);
+ rc = fwrite(_len, sizeof(int), 1, fpout);
+ rc = fwrite(parent, parent_len, 1, fpout);
+ (void) rc;  /* we'll check for error with ferror */

+
+ }

This format is not descriptive. How about serializing to json or
something? Or at least having field names?


Added field names for each value.


+ while (true)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ pg_usleep(1L);
+


Need better signalling back/forth here.


Added a shared hash table that has a flag for managing whether the file 
is dumped or not.

I modified it to use this flag.


+ /*
+  * Open a temp file to dump the current memory context.
+  */
+ fpout = AllocateFile(tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+  errmsg("could not write temporary 
memory context file \"%s\": %m",

+ tmpfile)));
+ return;
+ }


Probably should be opened with O_CREAT | O_TRUNC?


AllocateFile() calls fopen() and AFAIU fopen() with mode "w" corresponds 
to open() with "O_WRONLY | O_CREAT | O_TRUNC".
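
For reference, a minimal sketch of that equivalence (per POSIX; path is just a
placeholder, and <stdio.h>/<fcntl.h> provide the declarations):

	/* fopen(path, "w") behaves like: */
	int		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
	FILE   *fp = fdopen(fd, "w");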


Do you mean I should not use fopen() here?

On 2020-09-24 13:01, Michael Paquier wrote:

On Thu, Sep 10, 2020 at 09:11:21PM +0900, Kasahara Tatsuhito wrote:

I think it's fine to have an interface to delete in an emergency, but
I agree that
users shouldn't be made aware of the existence or deletion of dump
files, basically.


Per the CF bot, the number of tests needs to be tweaked, because we
test each entry filtered out with is_deeply(), meaning that the number
of tests needs to be updated to reflect that if the filtered list is
changed:
t/010_pg_basebackup.pl ... 104/109 # Looks like you planned 109 tests
but ran 110.
t/010_pg_basebackup.pl ... Dubious, test returned 255 (wstat 65280, 
0xff00)

All 109 subtests passed

Simple enough to fix.


Incremented the number of tests.


Any thoughts?

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
From 4d3ff254a634895e8c23c83bb63f519a14785f06 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Fri, 25 Sep 2020 11:34:38 +0900
Subject: [PATCH] Enabled pg_get_backend_memory_contexts() to collect
arbitrary backend process's memory contexts.

Previously, pg_get_backend_memory_contexts() could only get the
local memory contexts. This patch enables getting the memory contexts
of an arbitrary process whose PID is specified by the argument.
---
 src/backend/access/transam/xlog.c|   7 +
 src/backend/catalog/system_views.sql |   4 +-
 src/backend/replication/basebackup.c |   3 +
 src/backend/storage/ipc/ipci.c   |   2 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/tcop/postgres.c  |   5 +
 src/backend/utils/adt/mcxtfuncs.c| 381 ++-
 src/backend/utils/init/globals.c |   1 +
 src/bin/initdb/initdb.c  |   3 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |   4 +-
 src/bin/pg_rewind/filemap.c  |   3 +
 src/include/catalog/pg_proc.dat  |  10 +-
 

Re: Handing off SLRU fsyncs to the checkpointer

2020-09-25 Thread Thomas Munro
On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro  wrote:
> Here's a new version.  The final thing I'm contemplating before
> pushing this is whether there may be hidden magical dependencies in
> the order of operations in CheckPointGuts(), which I've changed
> around.  Andres, any comments?

I nagged Andres off-list and he opined that it might be better to
reorder it a bit so that ProcessSyncRequests() comes after almost
everything else, so that if we ever teach more things to offload their
fsync work it'll be in the right order.  I reordered it like that; now
only CheckPointTwoPhase() comes later, based on the comment that
accompanies it.  In any case, we can always reconsider the ordering of
this function in later commits as required.  Pushed like that.
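
For the archives, the reordering described above amounts to roughly this shape
(an abbreviated sketch; see the committed CheckPointGuts() for the exact
sequence and tracing calls):

static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
	CheckPointRelationMap();
	CheckPointReplicationSlots();
	CheckPointSnapBuild();
	CheckPointLogicalRewriteHeap();
	CheckPointReplicationOrigin();

	/* Write out all dirty data in SLRUs and the main buffer pool */
	CheckPointCLOG();
	CheckPointCommitTs();
	CheckPointSUBTRANS();
	CheckPointMultiXact();
	CheckPointPredicate();
	CheckPointBuffers(flags);

	/* Perform all queued up fsyncs */
	ProcessSyncRequests();

	/* We deliberately delay 2PC checkpointing as long as possible. */
	CheckPointTwoPhase(checkPointRedo);
}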




Re: history file on replica and double switchover

2020-09-25 Thread Fujii Masao



On 2020/09/25 13:00, David Zhang wrote:

On 2020-09-24 5:00 p.m., Fujii Masao wrote:



On 2020/09/25 8:15, David Zhang wrote:

Hi,

My understanding is that the "archiver" won't even start if "archive_mode = on" has been 
set on a "replica". Therefore, either (XLogArchiveMode == ARCHIVE_MODE_ALWAYS) or (XLogArchiveMode 
!= ARCHIVE_MODE_OFF) will produce the same result.


Yes, the archiver isn't invoked in the standby when archive_mode=on.
But, with the original patch, walreceiver creates .ready archive status file
even when archive_mode=on and no archiver is running. This causes
the history file to be archived after the standby is promoted and
the archiver is invoked.

With my patch, walreceiver creates .ready archive status for the history file
only when archive_mode=always, like it does for the streamed WAL files.
This is the difference between those two patches. To prevent the streamed
timeline history file from being archived, IMO we should use the condition
archive_mode==always in the walreceiver.

+1





Please see how the "archiver" is started in src/backend/postmaster/postmaster.c

5227 /*
5228  * Start the archiver if we're responsible for 
(re-)archiving received
5229  * files.
5230  */
5231 Assert(PgArchPID == 0);
5232 if (XLogArchivingAlways())
5233 PgArchPID = pgarch_start();

I did run the nice script "double_switchover.sh" using either "always" or "on" on patch 
v1 and v2. They all generate the same results below. In other words, whether history file (0003.history) is 
archived or not depends on "archive_mode" settings.

echo "archive_mode = always" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:40 00010002
... ...
-rw--- 1 david david 16777216 Sep 24 14:40 0002000A
-rw--- 1 david david   41 Sep 24 14:40 0002.history
-rw--- 1 david david   83 Sep 24 14:40 0003.history

echo "archive_mode = on" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:47 00010002
... ...
-rw--- 1 david david 16777216 Sep 24 14:47 0002000A
-rw--- 1 david david   41 Sep 24 14:47 0002.history


Personally, I prefer patch v2 since it aligns with the statement here, 
https://www.postgresql.org/docs/13/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY

"If archive_mode is set to on, the archiver is not enabled during recovery or 
standby mode. If the standby server is promoted, it will start archiving after the 
promotion, but will not archive any WAL it did not generate itself."

By the way, I think the last part of the sentence should be changed to 
something like below:

"but will not archive any WAL, which was not generated by itself."


I'm not sure if this change is an improvement or not. But if we apply
the patch I proposed, maybe we should also mention the history file here.
For example, "but will not archive any WAL or timeline history files
that it did not generate itself".

makes sense to me


So I included this change in the patch. Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/high-availability.sgml 
b/doc/src/sgml/high-availability.sgml
index beb309e668..42f01c515f 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1395,7 +1395,8 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
  If archive_mode is set to on, the
  archiver is not enabled during recovery or standby mode. If the standby
  server is promoted, it will start archiving after the promotion, but
- will not archive any WAL it did not generate itself. To get a complete
+ will not archive any WAL or timeline history files that
+ it did not generate itself. To get a complete
  series of WAL files in the archive, you must ensure that all WAL is
  archived, before it reaches the standby. This is inherently true with
  file-based log shipping, as the standby can only restore files that
diff --git a/src/backend/replication/walreceiver.c 
b/src/backend/replication/walreceiver.c
index 17f1a49f87..32693c56db 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -758,6 +758,10 @@ WalRcvFetchTimeLineHistoryFiles(TimeLineID first, 
TimeLineID last)
 */
writeTimeLineHistoryFile(tli, content, len);
 
+   /* Mark history file as ready for archiving */
+   if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+   XLogArchiveNotify(fname);
+
pfree(fname);
pfree(content);
}


Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-25 Thread Michael Paquier
On Fri, Sep 25, 2020 at 01:36:44AM -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
>> However, again, the SCRAM 
>> implementation would already appear to fail that requirement because it 
>> uses a custom HMAC implementation, and HMAC is listed in FIPS 140-2 as a 
>> covered algorithm.
> 
> Ugh.  But is there any available FIPS-approved library code that could be
> used instead?

That's a good point, and I think that this comes down to using OpenSSL's
HMAC_* interface for this job when building with OpenSSL:
https://www.openssl.org/docs/man1.1.1/man3/HMAC.html

Worth noting that these have been deprecated in 3.0.0 as per the
rather-recent commit dbde472, where they recommend the use of
EVP_MAC_*() instead.
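
For reference, a minimal sketch of computing an HMAC through that interface
(key, keylen, data and datalen are placeholders; SCRAM-SHA-256 would use
EVP_sha256()):

#include <openssl/evp.h>
#include <openssl/hmac.h>

	unsigned char result[EVP_MAX_MD_SIZE];
	unsigned int  resultlen = 0;

	/* one-shot HMAC of "data" (datalen bytes) keyed with "key" (keylen bytes) */
	if (HMAC(EVP_sha256(), key, keylen, data, datalen, result, &resultlen) == NULL)
	{
		/* handle failure, e.g. report an error to the caller */
	}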
--
Michael


signature.asc
Description: PGP signature


Re: [patch] Fix checksum verification in base backups for zero page headers

2020-09-25 Thread Michael Banck
Hi,

Am Donnerstag, den 24.09.2020, 17:24 +0300 schrieb Anastasia
Lubennikova:
> So I mark this one as ReadyForCommitter.

Thanks!

> The only minor problem is a typo (?) in the proposed commit message.
> "If a page is all zero, consider that a checksum failure." It should be 
> "If a page is NOT all zero...".

Oh right, I've fixed up the commit message in the attached V4.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From af83f4a42403e8a994e101086affafa86d67b52a Mon Sep 17 00:00:00 2001
From: Michael Banck 
Date: Tue, 22 Sep 2020 16:09:36 +0200
Subject: [PATCH] Fix checksum verification in base backups for zero page
 headers

We currently do not checksum a page if it is considered new by PageIsNew().
However, this means that if the whole page header is zero, PageIsNew() will
consider that page new due to the pd_upper field being zero and no subsequent
checksum verification is done.

To fix, factor out the all-zeroes check from PageIsVerified() into a new method
PageIsZero() and call that for pages considered new by PageIsNew(). If a page
is not totally empty, consider that a checksum failure.

Add one test to the pg_basebackup TAP tests to check this error.

Reported-By: Andres Freund
Reviewed-By: Asif Rehman, Anastasia Lubennikova
Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de
---
 src/backend/replication/basebackup.c | 30 +
 src/backend/storage/page/bufpage.c   | 35 +++-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 18 +-
 src/include/storage/bufpage.h| 11 +++---
 4 files changed, 73 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..c82765a429 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1746,6 +1746,36 @@ sendFile(const char *readfilename, const char *tarfilename,
 			"be reported", readfilename)));
 	}
 }
+else
+{
+	/*
+	 * We skip completely new pages after checking they are
+	 * all-zero, since they don't have a checksum yet.
+	 */
+	if (PageIsNew(page) && !PageIsZero(page))
+	{
+		/*
+		 * pd_upper is zero, but the page is not all zero.  We
+		 * cannot run pg_checksum_page() on the page as it
+		 * would throw an assertion failure.  Consider this a
+		 * checksum failure.
+		 */
+		checksum_failures++;
+
+		if (checksum_failures <= 5)
+			ereport(WARNING,
+	(errmsg("checksum verification failed in "
+			"file \"%s\", block %d: pd_upper "
+			"is zero but page is not all-zero",
+			readfilename, blkno)));
+		if (checksum_failures == 5)
+			ereport(WARNING,
+	(errmsg("further checksum verification "
+			"failures in file \"%s\" will not "
+			"be reported", readfilename)));
+	}
+}
+
 block_retry = false;
 blkno++;
 			}
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 4bc2bf955d..8be3cd6190 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -82,11 +82,8 @@ bool
 PageIsVerified(Page page, BlockNumber blkno)
 {
 	PageHeader	p = (PageHeader) page;
-	size_t	   *pagebytes;
-	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
-	bool		all_zeroes = false;
 	uint16		checksum = 0;
 
 	/*
@@ -120,18 +117,7 @@ PageIsVerified(Page page, BlockNumber blkno)
 	}
 
 	/* Check all-zeroes case */
-	all_zeroes = true;
-	pagebytes = (size_t *) page;
-	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
-	{
-		if (pagebytes[i] != 0)
-		{
-			all_zeroes = false;
-			break;
-		}
-	}
-
-	if (all_zeroes)
+	if (PageIsZero(page))
 		return true;
 
 	/*
@@ -154,6 +140,25 @@ PageIsVerified(Page page, BlockNumber blkno)
 	return false;
 }
 
+/*
+ * PageIsZero
+ *		Check that the page consists only of zero bytes.
+ *
+ */
+bool
+PageIsZero(Page page)
+{
+	int			i;
+	size_t	   *pagebytes = (size_t *) page;
+
+	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
+	{
+		if (pagebytes[i] != 0)
+			return false;
+	}
+
+	return true;
+}
 
 /*
  *	PageAddItemExtended
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..f5735569c5 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 109;
+use 

Re: [patch] Concurrent table reindex per-index progress reporting

2020-09-25 Thread Michael Paquier
On Thu, Sep 24, 2020 at 09:19:18PM +0200, Matthias van de Meent wrote:
> While working on a PG12-instance I noticed that the progress reporting of
> concurrent index creation for non-index relations fails to update the
> index/relation OIDs that it's currently working on in the
> pg_stat_progress_create_index view after creating the indexes. Please find
> attached a patch which properly sets these values at the appropriate places.
> 
> Any thoughts?

I agree that this is a bug and that we had better report correctly the
heap and index IDs worked on, as these could also be part of a toast
relation from the parent table reindexed.  However, your patch is
visibly incorrect in the two places you are updating.

> +  * Configure progress reporting to report for this index.
> +  * While we're at it, reset the phase as it is cleared by 
> start_command.
> +  */
> + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, 
> heapId);
> + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, 
> newIndexId);
> + pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
> +  
> PROGRESS_CREATEIDX_PHASE_WAIT_1);

First, updating PROGRESS_CREATEIDX_PHASE here to WAIT_1 makes no
sense, because this is not a wait phase, and index_build() called
within index_concurrently_build() will set this state correctly a bit
after so IMO it is fine to leave that unset here, and the build phase
is by far the bulk of the operation that needs tracking.  I think that
you are also missing to update PROGRESS_CREATEIDX_COMMAND to 
PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY, and
PROGRESS_CREATEIDX_ACCESS_METHOD_OID to the correct index AM OID,
similarly to reindex_index().

> + /*
> +  * Configure progress reporting to report for this index.
> +  * While we're at it, reset the phase as it is cleared by 
> start_command.
> +  */
> + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, 
> heapId);
> + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, 
> newIndexId);
> + pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
> +  
> PROGRESS_CREATEIDX_PHASE_WAIT_2);
> +
>   validate_index(heapId, newIndexId, snapshot);

Basically the same three comments here: PROGRESS_CREATEIDX_PHASE set
to WAIT_2 makes no real sense, and validate_index() would update the
phase as it should be.  This should set PROGRESS_CREATEIDX_COMMAND to
PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY, and 
PROGRESS_CREATEIDX_ACCESS_METHOD_OID to the correct index AM OID.

While reviewing this code, I also noticed that the wait state set just
before dropping the indexes was wrong.  The code was using WAIT_4, but
this has to be WAIT_5.

And this gives me the attached.  This also means that we still set the
table and relation OID to the last index worked on for WAIT_2, WAIT_4
and WAIT_5, but we cannot use the relationOid given in input of
ReindexRelationConcurrently() for the command start either, as this
could be a single index given for REINDEX INDEX CONCURRENTLY.  Matthias,
does that look right to you?
--
Michael
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f1b5f87e6a..d43051aea9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3409,6 +3409,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		Oid			oldIndexId = lfirst_oid(lc);
 		Oid			newIndexId = lfirst_oid(lc2);
 		Oid			heapId;
+		Oid			indexam;
 
 		/* Start new transaction for this index's concurrent build */
 		StartTransactionCommand();
@@ -3429,8 +3430,21 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		 */
 		indexRel = index_open(oldIndexId, ShareUpdateExclusiveLock);
 		heapId = indexRel->rd_index->indrelid;
+		/* The access method of the old and new indexes match */
+		indexam = indexRel->rd_rel->relam;
 		index_close(indexRel, NoLock);
 
+		/*
+		 * Update progress for the index to build, with the correct parent
+		 * table involved.
+		 */
+		pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_COMMAND,
+	 PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_ACCESS_METHOD_OID,
+	 indexam);
+
 		/* Perform concurrent build of new index */
 		index_concurrently_build(heapId, newIndexId);
 
@@ -3458,6 +3472,8 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		Oid			heapId;
 		TransactionId limitXmin;
 		Snapshot	snapshot;
+		Relation	newIndexRel;
+		Oid			indexam;
 
 		StartTransactionCommand();
 
@@ -3468,8 +3484,6 @@ ReindexRelationConcurrently(Oid relationOid, 

Re: Feature improvement for FETCH tab completion

2020-09-25 Thread Fujii Masao




On 2020/09/25 14:24, btnakamichin wrote:

Hello!

I’d like to improve the FETCH tab completion.

The FETCH tab completion currently offers only a limited set of keywords. 
Therefore, this patch fixes the problem.

Previously the completion offered one of FORWARD, BACKWARD, RELATIVE, or 
ABSOLUTE; now it offers one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE, ALL, 
NEXT, PRIOR, FIRST, or LAST, and it also completes the subsequent IN and 
FROM clauses.


Thanks for the patch! Here are review comments.

+   /* Complete FETCH BACKWARD or FORWARD with one of ALL */
+   else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD"))
+   COMPLETE_WITH("ALL");

Not only "ALL" but also "FROM" and "IN" should be displayed here
because they also can follow "BACKWARD" and "FORWARD"?

else if (Matches("FETCH|MOVE", MatchAny, MatchAny))
+   else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE", MatchAny))
+   COMPLETE_WITH("FROM", "IN");

This change seems to cause "FETCH FORWARD FROM " to display "FROM"
and "IN". To avoid this confusing tab-completion, we should use something like
MatchAnyExcept("FROM|IN") here, instead?
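
For illustration, the rule might then end up looking something like this
(a sketch only, not a tested patch, assuming MatchAnyExcept behaves as it
does for the other completion rules in tab-complete.c):

	/* sketch: offer FROM/IN only when the third word is not already one of them */
	else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE",
					 MatchAnyExcept("FROM|IN")))
		COMPLETE_WITH("FROM", "IN");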

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Load TIME fields - proposed performance improvement

2020-09-25 Thread Peter Smith
The patch has been re-implemented based on previous advice.

Please see attached.

~

Test:

A test table was created and 20 million rows inserted as follows:

test=# create table t1 (id int, a timestamp, b time without time zone
default '01:02:03', c date default CURRENT_DATE, d time with time zone
default CURRENT_TIME, e time with time zone default LOCALTIME);
CREATE TABLE

$ time psql -d test -c "insert into t1(id, a)
values(generate_series(1,20000000), timestamp 'now');"

~

Observations:

BEFORE PATCH

perf results
6.18% GetSQLCurrentTime
5.73% GetSQLCurrentDate
5.20% GetSQLLocalTime
4.67% GetCurrentDateTime
-.--% GetCurrentTimeUsec

elapsed time
Run1 1m57s
Run2 1m58s
Run3 2m00s

AFTER PATCH

perf results
1.77% GetSQLCurrentTime
0.12% GetSQLCurrentDate
0.50% GetSQLLocalTime
0.36% GetCurrentDateTime
-.--% GetCurrentTimeUsec

elapsed time
Run1 1m36s
Run2 1m36s
Run3 1m36s

(represents a 19% improvement for this worst-case table/data)

~

Note: I patched the function GetCurrentTimeUsec consistently with the
others, but actually I was not able to discover any SQL syntax which
could cause that function to be invoked multiple times. Perhaps the
patch for that function should be removed?
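
For readers without the attachment, here is a rough sketch of the caching
idea (the function and variable names below are illustrative assumptions,
not the attached patch; a real version would also need to invalidate the
cache when the session timezone changes):

#include "postgres.h"

#include "access/xact.h"		/* GetCurrentTransactionStartTimestamp() */
#include "pgtime.h"				/* struct pg_tm, session_timezone */
#include "utils/timestamp.h"	/* TimestampTz, fsec_t, timestamp2tm() */

/* Cache of the last transaction-start timestamp broken down into pg_tm. */
static TimestampTz cache_ts = 0;
static struct pg_tm cache_tm;
static fsec_t	cache_fsec;
static int		cache_tz;
static bool		cache_valid = false;

static void
GetCurrentDateTimeCached(struct pg_tm *tm)
{
	TimestampTz now = GetCurrentTransactionStartTimestamp();

	/* Redo the expensive breakdown only when the timestamp changes. */
	if (!cache_valid || now != cache_ts)
	{
		if (timestamp2tm(now, &cache_tz, &cache_tm, &cache_fsec,
						 NULL, session_timezone) != 0)
			ereport(ERROR,
					(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
					 errmsg("timestamp out of range")));
		cache_ts = now;
		cache_valid = true;
	}

	*tm = cache_tm;
}

Since the transaction-start timestamp is constant for the whole INSERT,
the 20 million default-expression evaluations then reuse one conversion.
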

---

Kind Regards,
Peter Smith
Fujitsu Australia

On Tue, Sep 22, 2020 at 1:06 PM Peter Smith  wrote:
>
> Hi Tom.
>
> Thanks for your feedback.
>
> On Tue, Sep 22, 2020 at 12:44 PM Tom Lane  wrote:
>
> > Still, for the size of the patch I'm envisioning, it'd be well
> > worth the trouble.
>
> The OP patch I gave was just a POC to test the effect and to see if
> the idea was judged as worthwhile...
>
> I will rewrite/fix it based on your suggestions.
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia.


PS_cache_pg_tm-v01.patch
Description: Binary data


Problem of ko.po on branch REL9_5_STABLE

2020-09-25 Thread Yang, Rong
Hi,

When I checked the encoding of the PO files, I noticed that
src/bin/pg_config/po/ko.po on the REL9_5_STABLE branch seemed a little different.
The ‘Content-Type’ of this file and the file’s encoding are different from
those under other modules, as follows:

src/bin/pg_config/po/ko.po:
   "Content-Type: text/plain; charset=euc-kr\n"
src/bin/initdb/po/ko.po:
   "Content-Type: text/plain; charset=UTF-8\n"

    These are even different from the other branches, as follows:

REL9_5_STABLE src/bin/pg_config/po/ko.po:
   "Content-Type: text/plain; charset=euc-kr\n"
REL9_6_STABLE src/bin/pg_config/po/ko.po
   "Content-Type: text/plain; charset=UTF-8\n"
 
Is there any particular reason for doing this? Or are there any
compatibility concerns?
Thanks in advance.


--
Yang Rong
Development Department II Software Division II Nanjing Fujitsu Nanda Software 
Tech. Co., Ltd.(FNST)
ADDR.: No.6 Wenzhu Road, Software Avenue,
  Nanjing, 210012, China
TEL  : +86+25-86630566-9431
COINS: 7998-9431
FAX  : +86+25-83317685
MAIL: yangr.f...@cn.fujitsu.com
--






Re: Feature improvement of tab completion for DEALLOCATE

2020-09-25 Thread btnakamichin

On 2020-09-25 14:30, Fujii Masao wrote:

On 2020/09/25 13:45, btnakamichin wrote:

Hello!

I’d like to improve the deallocate tab completion.

The deallocate tab completion can’t complement “ALL”. Therefore, this 
patch fixes the problem.


Thanks for the patch! It looks basically good, but I think it's better
to add newline just before UNION to avoid long line, as follows.

-   COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements);
+   COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements
+                       " UNION SELECT 'ALL'");


Regards,


Thank you, I appreciate your comment.

I have attached pattch with newline.

Regards,

Naoki Nakamichi

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d877cc86f0..75f81d66dc 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2911,7 +2911,8 @@ psql_completion(const char *text, int start, int end)
 
 /* DEALLOCATE */
 	else if (Matches("DEALLOCATE"))
-		COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements " UNION SELECT 'ALL'");
+		COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements 
+		" UNION SELECT 'ALL'");
 
 /* DECLARE */
 	else if (Matches("DECLARE", MatchAny))


Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-25 Thread Ashutosh Sharma
On Fri, Sep 25, 2020 at 8:12 AM Amit Kapila  wrote:
>
> On Thu, Sep 24, 2020 at 6:33 PM Ashutosh Sharma  wrote:
> >
> > Hi Amit,
> >
> > > Here, I think instead of using MySubscription->stream, we should use
> > > server/walrecv version number as we used at one place in tablesync.c.
> >
> > Should subscribers be setting the LR protocol value based on what is
> > the publisher version it is communicating with or should it be set
> > based on whether streaming was enabled or not while creating that
> > subscription? AFAIU if we set this value based on the publisher
> > version (which is lets say >= 14), then it's quite possible that the
> > subscriber will start streaming changes for the in-progress
> > transactions even if the streaming was disabled while creating the
> > subscription, won't it?
> >
>
> No that won't happen because we send this option to the server
> (publisher in this case) only when version is >=14 and user has
> specified this option. See the below check in function
> libpqrcv_startstreaming()
> {
> ..
> if (options->proto.logical.streaming &&
> PQserverVersion(conn->streamConn) >= 14)
> appendStringInfo(&cmd, ", streaming 'on'");
> ..
> }
>

Ah, okay, understood, thanks. So, that means we can use the streaming
protocol version if the server (publisher) supports it, regardless of
whether the client (subscriber) has the streaming option enabled or
not.
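
So, roughly, the apply worker could pick the protocol version like below
(a sketch only; the variable, macro, and constant names here are my
assumptions about what the tree provides, not a tested change):

	/*
	 * Sketch: request the streaming-capable protocol version whenever the
	 * publisher is new enough, independently of the subscription's
	 * streaming option, which is still what controls actual streaming.
	 */
	options.proto.logical.proto_version =
		walrcv_server_version(wrconn) >= 140000 ?
		LOGICALREP_PROTO_STREAM_VERSION_NUM :
		LOGICALREP_PROTO_VERSION_NUM;
	options.proto.logical.streaming = MySubscription->stream;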

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com