Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Mar 12, 2021 at 12:07 PM tsunakawa.ta...@fujitsu.com wrote: > > From: Kyotaro Horiguchi > > About the patch, it would be better to change the type of > > BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current > > value > > doesn't harm. > > OK, attached, to be prepared for the distant future when NBuffers becomes > 64-bit. > Thanks for the patch. Pushed! -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
> From: Tsunakawa, Takayuki/綱川 貴之 > From: Kyotaro Horiguchi > > About the patch, it would be better to change the type of > > BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current > value > > doesn't harm. > > OK, attached, to be prepared for the distant future when NBuffers becomes > 64-bit. Thank you, Tsunakawa-san, for sending the quick fix. (I failed to notice it on my thread.) Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > About the patch, it would be better to change the type of > BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current > value > doesn't harm. OK, attached, to be prepared for the distant future when NBuffers becomes 64-bit. Regards Takayuki Tsunakawa v2-0001-Fix-overflow-when-counting-the-number-of-buffers-.patch
Re: [Patch] Optimize dropping of relation buffers using dlist
At Fri, 12 Mar 2021 05:26:02 +0000, "tsunakawa.ta...@fujitsu.com" wrote in > From: Thomas Munro > > > uint64 > > > > +1 > > Thank you, the patch is attached (we tend to forget how large our world is... > 64-bit) We're sorry to cause you trouble. BUF_DROP_FULL_SCAN_THRESHOLD cannot be larger than an int since NBuffers is an int, but nBlocksToInvalidate being uint32 looks somewhat too tight. So +1 for changing it to uint64. We need to fill the whole block[file][fork] array in DropRelFileNodesAllBuffers, so we cannot bail out of the counting loop. We could bail out in DropRelFileNodeBuffers, but that doesn't seem so effective. So I vote for uint64 and not bailing out. About the patch, it would be better to change the type of BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current value doesn't harm. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Thomas Munro > > uint64 > > +1 Thank you, the patch is attached (we tend to forget how large our world is... 64-bit) We're sorry to cause you trouble. Regards Takayuki Tsunakawa v1-0001-Fix-overflow-when-counting-the-number-of-buffers-.patch
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Thomas Munro > On Fri, Mar 12, 2021 at 5:20 PM Amit Kapila wrote: > > uint64 > > +1 +1 I'll send a patch later. Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Mar 12, 2021 at 5:20 PM Amit Kapila wrote: > uint64 +1
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Mar 12, 2021 at 4:58 AM Thomas Munro wrote: > > While rebasing CF #2933 (which drops the _cached stuff and makes this > optimisation always available, woo), I happened to notice that we're > summing the size of many relations and forks into a variable > nBlocksToInvalidate of type BlockNumber. That could overflow. > I also think so. I think we have two ways to address that: (a) check immediately after each time we add blocks to nBlocksToInvalidate to see if it crosses the threshold value BUF_DROP_FULL_SCAN_THRESHOLD and if so, then just break the loop; (b) change the variable type to uint64. Any better ideas? -- With Regards, Amit Kapila.
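Both remedies can be sketched in isolation. In this standalone sketch, the typedef and the helper names (sum_blocks32, sum_blocks64, exceeds_threshold) are illustrative stand-ins, not the actual bufmgr.c code: option (b) widens the accumulator, while option (a) stops counting once the threshold is crossed, since the exact total no longer matters at that point.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the PostgreSQL typedef. */
typedef uint32_t BlockNumber;

/* The buggy pattern: the running total is itself a BlockNumber,
 * so it silently wraps modulo 2^32. */
static BlockNumber sum_blocks32(const BlockNumber *sizes, int n)
{
    BlockNumber total = 0;
    for (int i = 0; i < n; i++)
        total += sizes[i];
    return total;
}

/* Option (b): widen the accumulator to uint64 so that summing the
 * same per-fork sizes cannot wrap for any realistic relation count. */
static uint64_t sum_blocks64(const BlockNumber *sizes, int n)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++)
        total += sizes[i];
    return total;
}

/* Option (a): bail out of the counting loop as soon as the threshold
 * is crossed -- past that point a full scan happens anyway. */
static int exceeds_threshold(const BlockNumber *sizes, int n, uint64_t threshold)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++)
    {
        total += sizes[i];
        if (total >= threshold)
            return 1;
    }
    return 0;
}
```

With 8kB blocks, a uint32 block count wraps once the summed sizes exceed 2^32 blocks (about 32TB of relation data across the dropped relations), which is why the thread settled on the uint64 accumulator.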
Re: [Patch] Optimize dropping of relation buffers using dlist
While rebasing CF #2933 (which drops the _cached stuff and makes this optimisation always available, woo), I happened to notice that we're summing the size of many relations and forks into a variable nBlocksToInvalidate of type BlockNumber. That could overflow.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wed, January 13, 2021 2:15 PM (JST), Amit Kapila wrote: > On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi > wrote: > > > > At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila > > wrote in > > > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi > > > wrote: > > > > > > > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" > wrote in: > > > > > > Thanks for the detailed tests. NBuffers/32 seems like an > > > > > > appropriate value for the threshold based on these results. I > > > > > > would like to slightly modify part of the commit message in > > > > > > the first patch as below [1], otherwise, I am fine with the > > > > > > changes. Unless you or anyone else has any more comments, I am > > > > > > planning to push the 0001 and 0002 sometime next week. > > > > > > > > > > > > [1] > > > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so > > > > > > that scanning of the whole buffer pool can be avoided when the > > > > > > number of blocks to be truncated in a relation is below a > > > > > > certain threshold. For such cases, we find the buffers by doing > lookups in BufMapping table. > > > > > > This improves the performance by more than 100 times in many > > > > > > cases when several small tables (tested with 1000 relations) > > > > > > are truncated and where the server is configured with a large > > > > > > value of shared buffers (greater than 100GB)." > > > > > > > > > > Thank you for taking a look at the results of the tests. And > > > > > it's also consistent with the results from Tang too. > > > > > The commit message LGTM. > > > > +1. > > > > > > I have pushed the 0001. > > Thank you for commiting this. > > Pushed 0002 as well. > Thank you very much for committing those two patches, and for everyone here who contributed in simplifying the approaches, code reviews, testing, etc. I compiled with --enable-coverage and checked whether the newly-added code and updated parts were covered by tests.
Yes, the lines were hit, including the updated lines of DropRelFileNodeBuffers(), DropRelFileNodesAllBuffers(), smgrdounlinkall(), and smgrnblocks(). The newly added APIs were covered too: FindAndDropRelFileNodeBuffers() and smgrnblocks_cached(). However, the parts where UnlockBufHdr(bufHdr, buf_state) is called were not hit. But I noticed the same holds for previously existing functions in bufmgr.c. Thank you very much again. Regards, Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi wrote: > > At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila > wrote in > > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi > > wrote: > > > > > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" > > > wrote in: > > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > > > > > value for the threshold based on these results. I would like to > > > > > slightly modify part of the commit message in the first patch as below > > > > > [1], otherwise, I am fine with the changes. Unless you or anyone else > > > > > has any more comments, I am planning to push the 0001 and 0002 > > > > > sometime next week. > > > > > > > > > > [1] > > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so that > > > > > scanning of the whole buffer pool can be avoided when the number of > > > > > blocks to be truncated in a relation is below a certain threshold. For > > > > > such cases, we find the buffers by doing lookups in BufMapping table. > > > > > This improves the performance by more than 100 times in many cases > > > > > when several small tables (tested with 1000 relations) are truncated > > > > > and where the server is configured with a large value of shared > > > > > buffers (greater than 100GB)." > > > > > > > > Thank you for taking a look at the results of the tests. And it's also > > > > consistent with the results from Tang too. > > > > The commit message LGTM. > > > > > > +1. > > > > > > > I have pushed the 0001. > > Thank you for commiting this. > Pushed 0002 as well. -- With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila wrote in > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi > wrote: > > > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" > > wrote in: > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > > > > value for the threshold based on these results. I would like to > > > > slightly modify part of the commit message in the first patch as below > > > > [1], otherwise, I am fine with the changes. Unless you or anyone else > > > > has any more comments, I am planning to push the 0001 and 0002 > > > > sometime next week. > > > > > > > > [1] > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so that > > > > scanning of the whole buffer pool can be avoided when the number of > > > > blocks to be truncated in a relation is below a certain threshold. For > > > > such cases, we find the buffers by doing lookups in BufMapping table. > > > > This improves the performance by more than 100 times in many cases > > > > when several small tables (tested with 1000 relations) are truncated > > > > and where the server is configured with a large value of shared > > > > buffers (greater than 100GB)." > > > > > > Thank you for taking a look at the results of the tests. And it's also > > > consistent with the results from Tang too. > > > The commit message LGTM. > > > > +1. > > > > I have pushed the 0001. Thank you for committing this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi wrote: > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" > wrote in: > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > > > value for the threshold based on these results. I would like to > > > slightly modify part of the commit message in the first patch as below > > > [1], otherwise, I am fine with the changes. Unless you or anyone else > > > has any more comments, I am planning to push the 0001 and 0002 > > > sometime next week. > > > > > > [1] > > > "The recovery path of DropRelFileNodeBuffers() is optimized so that > > > scanning of the whole buffer pool can be avoided when the number of > > > blocks to be truncated in a relation is below a certain threshold. For > > > such cases, we find the buffers by doing lookups in BufMapping table. > > > This improves the performance by more than 100 times in many cases > > > when several small tables (tested with 1000 relations) are truncated > > > and where the server is configured with a large value of shared > > > buffers (greater than 100GB)." > > > > Thank you for taking a look at the results of the tests. And it's also > > consistent with the results from Tang too. > > The commit message LGTM. > > +1. > I have pushed the 0001. -- With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" wrote in > On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote: > > > > On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com > > wrote: > > > > > > [Results for VACUUM on single relation] > > > Average of 5 runs. > > > > > > 1. % REGRESSION > > > % Regression: (patched - master)/master > > > > > > | rel_size | 128MB | 1GB| 20GB | 100GB| > > > |--||||--| > > > | NB/512 | 0.000% | 0.000% | 0.000% | -32.680% | > > > | NB/256 | 0.000% | 0.000% | 0.000% | 0.000% | > > > | NB/128 | 0.000% | 0.000% | 0.000% | -16.502% | > > > | NB/64| 0.000% | 0.000% | 0.000% | -9.841% | > > > | NB/32| 0.000% | 0.000% | 0.000% | -6.219% | > > > | NB/16| 0.000% | 0.000% | 0.000% | 3.323% | > > > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178% | > > > > > > For 100GB shared_buffers, we can observe regression > > > beyond NBuffers/32. So with this, we can conclude > > > that NBuffers/32 is the right threshold. > > > For NBuffers/16 and beyond, the patched performs > > > worse than master. In other words, the cost of for finding > > > to be invalidated buffers gets higher in the optimized path > > > than the traditional path. > > > > > > So in attached V39 patches, I have updated the threshold > > > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32. > > > > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > > value for the threshold based on these results. I would like to > > slightly modify part of the commit message in the first patch as below > > [1], otherwise, I am fine with the changes. Unless you or anyone else > > has any more comments, I am planning to push the 0001 and 0002 > > sometime next week. > > > > [1] > > "The recovery path of DropRelFileNodeBuffers() is optimized so that > > scanning of the whole buffer pool can be avoided when the number of > > blocks to be truncated in a relation is below a certain threshold. For > > such cases, we find the buffers by doing lookups in BufMapping table. 
> > This improves the performance by more than 100 times in many cases > > when several small tables (tested with 1000 relations) are truncated > > and where the server is configured with a large value of shared > > buffers (greater than 100GB)." > > Thank you for taking a look at the results of the tests. And it's also > consistent with the results from Tang too. > The commit message LGTM. +1. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote: > > On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com > wrote: > > > > [Results for VACUUM on single relation] > > Average of 5 runs. > > > > 1. % REGRESSION > > % Regression: (patched - master)/master > > > > | rel_size | 128MB | 1GB| 20GB | 100GB| > > |--||||--| > > | NB/512 | 0.000% | 0.000% | 0.000% | -32.680% | > > | NB/256 | 0.000% | 0.000% | 0.000% | 0.000% | > > | NB/128 | 0.000% | 0.000% | 0.000% | -16.502% | > > | NB/64| 0.000% | 0.000% | 0.000% | -9.841% | > > | NB/32| 0.000% | 0.000% | 0.000% | -6.219% | > > | NB/16| 0.000% | 0.000% | 0.000% | 3.323% | > > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178% | > > > > For 100GB shared_buffers, we can observe regression > > beyond NBuffers/32. So with this, we can conclude > > that NBuffers/32 is the right threshold. > > For NBuffers/16 and beyond, the patched performs > > worse than master. In other words, the cost of for finding > > to be invalidated buffers gets higher in the optimized path > > than the traditional path. > > > > So in attached V39 patches, I have updated the threshold > > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32. > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > value for the threshold based on these results. I would like to > slightly modify part of the commit message in the first patch as below > [1], otherwise, I am fine with the changes. Unless you or anyone else > has any more comments, I am planning to push the 0001 and 0002 > sometime next week. > > [1] > "The recovery path of DropRelFileNodeBuffers() is optimized so that > scanning of the whole buffer pool can be avoided when the number of > blocks to be truncated in a relation is below a certain threshold. For > such cases, we find the buffers by doing lookups in BufMapping table. 
> This improves the performance by more than 100 times in many cases > when several small tables (tested with 1000 relations) are truncated > and where the server is configured with a large value of shared > buffers (greater than 100GB)." Thank you for taking a look at the results of the tests. And it's also consistent with the results from Tang too. The commit message LGTM. Regards, Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com wrote: > > [Results for VACUUM on single relation] > Average of 5 runs. > > 1. % REGRESSION > % Regression: (patched - master)/master > > | rel_size | 128MB | 1GB| 20GB | 100GB| > |--||||--| > | NB/512 | 0.000% | 0.000% | 0.000% | -32.680% | > | NB/256 | 0.000% | 0.000% | 0.000% | 0.000% | > | NB/128 | 0.000% | 0.000% | 0.000% | -16.502% | > | NB/64| 0.000% | 0.000% | 0.000% | -9.841% | > | NB/32| 0.000% | 0.000% | 0.000% | -6.219% | > | NB/16| 0.000% | 0.000% | 0.000% | 3.323% | > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178% | > > For 100GB shared_buffers, we can observe regression > beyond NBuffers/32. So with this, we can conclude > that NBuffers/32 is the right threshold. > For NBuffers/16 and beyond, the patched performs > worse than master. In other words, the cost of for finding > to be invalidated buffers gets higher in the optimized path > than the traditional path. > > So in attached V39 patches, I have updated the threshold > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32. > Thanks for the detailed tests. NBuffers/32 seems like an appropriate value for the threshold based on these results. I would like to slightly modify part of the commit message in the first patch as below [1], otherwise, I am fine with the changes. Unless you or anyone else has any more comments, I am planning to push the 0001 and 0002 sometime next week. [1] "The recovery path of DropRelFileNodeBuffers() is optimized so that scanning of the whole buffer pool can be avoided when the number of blocks to be truncated in a relation is below a certain threshold. For such cases, we find the buffers by doing lookups in BufMapping table. This improves the performance by more than 100 times in many cases when several small tables (tested with 1000 relations) are truncated and where the server is configured with a large value of shared buffers (greater than 100GB)." -- With Regards, Amit Kapila.
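As a rough illustration of why a threshold-gated choice between the two paths makes sense: the traditional path touches every buffer header (cost proportional to NBuffers), while the optimized path probes the BufMapping table once per (block, fork). The toy model below sketches the committed NBuffers/32 heuristic; the function names are mine, not PostgreSQL's, and real costs also involve partition-lock contention, which is why the cutoff was settled empirically rather than analytically.

```c
#include <assert.h>
#include <stdint.h>

/* Full scan: inspect every buffer header once. */
static uint64_t cost_full_scan(uint64_t nbuffers)
{
    return nbuffers;
}

/* Optimized path: one BufMapping hash probe per (block, fork). */
static uint64_t cost_lookup(uint64_t nblocks, uint64_t nforks)
{
    return nblocks * nforks;
}

/* The committed heuristic: take the lookup path only when the relation
 * size is reliably known (cached, true during recovery) and the block
 * count is small relative to shared buffers (below NBuffers/32). */
static int choose_lookup_path(int cached, uint64_t nblocks, uint64_t nbuffers)
{
    return cached && nblocks < nbuffers / 32;
}
```

For example, truncating a 100-block relation with 4 forks costs on the order of 400 probes on the lookup path, versus scanning all 1,310,720 headers of a 10GB buffer pool on the traditional path.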
RE: [Patch] Optimize dropping of relation buffers using dlist
>I'd like to take a look at them and redo some of the tests using my machine. I'll
>send my test results in a separate email after this.

I did the same tests with Kirk's scripts using the latest patch on my own machine. The results look pretty good and similar to Kirk's. Average of 5 runs.

[VACUUM failover test for 1000 relations]
Unit is second, %reg = (patched - master)/master

| s_b    | Master | Patched | %reg    |
|--------|--------|---------|---------|
| 128 MB | 0.408  | 0.308   | -24.44% |
| 1 GB   | 0.809  | 0.308   | -61.94% |
| 20 GB  | 12.529 | 0.308   | -97.54% |
| 100 GB | 59.310 | 0.369   | -99.38% |

[TRUNCATE failover test for 1000 relations]
Unit is second, %reg = (patched - master)/master

| s_b    | Master | Patched | %reg    |
|--------|--------|---------|---------|
| 128 MB | 0.287  | 0.207   | -27.91% |
| 1 GB   | 0.688  | 0.208   | -69.84% |
| 20 GB  | 12.449 | 0.208   | -98.33% |
| 100 GB | 61.800 | 0.207   | -99.66% |

Besides, I did the test for the threshold value again. (I rechecked my test process and found that I had forgotten to check the data synchronization state on the standby, which may have introduced some noise into my earlier results.) The following results show we can't get further optimization beyond NBuffers/32, just like Kirk's test results, so I agree with Kirk on the threshold value.
%regression:

| rel_size | 128MB | 1GB | 20GB | 100GB |
|----------|-------|-----|------|-------|
| NB/512   | 0%    | 0%  | 0%   | -48%  |
| NB/256   | 0%    | 0%  | 0%   | -33%  |
| NB/128   | 0%    | 0%  | 0%   | -9%   |
| NB/64    | 0%    | 0%  | 0%   | -5%   |
| NB/32    | 0%    | 0%  | -4%  | -3%   |
| NB/16    | 0%    | 0%  | -4%  | 1%    |
| NB/8     | 1%    | 0%  | 1%   | 3%    |

Optimization details (unit: second):

Patched:
| shared_buffers | NBuffers/512 | NBuffers/256 | NBuffers/128 | NBuffers/64 | NBuffers/32 | NBuffers/16 | NBuffers/8 |
|----------------|--------------|--------------|--------------|-------------|-------------|-------------|------------|
| 128M  | 0.107 | 0.107 | 0.107 | 0.106 | 0.107 | 0.107 | 0.107 |
| 1G    | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 |
| 20G   | 0.107 | 0.108 | 0.207 | 0.307 | 0.442 | 0.876 | 1.577 |
| 100G  | 0.208 | 0.308 | 0.559 | 1.060 | 1.961 | 4.567 | 7.922 |

Master:
| shared_buffers | NBuffers/512 | NBuffers/256 | NBuffers/128 | NBuffers/64 | NBuffers/32 | NBuffers/16 | NBuffers/8 |
|----------------|--------------|--------------|--------------|-------------|-------------|-------------|------------|
| 128M  | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.106 |
| 1G    | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 |
| 20G   | 0.107 | 0.107 | 0.208 | 0.308 | 0.457 | 0.910 | 1.560 |
| 100G  | 0.308 | 0.409 | 0.608 | 1.110 | 2.011 | 4.516 | 7.721 |

[Specs]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 128G
OS: CentOS 8

Any questions about my tests are welcome.

Regards,
Tang
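The %reg figures in these tables are straightforward to recompute with a one-line helper (regression_pct is an illustrative name of my own, using the %reg = (patched - master)/master definition stated above; negative means the patched server is faster):

```c
#include <assert.h>

/* Percent regression as defined in the tables above:
 * negative values mean the patched build beats master. */
static double regression_pct(double patched, double master)
{
    return (patched - master) / master * 100.0;
}
```

Plugging in the 128MB VACUUM row (patched 0.308s, master 0.408s) yields roughly -24.5%, matching the -24.44% reported from the unrounded measurements.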
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Kirk, >And if you want to test, I have already indicated the detailed steps including >the scripts I used. Have fun testing! Thank you for sharing the test steps and scripts. I'd like to take a look at them and redo some of the tests using my machine. I'll send my test results in a separate email after this. Regards, Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wed, January 6, 2021 7:04 PM (JST), I wrote:
> I will resume the test similar to Tang, because she also executed the original
> failover test which I have been doing before.
> To avoid confusion and to check if the results from mine and Tang are
> consistent, I also did the recovery/failover test for VACUUM on single
> relation, which I will send in a separate email after this.

A. Test to find the right THRESHOLD

So below are the procedures and results of the VACUUM recovery performance test on a single relation. I followed the advice below and applied the supplementary patch Test-for-threshold.patch on top of V39. This ensures that we always enter the optimized path; we then use the threshold as the relation size.

> >One idea could be to remove "nBlocksToInvalidate <
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> >always use optimized path for the tests. Then use the relation size
> >as NBuffers/128, NBuffers/256, NBuffers/512 for different values of
> >shared buffers as 128MB, 1GB, 20GB, 100GB.

Each relation size is NBuffers/XXX, so I used the attached "rel.sh" script to test from NBuffers/512 up to NBuffers/8 relation size per shared_buffers. I did not go beyond 8 because it took too much time, and I could already observe significant results by that point.

[Vacuum Recovery Performance on Single Relation]
1. Set up synchronous streaming replication. I used the configuration written at the bottom of this email.
2. [Primary] Create 1 table. (rel.sh create)
3. [Primary] Insert data of NBuffers/XXX size. Make sure to use the correct size for the set shared_buffers by commenting out the right size in "insert" of the rel.sh script. (rel.sh insert)
4. [Primary] Delete table. (rel.sh delete)
5. [Standby] Optional: To double-check that the DELETE is reflected on the standby, SELECT count(*) FROM tableXXX; and make sure it returns 0.
6. [Standby] Pause WAL replay.
(rel.sh pause) (This script will execute SELECT pg_wal_replay_pause();)
7. [Primary] VACUUM the single relation. (rel.sh vacuum)
8. [Primary] After the vacuum finishes, stop the server. (rel.sh stop) (The script will execute pg_ctl stop -D $PGDATA -w -mi)
9. [Standby] Resume WAL replay and promote the standby. (rel.sh resume) It basically prints a timestamp when resuming WAL replay, and prints another timestamp when the promotion is done. Compute the time difference.

[Results for VACUUM on single relation]
Average of 5 runs.

1. % REGRESSION
% Regression: (patched - master)/master

| rel_size | 128MB  | 1GB    | 20GB   | 100GB    |
|----------|--------|--------|--------|----------|
| NB/512   | 0.000% | 0.000% | 0.000% | -32.680% |
| NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   |
| NB/128   | 0.000% | 0.000% | 0.000% | -16.502% |
| NB/64    | 0.000% | 0.000% | 0.000% | -9.841%  |
| NB/32    | 0.000% | 0.000% | 0.000% | -6.219%  |
| NB/16    | 0.000% | 0.000% | 0.000% | 3.323%   |
| NB/8     | 0.000% | 0.000% | 0.000% | 8.178%   |

For 100GB shared_buffers, we can observe regression beyond NBuffers/32. So with this, we can conclude that NBuffers/32 is the right threshold. For NBuffers/16 and beyond, the patched version performs worse than master. In other words, the cost of finding the buffers to be invalidated gets higher in the optimized path than in the traditional path.

So in the attached V39 patches, I have updated the threshold BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.

2. [PATCHED]
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB |
|----------|-------|-------|-------|-------|
| NB/512   | 0.106 | 0.106 | 0.106 | 0.206 |
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/128   | 0.106 | 0.106 | 0.206 | 0.506 |
| NB/64    | 0.106 | 0.106 | 0.306 | 0.907 |
| NB/32    | 0.106 | 0.106 | 0.406 | 1.508 |
| NB/16    | 0.106 | 0.106 | 0.706 | 3.109 |
| NB/8     | 0.106 | 0.106 | 1.307 | 6.614 |

3.
MASTER
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB |
|----------|-------|-------|-------|-------|
| NB/512   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/128   | 0.106 | 0.106 | 0.206 | 0.606 |
| NB/64    | 0.106 | 0.106 | 0.306 | 1.006 |
| NB/32    | 0.106 | 0.106 | 0.406 | 1.608 |
| NB/16    | 0.106 | 0.106 | 0.706 | 3.009 |
| NB/8     | 0.106 | 0.106 | 1.307 | 6.114 |

I used the following configurations:

[postgresql.conf]
shared_buffers = 100GB #20GB,1GB,128MB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 10000
max_wal_size = 20GB
# For streaming replication from primary. Don't uncomment on Standby.
synchronous_commit = remote_write
synchronous_standby_names = 'walreceiver'
# For Standby. Don't uncomment on Primary.
# hot_standby = on
#primary_conninfo = 'host=... user=... port=... application_name=walreceiver'

--
B. Regression Test using the NBuffers/32 Threshold (V39 Patches)

For this
RE: [Patch] Optimize dropping of relation buffers using dlist
On Sunday, January 3, 2021 10:35 PM (JST), Amit Kapila wrote: > On Sat, Jan 2, 2021 at 7:47 PM k.jami...@fujitsu.com > wrote: > > > > Happy new year. The V38 LGTM. > > Apologies for a bit of delay on posting the test results, but since > > it's the start of commitfest, here it goes and the results were interesting. > > > > I executed a VACUUM test using the same approach that Tsunakawa-san > > did in [1], but only this time, the total time that DropRelFileNodeBuffers() > took. > > > > Please specify the exact steps like did you deleted all the rows from a table > or > some of it or none before performing Vacuum? How did you measure this > time, did you removed the cached check? It would be better if you share the > scripts and or the exact steps so that the same can be used by others to > reproduce. Basically, I used the TimestampDifference function in DropRelFileNodeBuffers(). I also executed DELETE before VACUUM. I also removed the "nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD" check and used the threshold as the relation size. > Hmm, in the tests done by Tang, the results indicate that in some cases the > patched version is slower at even NBuffers/32, so not sure if we can go to > values shown by you unless she is doing something wrong. I think the > difference in results could be because both of you are using different > techniques to measure the timings, so it might be better if both of you can > share scripts or exact steps used to measure the time and other can use the > same technique and see if we are getting consistent results. Right, since we want consistent results, please disregard the approach that I did. I will resume the test similar to Tang, because she also executed the original failover test which I have been doing before. To avoid confusion and to check if the results from mine and Tang are consistent, I also did the recovery/failover test for VACUUM on single relation, which I will send in a separate email after this. Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

Sorry for my late reply. Here are my answers to your earlier questions.

>BTW, it is not clear why the advantage for single table is not as big as
>multiple tables with the Truncate command

I guess it's the number of table blocks that caused this difference. For a single table, I tested with the block count at the threshold. For multiple tables, I tested with a block count (one, dozens, or hundreds per table) far below the threshold. The closer the table's block count is to the threshold, the smaller the advantage. I tested the 3 situations below with 50 tables when shared_buffers = 20G / 100G.

1. For multiple tables with one, dozens, or hundreds of blocks (far below the threshold) per table, we got a significant improvement, like [1].
2. For multiple tables with half the threshold blocks per table, the advantage becomes smaller, like [2].
3. For multiple tables with the threshold blocks per table, the advantage becomes smaller still, like [3].

[1] 247 blocks per table
| s_b   | master | patched | %reg ((patched-master)/patched) |
|-------|--------|---------|---------------------------------|
| 20GB  | 1.109  | 0.108   | -927%  |
| 100GB | 3.113  | 0.108   | -2782% |

[2] NBuffers/256/2 blocks per table
| s_b   | master | patched | %reg |
|-------|--------|---------|------|
| 20GB  | 2.012  | 1.210   | -66% |
| 100GB | 10.226 | 6.4     | -60% |

[3] NBuffers/256 blocks per table
| s_b   | master | patched | %reg |
|-------|--------|---------|------|
| 20GB  | 3.868  | 2.412   | -60% |
| 100GB | 14.977 | 10.591  | -41% |

>Can you share your exact test steps for any one of the tests? Also, did you
>change autovacumm = off for these tests?

Yes, I configured a streaming replication environment as Kirk did before.
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min

Test steps (e.g. shared_buffers = 20G, NBuffers/512: table blocks = 20*1024*1024/8/512 = 5120, table size (kB) = 20*1024*1024/512 = 40960 kB):
1. (Master) create table test(id int, v_ch varchar, v_ch1 varchar);
2. (Master) Insert about 40MB of data into the table.
3. (Master) delete from table (all rows of the table)
4. (Standby) To test with failover, pause the WAL replay on the standby server. SELECT pg_wal_replay_pause();
5. (Master) VACUUM;
6. (Master) Stop the primary server. pg_ctl stop -D $PGDATA -w
7.
(Standby) Resume WAL replay and promote the standby. (Get the recovery time from this step.) Regards, Tang
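The sizing arithmetic in the steps above (20GB shared_buffers, NBuffers/512 giving 5120 blocks = 40960 kB) generalizes as follows. This is a small sketch assuming the default 8kB block size, with helper names of my own choosing:

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ_KB 8   /* default PostgreSQL block size, in kB */

/* Number of shared buffers for a given shared_buffers setting in kB. */
static uint64_t nbuffers_for(uint64_t shared_buffers_kb)
{
    return shared_buffers_kb / BLCKSZ_KB;
}

/* Table size in blocks for a target fraction NBuffers/divisor. */
static uint64_t target_blocks(uint64_t shared_buffers_kb, uint64_t divisor)
{
    return nbuffers_for(shared_buffers_kb) / divisor;
}

/* The same target expressed in kB, as used in the test steps. */
static uint64_t target_kb(uint64_t shared_buffers_kb, uint64_t divisor)
{
    return target_blocks(shared_buffers_kb, divisor) * BLCKSZ_KB;
}
```

For 20GB shared_buffers this gives 2,621,440 buffers, so a NBuffers/512 test table is 5120 blocks, i.e. 40960 kB, matching the worked example in the steps.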
Re: [Patch] Optimize dropping of relation buffers using dlist
On Sat, Jan 2, 2021 at 7:47 PM k.jami...@fujitsu.com wrote: > > Happy new year. The V38 LGTM. > Apologies for a bit of delay on posting the test results, but since it's the > start of commitfest, here it goes and the results were interesting. > > I executed a VACUUM test using the same approach that Tsunakawa-san did in > [1], > but only this time, the total time that DropRelFileNodeBuffers() took. > Please specify the exact steps like did you deleted all the rows from a table or some of it or none before performing Vacuum? How did you measure this time, did you removed the cached check? It would be better if you share the scripts and or the exact steps so that the same can be used by others to reproduce. > I used only a single relation, tried with various sizes using the values of > threshold: > NBuffers/512..NBuffers/1, as advised by Amit. > > Example of relation sizes for NBuffers/512. > 100GB shared_buffers: 200 MB > 20GB shared_buffers: 40 MB > 1GB shared_buffers:2 MB > 128MB shared_buffers: 0.25 MB > .. > > Although the above table shows that NBuffers/2 would be the > threshold, I know that the cost would vary depending on the machine > specs. I think I can suggest the threshold and pick one from among > NBuffers/2, NBuffers/4 or NBuffers/8, because their values are closer > to the InvalidatedBuffers. > Hmm, in the tests done by Tang, the results indicate that in some cases the patched version is slower at even NBuffers/32, so not sure if we can go to values shown by you unless she is doing something wrong. I think the difference in results could be because both of you are using different techniques to measure the timings, so it might be better if both of you can share scripts or exact steps used to measure the time and other can use the same technique and see if we are getting consistent results. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wednesday, December 30, 2020 8:58 PM, Amit Kapila wrote: > On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying > wrote: > > > > Hi Amit, > > > > In last > > > mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e625718 > 2 > > 564b6%40G08CNEXMBPEKD05.g08.fujitsu.local), > > I've sent you the performance test results(run only 1 time) on single table. > Here is my the retested results(average by 15 times) which I think is more > accurate. > > > > In terms of 20G and 100G, the optimization on 100G is linear, but 20G is > nonlinear(also include test results on shared buffers of 50G/60G), so it's a > little difficult to decide the threshold from the two for me. > > If just consider 100G, I think NBuffers/32 is the optimized max relation > > size. > But I don't know how to judge for 20G. If you have any suggestion, kindly let > me know. > > > > Considering these results NBuffers/64 seems a good threshold as beyond > that there is no big advantage. BTW, it is not clear why the advantage for > single table is not as big as multiple tables with the Truncate command. Can > you share your exact test steps for any one of the tests? > Also, did you change autovacumm = off for these tests, if not then the results > might not be reliable because before you run the test via Vacuum command > autovacuum would have done that work? Happy new year. The V38 LGTM. Apologies for a bit of delay on posting the test results, but since it's the start of commitfest, here it goes and the results were interesting. I executed a VACUUM test using the same approach that Tsunakawa-san did in [1], but only this time, I measured the total time that DropRelFileNodeBuffers() took. I used only a single relation, tried with various sizes using the values of threshold: NBuffers/512..NBuffers/1, as advised by Amit. Example of relation sizes for NBuffers/512.
100GB shared_buffers: 200 MB
20GB shared_buffers:  40 MB
1GB shared_buffers:   2 MB
128MB shared_buffers: 0.25 MB

The regression, which means the patch performs worse than master, only happens
for relation size NBuffers/2 and below for all shared_buffers. The fastest
performance on a single relation was using the relation size NBuffers/512.

[VACUUM Recovery Performance on Single Relation]
Legend: P_XXX (Patch, NBuffers/XXX relation size),
        M_XXX (Master, NBuffers/XXX relation size)
Unit: seconds

| Rel Size | 100 GB s_b | 20 GB s_b | 1 GB s_b | 128 MB s_b |
|----------|------------|-----------|----------|------------|
| P_512    | 0.012594   | 0.001989  | 0.81     | 0.12       |
| M_512    | 0.208757   | 0.046212  | 0.002013 | 0.000295   |
| P_256    | 0.026311   | 0.004416  | 0.000129 | 0.21       |
| M_256    | 0.241017   | 0.047234  | 0.002363 | 0.000298   |
| P_128    | 0.044684   | 0.009784  | 0.000290 | 0.42       |
| M_128    | 0.253588   | 0.047952  | 0.002454 | 0.000319   |
| P_64     | 0.065806   | 0.017444  | 0.000521 | 0.75       |
| M_64     | 0.268311   | 0.050361  | 0.002730 | 0.000339   |
| P_32     | 0.121441   | 0.033431  | 0.001646 | 0.000112   |
| M_32     | 0.285254   | 0.061486  | 0.003640 | 0.000364   |
| P_16     | 0.255503   | 0.065492  | 0.001663 | 0.000144   |
| M_16     | 0.377013   | 0.081613  | 0.003731 | 0.000372   |
| P_8      | 0.560616   | 0.109509  | 0.005954 | 0.000465   |
| M_8      | 0.692596   | 0.112178  | 0.006667 | 0.000553   |
| P_4      | 1.109437   | 0.162924  | 0.011229 | 0.000861   |
| M_4      | 1.162125   | 0.178764  | 0.011635 | 0.000935   |
| P_2      | 2.202231   | 0.317832  | 0.020783 | 0.002646   |
| M_2      | 1.583959   | 0.306269  | 0.015705 | 0.002021   |
| P_1      | 3.080032   | 0.632747  | 0.032183 | 0.002660   |
| M_1      | 2.705485   | 0.543970  | 0.030658 | 0.001941   |

%reg = (Patched - Master) / Patched

| %reg_relsize | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b |
|--------------|------------|------------|------------|------------|
| %reg_512     | -1557.587% | -2223.006% | -2385.185% | -2354.167% |
| %reg_256     | -816.041%  | -969.691%  | -1731.783% | -1319.048% |
| %reg_128     | -467.514%  | -390.123%  | -747.008%  | -658.333%  |
| %reg_64      | -307.727%  | -188.704%  | -423.992%  | -352.000%  |
| %reg_32      | -134.891%  | -83.920%   | -121.097%  | -225.970%  |
| %reg_16      | -47.557%   | -24.614%   | -124.279%  | -157.390%  |
| %reg_8       | -23.542%   | -2.437%    | -11.967%   | -19.010%   |
| %reg_4       | -4.749%    | -9.722%    | -3.608%    | -8.595%    |
| %reg_2       | 28.075%    | 3.638%     | 24.436%    | 23.615%    |
| %reg_1       | 12.160%    | 14.030%    | 4.739%     | 27.010%    |

Since our goal is to get the approximate threshold where the cost of finding
the to-be-invalidated buffers gets higher in the optimized path than in the
traditional path:

A. Traditional Path
1. For each buffer in shared_buffers, compare the relfilenode.
2. LockBufHdr()
3. Compare block number, InvalidateBuffers() if
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying wrote:
>
> Hi Amit,
>
> In last mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
> I've sent you the performance test results(run only 1 time) on single table.
> Here are my retested results (averaged over 15 runs), which I think are more
> accurate.
>
> In terms of 20G and 100G, the optimization on 100G is linear, but 20G is
> nonlinear(also include test results on shared buffers of 50G/60G), so it's a
> little difficult to decide the threshold from the two for me.
> If just considering 100G, I think NBuffers/32 is the optimized max relation
> size. But I don't know how to judge for 20G. If you have any suggestion,
> kindly let me know.
>

Considering these results, NBuffers/64 seems a good threshold, as beyond that
there is no big advantage. BTW, it is not clear why the advantage for a single
table is not as big as for multiple tables with the Truncate command. Can you
share your exact test steps for any one of the tests?
Also, did you change autovacuum = off for these tests? If not, the results
might not be reliable, because before you run the test via the Vacuum command,
autovacuum would have done that work.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

In last mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
I've sent you the performance test results(run only 1 time) on single table.
Here are my retested results (averaged over 15 runs), which I think are more
accurate.

In terms of 20G and 100G, the optimization on 100G is linear, but 20G is
nonlinear(also include test results on shared buffers of 50G/60G), so it's a
little difficult to decide the threshold from the two for me.
If just considering 100G, I think NBuffers/32 is the optimized max relation
size. But I don't know how to judge for 20G. If you have any suggestion,
kindly let me know.

#%reg                128M   1G     20G    100G
----------------------------------------------
%reg(NBuffers/512)   0%     -1%    -5%    -26%
%reg(NBuffers/256)   0%     0%     5%     -20%
%reg(NBuffers/128)   -1%    -1%    -10%   -16%
%reg(NBuffers/64)    -1%    0%     0%     -8%
%reg(NBuffers/32)    0%     0%     -2%    -4%
%reg(NBuffers/16)    0%     0%     -6%    4%
%reg(NBuffers/8)     1%     0%     2%     -2%
%reg(NBuffers/4)     0%     0%     2%     2%

Optimization details(unit: second):

patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.107         0.107        0.107        0.107        0.108       0.208
1G              0.107         0.108         0.107         0.108        0.208        0.208        0.308       0.409
20G             0.199         0.299         0.317         0.408        0.591        0.900        1.561       2.866
100G            0.318         0.381         0.645         0.992        1.913        3.640        6.615       13.389

master(HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.108         0.108        0.107        0.107        0.107       0.208
1G              0.108         0.108         0.108         0.108        0.208        0.207        0.308       0.409
20G             0.208         0.283         0.350         0.408        0.601        0.955        1.529       2.806
100G            0.400         0.459         0.751         1.068        1.984        3.506        6.735       13.101

Regards,
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

>I think one table with a varying amount of data is sufficient for the vacuum
>test.
>I think with more number of tables there is a greater chance of variation.
>We have previously used multiple tables in one of the tests because of
>the Truncate operation (which uses DropRelFileNodesAllBuffers that
>takes multiple relations as input) and that is not true for Vacuum operation
>which I suppose you are testing here.

I retested performance on a single table several times; the table size varies
with the BUF_DROP_FULL_SCAN_THRESHOLD for different shared buffers.

When shared_buffers is below 20G, there were no significant changes between
master(HEAD) and patched.
And according to the results compared between 20G and 100G, we can get
optimization up to NBuffers/128, but there is no benefit from NBuffers/256.
I've tested many times, and most times the same results came out; I don't know
why. But if I used 5 tables (each table sized at BUF_DROP_FULL_SCAN_THRESHOLD),
then we can get benefit from it (NBuffers/256).

Here are my test results for a single table. If you have any question or
suggestion, kindly let me know.
%reg = (patched - master(HEAD)) / patched

Optimized percentage:
shared_buffers  %reg(NBuffers/512)  %reg(NBuffers/256)  %reg(NBuffers/128)  %reg(NBuffers/64)  %reg(NBuffers/32)  %reg(NBuffers/16)  %reg(NBuffers/8)  %reg(NBuffers/4)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
128M            0%                  0%                  -1%                 0%                 1%                 0%                 0%                0%
1G              -1%                 0%                  -1%                 0%                 0%                 0%                 0%                0%
20G             0%                  0%                  -33%                0%                 0%                 -13%               0%                0%
100G            -32%                0%                  -49%                0%                 10%                30%                0%                6%

Result details(unit: second):

patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.107         0.107        0.108        0.107        0.108       0.208
1G              0.107         0.107         0.107         0.108        0.208        0.208        0.308       0.409
20G             0.208         0.308         0.308         0.409        0.609        0.808        1.511       2.713
100G            0.309         0.408         0.609         1.010        2.011        5.017        6.620       13.931

master(HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.108         0.107        0.107        0.107        0.108       0.208
1G              0.108         0.107         0.108         0.108        0.208        0.207        0.308       0.409
20G             0.208         0.309         0.409         0.409        0.609        0.910        1.511       2.712
100G            0.408         0.408         0.909         1.010        1.811        3.515        6.619       13.032

Regards,
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

>I think one table with a varying amount of data is sufficient for the vacuum
>test.
>I think with more number of tables there is a greater chance of variation.
>We have previously used multiple tables in one of the tests because of the
>Truncate operation (which uses DropRelFileNodesAllBuffers that takes multiple
>relations as input) and that is not true for Vacuum operation which I suppose
>you are testing here.

Thanks for your advice and kind explanation. I'll continue the threshold test
with one single table.

Regards,
Tang
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Dec 25, 2020 at 9:28 AM Tang, Haiying wrote: > > Hi Amit, > > >But how can we conclude NBuffers/128 is the maximum relation size? > >Because the maximum size would be where the performance is worse than > >the master, no? I guess we need to try by NBuffers/64, NBuffers/32, > > till we get the threshold where master performs better. > > You are right, we should keep on testing until no optimization. > > >I think we should find a better way to display these numbers because in > >cases like where master takes 537.978s and patch takes 3.815s > > Yeah, I think we can change the %reg formula from (patched- master)/ master > to (patched- master)/ patched. > > >Table size should be more than 8k to get all this data because 8k means > >just one block. I guess either it is a typo or some other mistake. > > 8k here is the relation size, not data size. > For example, when I tested recovery performance of 400M relation size, I used > 51200 tables(8k per table). > Please let me know if you think this is not appropriate. > I think one table with a varying amount of data is sufficient for the vacuum test. I think with more number of tables there is a greater chance of variation. We have previously used multiple tables in one of the tests because of the Truncate operation (which uses DropRelFileNodesAllBuffers that takes multiple relations as input) and that is not true for Vacuum operation which I suppose you are testing here. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

>But how can we conclude NBuffers/128 is the maximum relation size?
>Because the maximum size would be where the performance is worse than
>the master, no? I guess we need to try by NBuffers/64, NBuffers/32,
>till we get the threshold where master performs better.

You are right, we should keep on testing until there is no optimization.

>I think we should find a better way to display these numbers because in
>cases like where master takes 537.978s and patch takes 3.815s

Yeah, I think we can change the %reg formula from (patched - master) / master
to (patched - master) / patched.

>Table size should be more than 8k to get all this data because 8k means
>just one block. I guess either it is a typo or some other mistake.

8k here is the relation size, not the data size.
For example, when I tested recovery performance of 400M relation size, I used
51200 tables (8k per table).
Please let me know if you think this is not appropriate.

Regards,
Tang

-----Original Message-----
From: Amit Kapila
Sent: Thursday, December 24, 2020 9:11 PM
To: Tang, Haiying/唐 海英
Cc: Tsunakawa, Takayuki/綱川 貴之 ; Jamison, Kirk/ジャミソン カーク ; Kyotaro Horiguchi ; Andres Freund ; Tom Lane ; Thomas Munro ; Robert Haas ; Tomas Vondra ; pgsql-hackers
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist

On Thu, Dec 24, 2020 at 2:31 PM Tang, Haiying wrote:
>
> Hi Amit, Kirk
>
> >One idea could be to remove "nBlocksToInvalidate <
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> >always use optimized path for the tests. Then use the relation size
> >as NBuffers/128, NBuffers/256, NBuffers/512 for different values of
> >shared buffers as 128MB, 1GB, 20GB, 100GB.
>
> I followed your idea to remove the check and use different relation sizes
> for different shared buffers as 128M, 1G, 20G, 50G (my environment can't
> support 100G, so I chose 50G).
> According to the results, all three thresholds can get optimized, even
> NBuffers/128 when shared_buffers > 128M.
> IMHO, I think NBuffers/128 is the maximum relation size we can get
> optimization in the three thresholds. Please let me know if I made something
> wrong.
>

But how can we conclude NBuffers/128 is the maximum relation size? Because the
maximum size would be where the performance is worse than the master, no? I
guess we need to try NBuffers/64, NBuffers/32, ... till we get the threshold
where master performs better.

> Recovery after vacuum test results as below 'Optimized percentage' and
> 'Optimization details(unit: second)' show:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
> NBuffers/128
> %reg: means (patched(512) - master(512)) / master(512)
>
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -----------------------------------------------
> 128M            0%         -1%        -1%
> 1G              -65%       -49%       -62%
> 20G             -98%       -98%       -98%
> 50G             -99%       -99%       -99%
>
> Optimization details(unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -----------------------------------------------------------------------------------------------
> 128M            0.108        0.108         0.109        0.108         0.109        0.108
> 1G              0.310        0.107         0.410        0.208         0.811        0.309
> 20G             94.493       1.511         188.777      3.014         380.633      6.020
> 50G             537.978      3.815         867.453      7.524         1559.076     15.541
>

I think we should find a better way to display these numbers because in cases
like where master takes 537.978s and patch takes 3.815s, it is clear that the
patch has reduced the time by more than 100 times, whereas in your table it
shows 99%.

> Test prepare:
> Below is the test table amount for different shared buffers. Each table
> size is 8k,

Table size should be more than 8k to get all this data because 8k means just
one block. I guess either it is a typo or some other mistake.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Kirk,

>Perhaps there is a confusing part in the presented table where you indicated
>master(512), master(256), master(128).
>Because the master is not supposed to use the BUF_DROP_FULL_SCAN_THRESHOLD and
>just execute the existing default full scan of NBuffers.
>Or I may have misunderstood something?

Sorry for the confusion; I didn't make it clear. I didn't use
BUF_DROP_FULL_SCAN_THRESHOLD for master. master(512) means the test table
amount in master is the same as for patched(512), and likewise for master(256)
and master(128). I meant to mark 512/256/128 to distinguish the results in
master for the three thresholds (applied in the patches).

Regards,
Tang
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 23, 2020 at 6:27 PM k.jami...@fujitsu.com wrote: > > > It compiles. Passes the regression tests too. > Your feedbacks are definitely welcome. > Thanks, the patches look good to me now. I have slightly edited the patches for comments, commit messages, and removed the duplicate code/check in smgrnblocks. I have changed the order of patches (moved Assert related patch to last because as mentioned earlier, I am not sure if we want to commit it.). We might still have to change the scan threshold value based on your and Tang-San's results. -- With Regards, Amit Kapila. v38-0001-Optimize-DropRelFileNodeBuffers-for-recovery.patch Description: Binary data v38-0002-Optimize-DropRelFileNodesAllBuffers-for-recovery.patch Description: Binary data v38-0003-Prevent-invalidating-blocks-in-smgrextend-during.patch Description: Binary data
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thu, December 24, 2020 6:02 PM JST, Tang, Haiying wrote: > Hi Amit, Kirk > > >One idea could be to remove "nBlocksToInvalidate < > >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached && > >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it > always > >use optimized path for the tests. Then use the relation size as > >NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared > >buffers as 128MB, 1GB, 20GB, 100GB. > > I followed your idea to remove check and use different relation size for > different shared buffers as 128M,1G,20G,50G(my environment can't support > 100G, so I choose 50G). > According to results, all three thresholds can get optimized, even > NBuffers/128 when shared_buffers > 128M. > IMHO, I think NBuffers/128 is the maximum relation size we can get > optimization in the three thresholds, Please let me know if I made something > wrong. Hello Tang, Thank you very much again for testing. Perhaps there is a confusing part in the presented table where you indicated master(512), master(256), master(128). Because the master is not supposed to use the BUF_DROP_FULL_SCAN_THRESHOLD and just execute the existing default full scan of NBuffers. Or I may have misunderstood something? 
> Recovery after vacuum test results as below 'Optimized percentage' and
> 'Optimization details(unit: second)' show:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
> NBuffers/128
> %reg: means (patched(512) - master(512)) / master(512)
>
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -----------------------------------------------
> 128M            0%         -1%        -1%
> 1G              -65%       -49%       -62%
> 20G             -98%       -98%       -98%
> 50G             -99%       -99%       -99%
>
> Optimization details(unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -----------------------------------------------------------------------------------------------
> 128M            0.108        0.108         0.109        0.108         0.109        0.108
> 1G              0.310        0.107         0.410        0.208         0.811        0.309
> 20G             94.493       1.511         188.777      3.014         380.633      6.020
> 50G             537.978      3.815         867.453      7.524         1559.076     15.541
>
> Test prepare:
> Below is the test table amount for different shared buffers. Each table size
> is 8k, so I use table amount = NBuffers/(512 or 256 or 128):
> shared_buffers  NBuffers  NBuffers/512  NBuffers/256  NBuffers/128
> ------------------------------------------------------------------
> 128M            16384     32            64            128
> 1G              131072    256           512           1024
> 20G             2621440   5120          10240         20480
> 50G             6553600   12800         25600         51200
>
> Besides, I also did single table performance test.
> Still, NBuffers/128 is the max relation size for which we can get
> optimization.
>
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -----------------------------------------------
> 128M            0%         0%         -1%
> 1G              0%         1%         0%
> 20G             0%         -24%       -25%
> 50G             0%         -24%       -20%
>
> Optimization details(unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -----------------------------------------------------------------------------------------------
> 128M            0.107        0.107         0.108        0.108         0.108        0.107
> 1G              0.108        0.108         0.107        0.108         0.108        0.108
> 20G             0.208        0.208         0.409        0.309         0.409        0.308
> 50G             0.309        0.308         0.408        0.309         0.509        0.408

I will also post results from my machine in the next email. Adding what Amit
mentioned, we should also test for NBuffers/64, etc. until we determine which
threshold performs worse than master.

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Thu, Dec 24, 2020 at 2:31 PM Tang, Haiying wrote: > > Hi Amit, Kirk > > >One idea could be to remove "nBlocksToInvalidate < > >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached && > >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always > >use optimized path for the tests. Then use the relation size as > >NBuffers/128, NBuffers/256, NBuffers/512 for different values of > >shared buffers as 128MB, 1GB, 20GB, 100GB. > > I followed your idea to remove check and use different relation size for > different shared buffers as 128M,1G,20G,50G(my environment can't support > 100G, so I choose 50G). > According to results, all three thresholds can get optimized, even > NBuffers/128 when shared_buffers > 128M. > IMHO, I think NBuffers/128 is the maximum relation size we can get > optimization in the three thresholds, Please let me know if I made something > wrong. > But how can we conclude NBuffers/128 is the maximum relation size? Because the maximum size would be where the performance is worse than the master, no? I guess we need to try by NBuffers/64, NBuffers/32, till we get the threshold where master performs better. 
> Recovery after vacuum test results as below 'Optimized percentage' and
> 'Optimization details(unit: second)' show:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
> NBuffers/128
> %reg: means (patched(512) - master(512)) / master(512)
>
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -----------------------------------------------
> 128M            0%         -1%        -1%
> 1G              -65%       -49%       -62%
> 20G             -98%       -98%       -98%
> 50G             -99%       -99%       -99%
>
> Optimization details(unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -----------------------------------------------------------------------------------------------
> 128M            0.108        0.108         0.109        0.108         0.109        0.108
> 1G              0.310        0.107         0.410        0.208         0.811        0.309
> 20G             94.493       1.511         188.777      3.014         380.633      6.020
> 50G             537.978      3.815         867.453      7.524         1559.076     15.541
>

I think we should find a better way to display these numbers because in cases
like where master takes 537.978s and patch takes 3.815s, it is clear that the
patch has reduced the time by more than 100 times, whereas in your table it
shows 99%.

> Test prepare:
> Below is the test table amount for different shared buffers. Each table size
> is 8k,

Table size should be more than 8k to get all this data because 8k means just
one block. I guess either it is a typo or some other mistake.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit, Kirk

>One idea could be to remove "nBlocksToInvalidate <
>BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
>nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
>use optimized path for the tests. Then use the relation size as
>NBuffers/128, NBuffers/256, NBuffers/512 for different values of
>shared buffers as 128MB, 1GB, 20GB, 100GB.

I followed your idea to remove the check and use different relation sizes for
different shared buffers as 128M, 1G, 20G, 50G (my environment can't support
100G, so I chose 50G).
According to the results, all three thresholds can get optimized, even
NBuffers/128 when shared_buffers > 128M.
IMHO, I think NBuffers/128 is the maximum relation size we can get
optimization in the three thresholds. Please let me know if I made something
wrong.

Recovery after vacuum test results as below 'Optimized percentage' and
'Optimization details(unit: second)' show:
(512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
NBuffers/128
%reg: means (patched(512) - master(512)) / master(512)

Optimized percentage:
shared_buffers  %reg(512)  %reg(256)  %reg(128)
-----------------------------------------------
128M            0%         -1%        -1%
1G              -65%       -49%       -62%
20G             -98%       -98%       -98%
50G             -99%       -99%       -99%

Optimization details(unit: second):
shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
-----------------------------------------------------------------------------------------------
128M            0.108        0.108         0.109        0.108         0.109        0.108
1G              0.310        0.107         0.410        0.208         0.811        0.309
20G             94.493       1.511         188.777      3.014         380.633      6.020
50G             537.978      3.815         867.453      7.524         1559.076     15.541

Test prepare:
Below is the test table amount for different shared buffers. Each table size
is 8k, so I use table amount = NBuffers/(512 or 256 or 128):
shared_buffers  NBuffers  NBuffers/512  NBuffers/256  NBuffers/128
------------------------------------------------------------------
128M            16384     32            64            128
1G              131072    256           512           1024
20G             2621440   5120          10240         20480
50G             6553600   12800         25600         51200

Besides, I also did a single-table performance test.
Still, NBuffers/128 is the max relation size for which we can get
optimization.

Optimized percentage:
shared_buffers  %reg(512)  %reg(256)  %reg(128)
-----------------------------------------------
128M            0%         0%         -1%
1G              0%         1%         0%
20G             0%         -24%       -25%
50G             0%         -24%       -20%

Optimization details(unit: second):
shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
-----------------------------------------------------------------------------------------------
128M            0.107        0.107         0.108        0.108         0.108        0.107
1G              0.108        0.108         0.107        0.108         0.108        0.108
20G             0.208        0.208         0.409        0.309         0.409        0.308
50G             0.309        0.308         0.408        0.309         0.509        0.408

Any question on my test results is welcome.

Regards,
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> It compiles. Passes the regression tests too.
> Your feedbacks are definitely welcome.

The code looks correct and has become further compact. Remains ready for
committer.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wed, December 23, 2020 5:57 PM (GMT+9), Amit Kapila wrote:
> >
> > At Wed, 23 Dec 2020 04:22:19 +0000, "tsunakawa.ta...@fujitsu.com"
> > wrote in
> > > From: Amit Kapila
> > > > + /* Get the number of blocks for a relation's fork */
> > > > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > > > +
> > > > + if (!cached)
> > > > + goto buffer_full_scan;
> > > >
> > > > Why do we need to use goto here? We can simply break from the loop
> > > > and then check if (cached && nBlocksToInvalidate <
> > > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto
> > > > if possible without much complexity.
> > >
> > > That's because two for loops are nested -- breaking there only exits
> > > the inner loop. (I thought the same as you at first... And I
> > > understand some people have an allergy to goto; I think modest use of
> > > goto makes the code readable.)
> >
> > I don't strongly oppose gotos, but in this case the outer loop can
> > break on the same condition as the inner loop, since cached is true
> > whenever the inner loop runs to the end. It is needed to initialize
> > the variable cached with true, instead of false, though.
> >
>
> +1. I think it is better to avoid goto here as it can be done without
> introducing any complexity or making code any less readable.

I also do not mind, so I have removed the goto and followed the advice of all
reviewers. It works fine in the latest attached patch 0003.

Attached herewith are the sets of patches. 0002 & 0003 have the following
changes:

1. I have removed the modifications in smgrnblocks(). So the modifications of
other functions that use smgrnblocks() in the previous patch versions were
also reverted.

2. Introduced a new API smgrnblocks_cached() instead, which returns either a
cached size for the specified fork or InvalidBlockNumber. Since
InvalidBlockNumber is used, I think it is logical not to use the additional
boolean parameter "cached" in the function, as it would be redundant.
In 0003, I only used "cached" as a Boolean variable to do the trick of not
using goto. This function is called both in DropRelFileNodeBuffers() and
DropRelFileNodesAllBuffers().

3. Modified some minor comments from the patch and commit logs.

It compiles. Passes the regression tests too.
Your feedback is definitely welcome.

Regards,
Kirk Jamison

v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description: v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description: v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch
v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description: v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 23, 2020 at 10:42 AM Kyotaro Horiguchi wrote:
>
> At Wed, 23 Dec 2020 04:22:19 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> > From: Amit Kapila
> > > + /* Get the number of blocks for a relation's fork */
> > > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > > +
> > > + if (!cached)
> > > + goto buffer_full_scan;
> > >
> > > Why do we need to use goto here? We can simply break from the loop and
> > > then check if (cached && nBlocksToInvalidate <
> > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > > possible without much complexity.
> >
> > That's because two for loops are nested -- breaking there only exits the
> > inner loop. (I thought the same as you at first... And I understand some
> > people have an allergy to goto; I think modest use of goto makes the code
> > readable.)
>
> I don't strongly oppose gotos, but in this case the outer loop can
> break on the same condition as the inner loop, since cached is true
> whenever the inner loop runs to the end. It is needed to initialize
> the variable cached with true, instead of false, though.
>

+1. I think it is better to avoid goto here as it can be done without
introducing any complexity or making code any less readable.

--
With Regards,
Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 23, 2020 at 1:07 PM k.jami...@fujitsu.com wrote:
>
> On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
>
> > In this code, I am slightly worried about the additional cost of each time
> > checking smgrexists. Consider a case where there are many relations and only
> > one or few of them have not cached the information, in such a case we will
> > pay the cost of smgrexists for many relations without even going to the
> > optimized path. Can we avoid that in some way or at least reduce its usage
> > to only when it is required? One idea could be that we first check if the
> > nblocks information is cached and if so then we don't need to call
> > smgrnblocks, otherwise, check if it exists. For this, we need an API like
> > smgrnblocks_cached, something we discussed earlier but preferred the
> > current API. Do you have any better ideas?
>
> Right. I understand the point that let's say there are 100 relations, and
> the first 99 non-local relations happen to enter the optimization path, but
> the last one does not; calling smgrexists() would be too costly and a waste
> of time in that case.
> The only solution I could think of, as you mentioned, is to reintroduce the
> new API which we discussed before: smgrnblocks_cached().
> It's possible that we call smgrexists() only if smgrnblocks_cached() returns
> InvalidBlockNumber.
> So if everyone agrees, we can add that API smgrnblocks_cached() which will
> include the "cached" flag parameter, and remove the additional flag
> modifications from smgrnblocks(). In this case, both DropRelFileNodeBuffers()
> and DropRelFileNodesAllBuffers() will use the new API.
>

Yeah, let's do it that way unless anyone has a better idea to suggest.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila wrote:
> > Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
>
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> ======================================================================
> 1.
> DropRelFileNodesAllBuffers()
> {
> ..
> +buffer_full_scan:
> + pfree(block);
> + nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
> + for (i = 0; i < n; i++)
> +     nodes[i] = smgr_reln[i]->smgr_rnode.node;
> +
> ..
> }
>
> How is it correct to assign nodes array directly from smgr_reln? There is no
> one-to-one correspondence. If you see the code before the patch, the passed
> array can have a mix of temp and non-temp relation information.

You are right. I mistakenly removed the array node that should have been
allocated for non-local relations. So I fixed that by doing:

    SMgrRelation *rels;

    rels = palloc(sizeof(SMgrRelation) * nnodes); /* non-local relations */

    /* If it's a local relation, it's localbuf.c's problem. */
    for (i = 0; i < nnodes; i++)
    {
        if (RelFileNodeBackendIsTemp(smgr_reln[i]->smgr_rnode))
        {
            if (smgr_reln[i]->smgr_rnode.backend == MyBackendId)
                DropRelFileNodeAllLocalBuffers(smgr_reln[i]->smgr_rnode.node);
        }
        else
            rels[n++] = smgr_reln[i];
    }
    ...
    if (n == 0)
    {
        pfree(rels);
        return;
    }
    ...
    /* traditional path: */
    pfree(block);
    nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
    for (i = 0; i < n; i++)
        nodes[i] = rels[i]->smgr_rnode.node;

> 2.
> + for (i = 0; i < n; i++)
> {
> - pfree(nodes);
> + for (j = 0; j <= MAX_FORKNUM; j++)
> + {
> + /*
> +  * Assign InvalidBlockNumber to a block if a relation
> +  * fork does not exist, so that we can skip it later
> +  * when dropping the relation buffers.
> +  */
> + if (!smgrexists(smgr_reln[i], j))
> + {
> + block[i][j] = InvalidBlockNumber;
> + continue;
> + }
> +
> + /* Get the number of blocks for a relation's fork */
> + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
>
> Similar to above, how can we assume smgr_reln array has all non-local
> relations? Have we tried the case with a mix of temp and non-temp relations?

Similar to the above reply.

> In this code, I am slightly worried about the additional cost of each time
> checking smgrexists. Consider a case where there are many relations and only
> one or few of them have not cached the information, in such a case we will
> pay the cost of smgrexists for many relations without even going to the
> optimized path. Can we avoid that in some way or at least reduce its usage to
> only when it is required? One idea could be that we first check if the
> nblocks information is cached and if so then we don't need to call
> smgrnblocks, otherwise, check if it exists. For this, we need an API like
> smgrnblocks_cached, something we discussed earlier but preferred the
> current API. Do you have any better ideas?

Right. I understand the point that let's say there are 100 relations, and
the first 99 non-local relations happen to enter the optimization path, but
the last one does not; calling smgrexists() would be too costly and a waste of
time in that case.
The only solution I could think of, as you mentioned, is to reintroduce the
new API which we discussed before: smgrnblocks_cached().
It's possible that we call smgrexists() only if smgrnblocks_cached() returns
InvalidBlockNumber.
So if everyone agrees, we can add that API smgrnblocks_cached() which will
include the "cached" flag parameter, and remove the additional flag
modifications from smgrnblocks(). In this case, both DropRelFileNodeBuffers()
and DropRelFileNodesAllBuffers() will use the new API.

Thoughts?

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
Hi,
It is possible to come out of the nested loop without goto.

+ bool cached = true;
...
+ * to that fork during recovery.
+ */
+ for (i = 0; i < n && cached; i++)
...
+ if (!cached)
+     break;

Here I changed the initial value for cached to true so that we enter the
outer loop. In place of the original goto, we break out of the inner loop
and exit the outer loop.

Cheers

On Tue, Dec 22, 2020 at 8:22 PM tsunakawa.ta...@fujitsu.com <
tsunakawa.ta...@fujitsu.com> wrote:
> From: Amit Kapila
> > + /* Get the number of blocks for a relation's fork */
> > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > +
> > + if (!cached)
> > + goto buffer_full_scan;
> >
> > Why do we need to use goto here? We can simply break from the loop and
> > then check if (cached && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > possible without much complexity.
>
> That's because two for loops are nested -- breaking there only exits the
> inner loop. (I thought the same as you at first... And I understand some
> people have an allergy to goto; I think modest use of goto makes the code
> readable.)
>
> Regards
> Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Wed, 23 Dec 2020 04:22:19 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Amit Kapila
> > + /* Get the number of blocks for a relation's fork */
> > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > +
> > + if (!cached)
> > + goto buffer_full_scan;
> >
> > Why do we need to use goto here? We can simply break from the loop and
> > then check if (cached && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > possible without much complexity.
>
> That's because two for loops are nested -- breaking there only exits the
> inner loop. (I thought the same as you at first... And I understand some
> people have an allergy to goto; I think modest use of goto makes the code
> readable.)

I don't strongly oppose goto's, but in this case the outer loop can break
on the same condition as the inner loop, since cached is true whenever the
inner loop runs to the end. It is needed to initialize the variable cached
with true, instead of false, though. The same pattern is seen in the tree.

Regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila
> + /* Get the number of blocks for a relation's fork */
> + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> +
> + if (!cached)
> + goto buffer_full_scan;
>
> Why do we need to use goto here? We can simply break from the loop and
> then check if (cached && nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> possible without much complexity.

That's because two for loops are nested -- breaking there only exits the
inner loop. (I thought the same as you at first... And I understand some
people have an allergy to goto; I think modest use of goto makes the code
readable.)

Regards
Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 5:41 PM Amit Kapila wrote:
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila wrote:
> >
> > Apart from tests, do let me know if you are happy with the changes in
> > the patch? Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
>
> In this code, I am slightly worried about the additional cost of each
> time checking smgrexists. Consider a case where there are many
> relations and only one or a few of them have not cached the information;
> in such a case we will pay the cost of smgrexists for many relations
> without even going to the optimized path. Can we avoid that in some
> way or at least reduce its usage to only when it is required? One idea
> could be that we first check if the nblocks information is cached and
> if so then we don't need to call smgrnblocks, otherwise, check if it
> exists. For this, we need an API like smgrnblocks_cached, something we
> discussed earlier but preferred the current API. Do you have any
> better ideas?
>

One more idea, which is not better than what I mentioned above, is that we
completely avoid calling smgrexists and rely on smgrnblocks. It will throw
an error in case the particular fork doesn't exist and we can use
try .. catch to handle it. I just mentioned it as it came across my mind
but I don't think it is better than the previous one.

One more thing about the patch:
+ /* Get the number of blocks for a relation's fork */
+ block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
+
+ if (!cached)
+ goto buffer_full_scan;

Why do we need to use goto here? We can simply break from the loop and
then check if (cached && nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
possible without much complexity.

--
With Regards,
Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 23, 2020 at 6:30 AM k.jami...@fujitsu.com wrote: > > On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote: > > > Apart from tests, do let me know if you are happy with the changes in the > > patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization patch. > > Thank you, Amit. > That looks more neat, combining the previous patches 0002-0003, so I am +1 > with the changes because of the clearer explanations for the threshold and > optimization path in DropRelFileNodeBuffers. Thanks for cleaning my patch > sets. > Hope we don't forget the 0001 patch's assertion in smgrextend() to ensure > that we > do it safely too and that we are not InRecovery. > I think the 0001 is mostly for test purposes but we will see once the main patches are ready. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote:
> Attached, please find the updated patch with the following modifications:
> (a) updated comments at various places, especially to tell why this is a
> safe optimization, (b) merged the patch for extending the smgrnblocks and
> vacuum optimization patch, (c) made minor cosmetic changes and ran
> pgindent, and (d) updated the commit message. BTW, this optimization will
> help not only vacuum but also truncate when it is done in the same
> transaction in which the relation is created. I would like to see certain
> tests to ensure that the value we choose for
> BUF_DROP_FULL_SCAN_THRESHOLD is correct. I see that some testing has
> been done earlier [1] for this threshold but I am still not able to
> conclude. The criteria to find the right threshold should be what is the
> maximum size of relation to be truncated above which we don't get benefit
> with this optimization.
>
> One idea could be to remove the "nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD" part of the check "if (cached &&
> nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
> uses the optimized path for the tests. Then use the relation size as
> NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
> buffers as 128MB, 1GB, 20GB, 100GB.

Alright. I will also repeat the tests with the different threshold
settings; thank you for the tip.

> Apart from tests, do let me know if you are happy with the changes in the
> patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization
> patch.

Thank you, Amit. That looks neater, combining the previous patches
0002-0003, so I am +1 with the changes because of the clearer explanations
for the threshold and optimization path in DropRelFileNodeBuffers. Thanks
for cleaning up my patch sets. I hope we don't forget the 0001 patch's
assertion in smgrextend(), to ensure that we do it safely too and that we
are not InRecovery.

> [1] -
> https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila wrote:
>
> Apart from tests, do let me know if you are happy with the changes in
> the patch? Next, I'll look into DropRelFileNodesAllBuffers()
> optimization patch.
>

Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]

1.
DropRelFileNodesAllBuffers()
{
..
+buffer_full_scan:
+ pfree(block);
+ nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
+ for (i = 0; i < n; i++)
+     nodes[i] = smgr_reln[i]->smgr_rnode.node;
+
..
}

How is it correct to assign the nodes array directly from smgr_reln? There
is no one-to-one correspondence. If you see the code before the patch, the
passed array can have a mix of temp and non-temp relation information.

2.
+ for (i = 0; i < n; i++)
{
- pfree(nodes);
+ for (j = 0; j <= MAX_FORKNUM; j++)
+ {
+ /*
+ * Assign InvalidBlockNumber to a block if a relation
+ * fork does not exist, so that we can skip it later
+ * when dropping the relation buffers.
+ */
+ if (!smgrexists(smgr_reln[i], j))
+ {
+ block[i][j] = InvalidBlockNumber;
+ continue;
+ }
+
+ /* Get the number of blocks for a relation's fork */
+ block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);

Similar to above, how can we assume the smgr_reln array has all non-local
relations? Have we tried the case with a mix of temp and non-temp
relations?

In this code, I am slightly worried about the additional cost of each time
checking smgrexists. Consider a case where there are many relations and
only one or a few of them have not cached the information; in such a case
we will pay the cost of smgrexists for many relations without even going
to the optimized path. Can we avoid that in some way or at least reduce
its usage to only when it is required? One idea could be that we first
check if the nblocks information is cached and if so then we don't need to
call smgrnblocks, otherwise, check if it exists. For this, we need an API
like smgrnblocks_cached, something we discussed earlier but preferred the
current API. Do you have any better ideas?

[1] - https://www.postgresql.org/message-id/OSBPR01MB2341882F416A282C3F7D769DEFC70%40OSBPR01MB2341.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 8:30 AM Kyotaro Horiguchi wrote:
>
> At Tue, 22 Dec 2020 02:48:22 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> > From: Amit Kapila
> > > Why would all client backends wait for AccessExclusive lock on this
> > > relation? Say, a client needs a buffer for some other relation and
> > > that might evict this buffer after we release the lock on the
> > > partition. In StrategyGetBuffer, it is important to either have a pin
> > > on the buffer or the buffer header itself must be locked to avoid
> > > getting picked as victim buffer. Am I missing something?
> >
> > Ouch, right. (The year-end business must be making me crazy...)
> >
> > So, there are two choices here:
> >
> > 1) The current patch.
> > 2) Acquire the buffer header spinlock before releasing the buffer
> > mapping lwlock, and eliminate the buffer tag comparison as follows:
> >
> > BufTableLookup();
> > LockBufHdr();
> > LWLockRelease();
> > InvalidateBuffer();
> >
> > I think both are okay. If I must choose either, I kind of prefer 1),
> > because LWLockRelease() could take a longer time to wake up other
> > processes waiting on the lwlock, which is not very good to do while
> > holding a spinlock.
>
> I like, as said before, the current patch.
>

Attached, please find the updated patch with the following modifications:
(a) updated comments at various places, especially to tell why this is a
safe optimization, (b) merged the patch for extending the smgrnblocks and
vacuum optimization patch, (c) made minor cosmetic changes and ran
pgindent, and (d) updated the commit message. BTW, this optimization will
help not only vacuum but also truncate when it is done in the same
transaction in which the relation is created. I would like to see certain
tests to ensure that the value we choose for BUF_DROP_FULL_SCAN_THRESHOLD
is correct. I see that some testing has been done earlier [1] for this
threshold but I am still not able to conclude. The criteria to find the
right threshold should be what is the maximum size of relation to be
truncated above which we don't get benefit with this optimization.

One idea could be to remove the "nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD" part of the check "if (cached &&
nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
uses the optimized path for the tests. Then use the relation size as
NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
buffers as 128MB, 1GB, 20GB, 100GB.

Apart from tests, do let me know if you are happy with the changes in the
patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization
patch.

[1] - https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.

v36-0001-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description: Binary data
RE: [Patch] Optimize dropping of relation buffers using dlist
On Monday, December 21, 2020 10:25 PM, Amit Kapila wrote: > I have started doing minor edits to the patch especially planning to write a > theory why is this optimization safe and here is what I can come up with: > "To > remove all the pages of the specified relation forks from the buffer pool, we > need to scan the entire buffer pool but we can optimize it by finding the > buffers from BufMapping table provided we know the exact size of each fork > of the relation. The exact size is required to ensure that we don't leave any > buffer for the relation being dropped as otherwise the background writer or > checkpointer can lead to a PANIC error while flushing buffers corresponding > to files that don't exist. > > To know the exact size, we rely on the size cached for each fork by us during > recovery which limits the optimization to recovery and on standbys but we > can easily extend it once we have shared cache for relation size. > > In recovery, we cache the value returned by the first lseek(SEEK_END) and > the future writes keeps the cached value up-to-date. See smgrextend. It is > possible that the value of the first lseek is smaller than the actual number > of > existing blocks in the file due to buggy Linux kernels that might not have > accounted for the recent write. But that should be fine because there must > not be any buffers after that file size. > > XXX We would make the extra lseek call for the unoptimized paths but that is > okay because we do it just for the first fork and we anyway have to scan the > entire buffer pool the cost of which is so high that the extra lseek call > won't > make any visible difference. However, we can use InRecovery flag to avoid the > additional cost but that doesn't seem worth it." > > Thoughts? +1 Thank you very much for expanding the comments to carefully explain the reason on why the optimization is safe. 
I was also struggling to explain it completely but your description also covers the possibility of extending the optimization in the future once we have shared cache for rel size. So I like this addition. (Also, it seems that we have concluded to retain the locking mechanism of the existing patch based from the recent email exchanges. Both the traditional path and the optimized path do the rechecking. So there seems to be no problem, I'm definitely fine with it.) Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > Mmm. If that is true, doesn't the unoptimized path also need the > rechecking? Yes, the traditional processing does the recheck after acquiring the buffer header spinlock. Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 22 Dec 2020 02:48:22 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Amit Kapila
> > Why would all client backends wait for AccessExclusive lock on this
> > relation? Say, a client needs a buffer for some other relation and
> > that might evict this buffer after we release the lock on the
> > partition. In StrategyGetBuffer, it is important to either have a pin
> > on the buffer or the buffer header itself must be locked to avoid
> > getting picked as victim buffer. Am I missing something?
>
> Ouch, right. (The year-end business must be making me crazy...)
>
> So, there are two choices here:
>
> 1) The current patch.
> 2) Acquire the buffer header spinlock before releasing the buffer mapping
> lwlock, and eliminate the buffer tag comparison as follows:
>
> BufTableLookup();
> LockBufHdr();
> LWLockRelease();
> InvalidateBuffer();
>
> I think both are okay. If I must choose either, I kind of prefer 1),
> because LWLockRelease() could take a longer time to wake up other
> processes waiting on the lwlock, which is not very good to do while
> holding a spinlock.

I like, as said before, the current patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 8:12 AM Kyotaro Horiguchi wrote: > > At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila > wrote in > > > Why would all client backends wait for AccessExclusive lock on this > > relation? Say, a client needs a buffer for some other relation and > > that might evict this buffer after we release the lock on the > > partition. In StrategyGetBuffer, it is important to either have a pin > > on the buffer or the buffer header itself must be locked to avoid > > getting picked as victim buffer. Am I missing something? > > I think exactly like that. If we acquire the bufHdr lock before > releasing the partition lock, that steal doesn't happen but it doesn't > seem good as a locking protocol. > Right, so let's keep the code as it is but I feel it is better to add some comments explaining the rationale behind this code. -- With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 8:18 AM tsunakawa.ta...@fujitsu.com wrote: > > From: Amit Kapila > > Why would all client backends wait for AccessExclusive lock on this > > relation? Say, a client needs a buffer for some other relation and > > that might evict this buffer after we release the lock on the > > partition. In StrategyGetBuffer, it is important to either have a pin > > on the buffer or the buffer header itself must be locked to avoid > > getting picked as victim buffer. Am I missing something? > > Ouch, right. (The year-end business must be making me crazy...) > > So, there are two choices here: > > 1) The current patch. > 2) Acquire the buffer header spinlock before releasing the buffer mapping > lwlock, and eliminate the buffer tag comparison as follows: > > BufTableLookup(); > LockBufHdr(); > LWLockRelease(); > InvalidateBuffer(); > > I think both are okay. If I must choose either, I kind of prefer 1), because > LWLockRelease() could take longer time to wake up other processes waiting on > the lwlock, which is not very good to do while holding a spinlock. > > I also prefer (1). I will add some comments about the locking protocol in the next version of the patch. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila
> Why would all client backends wait for AccessExclusive lock on this
> relation? Say, a client needs a buffer for some other relation and
> that might evict this buffer after we release the lock on the
> partition. In StrategyGetBuffer, it is important to either have a pin
> on the buffer or the buffer header itself must be locked to avoid
> getting picked as victim buffer. Am I missing something?

Ouch, right. (The year-end business must be making me crazy...)

So, there are two choices here:

1) The current patch.
2) Acquire the buffer header spinlock before releasing the buffer mapping
lwlock, and eliminate the buffer tag comparison as follows:

BufTableLookup();
LockBufHdr();
LWLockRelease();
InvalidateBuffer();

I think both are okay. If I must choose either, I kind of prefer 1),
because LWLockRelease() could take a longer time to wake up other processes
waiting on the lwlock, which is not very good to do while holding a
spinlock.

Regards
Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila wrote in
> On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.ta...@fujitsu.com wrote:
> >
> > From: Amit Kapila
> > > This answers the second part of the question but what about the first
> > > part (We hold a buffer partition lock, and have done a lookup in the
> > > mapping table. Why are we then rechecking the
> > > relfilenode/fork/blocknum?)
> > >
> > > I think we don't need such a check, rather we can have an Assert
> > > corresponding to that if-condition in the patch. I understand it is
> > > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > > of the code.
> >
> > Hmm, you're right. I thought someone else could steal the found buffer
> > and use it for another block because the buffer mapping lwlock is
> > released without pinning the buffer or acquiring the buffer header
> > spinlock.
> >
> > However, in this case (replay of TRUNCATE during recovery), nobody
> > steals the buffer: bgwriter or checkpointer doesn't use a buffer for a
> > new block, and the client backend waits for AccessExclusive lock.
> >

I understood that you are thinking that the rechecking is useless.

> Why would all client backends wait for AccessExclusive lock on this
> relation? Say, a client needs a buffer for some other relation and
> that might evict this buffer after we release the lock on the
> partition. In StrategyGetBuffer, it is important to either have a pin
> on the buffer or the buffer header itself must be locked to avoid
> getting picked as victim buffer. Am I missing something?

I think exactly like that. If we acquire the bufHdr lock before
releasing the partition lock, that steal doesn't happen but it doesn't
seem good as a locking protocol.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 22 Dec 2020 01:42:55 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Amit Kapila
> > This answers the second part of the question but what about the first
> > part (We hold a buffer partition lock, and have done a lookup in the
> > mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum?)
> >
> > I think we don't need such a check, rather we can have an Assert
> > corresponding to that if-condition in the patch. I understand it is
> > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > of the code.
>
> Hmm, you're right. I thought someone else could steal the found
> buffer and use it for another block because the buffer mapping
> lwlock is released without pinning the buffer or acquiring the
> buffer header spinlock. However, in this case (replay of TRUNCATE
> during recovery), nobody steals the buffer: bgwriter or checkpointer
> doesn't use a buffer for a new block, and the client backend waits
> for AccessExclusive lock.

Mmm. If that is true, doesn't the unoptimized path also need the
rechecking? The AEL doesn't work for a buffer block. No new block can be
allocated for the relation, but still BufferAlloc can steal the block for
other relations since the AEL doesn't work for each buffer block. Am I
still missing something?

> > I have started doing minor edits to the patch especially planning to
> > write a theory why is this optimization safe and here is what I can
> > come up with:
>
> Thank you, that's fluent and easier to understand.

+1

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Amit Kapila
> > This answers the second part of the question but what about the first
> > part (We hold a buffer partition lock, and have done a lookup in the
> > mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum?)
> >
> > I think we don't need such a check, rather we can have an Assert
> > corresponding to that if-condition in the patch. I understand it is
> > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > of the code.
>
> Hmm, you're right. I thought someone else could steal the found buffer and
> use it for another block because the buffer mapping lwlock is released
> without pinning the buffer or acquiring the buffer header spinlock.
>

Okay, I see your point.

> However, in this case (replay of TRUNCATE during recovery), nobody steals
> the buffer: bgwriter or checkpointer doesn't use a buffer for a new block,
> and the client backend waits for AccessExclusive lock.
>

Why would all client backends wait for AccessExclusive lock on this
relation? Say, a client needs a buffer for some other relation and
that might evict this buffer after we release the lock on the
partition. In StrategyGetBuffer, it is important to either have a pin
on the buffer or the buffer header itself must be locked to avoid
getting picked as victim buffer. Am I missing something?

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila
> This answers the second part of the question but what about the first
> part (We hold a buffer partition lock, and have done a lookup in the
> mapping table. Why are we then rechecking the
> relfilenode/fork/blocknum?)
>
> I think we don't need such a check, rather we can have an Assert
> corresponding to that if-condition in the patch. I understand it is
> safe to compare relfilenode/fork/blocknum but it might confuse readers
> of the code.

Hmm, you're right. I thought someone else could steal the found buffer and
use it for another block because the buffer mapping lwlock is released
without pinning the buffer or acquiring the buffer header spinlock.
However, in this case (replay of TRUNCATE during recovery), nobody steals
the buffer: bgwriter or checkpointer doesn't use a buffer for a new block,
and the client backend waits for AccessExclusive lock.

> I have started doing minor edits to the patch especially planning to
> write a theory why is this optimization safe and here is what I can
> come up with:

Thank you, that's fluent and easier to understand.

Regards
Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
On Thu, Nov 19, 2020 at 12:37 PM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Andres Freund
> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum, int nforks,
> > +     BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
> > ...
> > +     /* Check that it is in the buffer pool. If not, do nothing. */
> > +     LWLockAcquire(bufPartitionLock, LW_SHARED);
> > +     buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> > +     bufHdr = GetBufferDescriptor(buf_id);
> > +
> > +     buf_state = LockBufHdr(bufHdr);
> > +
> > +     if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > +         bufHdr->tag.forkNum == forkNum[i] &&
> > +         bufHdr->tag.blockNum >= firstDelBlock[i])
> > +         InvalidateBuffer(bufHdr); /* releases spinlock */
> > +     else
> > +         UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition lock,
> > and have done a lookup in the mapping table. Why are we then rechecking
> > the relfilenode/fork/blocknum? And why are we doing so holding the buffer
> > header lock, which is essentially a spinlock, so should only ever be held
> > for very short portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc.,
> > but there the situation is different: We haven't done a buffer mapping
> > lookup, and we don't hold a partition lock!
>
> That's because the buffer partition lock is released immediately after the
> hash table has been looked up. As an aside, InvalidateBuffer() requires
> the caller to hold the buffer header spinlock and doesn't hold the buffer
> partition lock.
>

This answers the second part of the question but what about the first part
(We hold a buffer partition lock, and have done a lookup in the mapping
table. Why are we then rechecking the relfilenode/fork/blocknum?)

I think we don't need such a check, rather we can have an Assert
corresponding to that if-condition in the patch. I understand it is safe
to compare relfilenode/fork/blocknum but it might confuse readers of the
code.

I have started doing minor edits to the patch especially planning to write
a theory why is this optimization safe and here is what I can come up
with:

"To remove all the pages of the specified relation forks from the buffer
pool, we need to scan the entire buffer pool but we can optimize it by
finding the buffers from BufMapping table provided we know the exact size
of each fork of the relation. The exact size is required to ensure that we
don't leave any buffer for the relation being dropped as otherwise the
background writer or checkpointer can lead to a PANIC error while flushing
buffers corresponding to files that don't exist.

To know the exact size, we rely on the size cached for each fork by us
during recovery which limits the optimization to recovery and on standbys
but we can easily extend it once we have shared cache for relation size.

In recovery, we cache the value returned by the first lseek(SEEK_END) and
the future writes keep the cached value up-to-date. See smgrextend. It is
possible that the value of the first lseek is smaller than the actual
number of existing blocks in the file due to buggy Linux kernels that
might not have accounted for the recent write. But that should be fine
because there must not be any buffers after that file size.

XXX We would make the extra lseek call for the unoptimized paths but that
is okay because we do it just for the first fork and we anyway have to
scan the entire buffer pool the cost of which is so high that the extra
lseek call won't make any visible difference. However, we can use the
InRecovery flag to avoid the additional cost but that doesn't seem worth
it."

Thoughts?

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hello Kirk,

I noticed you have pushed a new version of your patch which has some
changes on TRUNCATE on TOAST relation. Although you've done a performance
test for your changed part, I'd like to do a double check for your patch
(hope you don't mind). Below are the updated recovery performance test
results for your new patch. All seems good.

*TOAST relation with PLAIN strategy like integer:

1. Recovery after VACUUM test results (average of 15 times)

shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
--------------------------------------------------------------------------
128M            2.111        1.604         -24%
10G             57.135       1.878         -97%
20G             167.122      1.932         -99%

2. Recovery after TRUNCATE test results (average of 15 times)

shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
--------------------------------------------------------------------------
128M            2.326        1.718         -26%
10G             82.397       1.738         -98%
20G             169.275      1.718         -99%

*TOAST relation with NON-PLAIN strategy like text/varchar:

1. Recovery after VACUUM test results (average of 15 times)

shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
--------------------------------------------------------------------------
128M            3.174        2.493         -21%
10G             72.716       2.246         -97%
20G             163.660      2.474         -98%

2. Recovery after TRUNCATE test results (average of 15 times):
Although it looks like there are some improvements after the patch was
applied, I think that's because of the average calculation. TRUNCATE
results should be similar between master and patched because they all do
a full scan.

shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
--------------------------------------------------------------------------
128M            4.978        4.958         0%
10G             97.048       88.751        -9%
20G             183.230      173.226       -5%

[Machine spec]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 128G
OS: CentOS 8

[Failover test data]
Total table Size: 600M
Table: 1 tables (1000 rows per table)

[Configure in postgresql.conf]
autovacuum = off
wal_level = replica
max_wal_senders = 5
max_locks_per_transaction = 1

If you have any questions on my test results, please let me know.

Regards
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク > Attached are the final updated patches. Looks good, and the patch remains ready for committer. (Personally, I wanted the code comment to touch upon the TOAST and FSM/VM for the reader, because we couldn't think of those possibilities and took some time to find why the optimization path wasn't taken.) Regards Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Friday, December 11, 2020 10:27 AM, Amit Kapila wrote:
> On Fri, Dec 11, 2020 at 5:54 AM k.jami...@fujitsu.com
> wrote:
> > So should I still not include that information?
> >
>
> I think we can extend your existing comment like: "Otherwise if the size of a
> relation fork is not cached, we proceed to a full scan of the whole buffer pool.
> This can happen if there is no update to a particular fork during recovery."

Attached are the final updated patches. I followed this advice and updated the source code comment a little bit. There are no changes from the previous version except that and the removal of the unnecessary "cached" condition which Tsunakawa-san mentioned.

Below are also the updated recovery performance test results for TRUNCATE. (1000 tables, 1MB per table, results measured in seconds)

| s_b   | Master | Patched | % Reg   |
|-------|--------|---------|---------|
| 128MB | 0.406  | 0.406   | 0%      |
| 512MB | 0.506  | 0.406   | -25%    |
| 1GB   | 0.806  | 0.406   | -99%    |
| 20GB  | 15.224 | 0.406   | -3650%  |
| 100GB | 81.506 | 0.406   | -19975% |

Because of the size of the relation, it is expected to enter the full scan for the 128MB shared_buffers setting. And there was no regression. Similar to previous test results, the recovery time was constant for all shared_buffers settings with the patches applied.

Regards, Kirk Jamison

v35-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description: v35-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
v35-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description: v35-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
v35-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v35-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description: v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
RE: [Patch] Optimize dropping of relation buffers using dlist
From: tsunakawa.ta...@fujitsu.com > What's valuable as a code comment to describe the remaining issue is that the You can attach XXX or FIXME in front of the issue description for easier search. (XXX appears to be used much more often in Postgres.) Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Dec 11, 2020 at 5:54 AM k.jami...@fujitsu.com wrote:
>
> On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> > On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
> > wrote:
> > >
> > > Yes, I have tested that optimization works for index relations.
> > >
> > > I have attached the V34, following the conditions that we use the "cached"
> > > flag for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers()
> > > for consistency.
> > > I added a comment in 0004 on the limitation of the optimization when there
> > > are TOAST relations that use a NON-PLAIN strategy. i.e. The optimization
> > > works if the data types used are integers, OID, bytea, etc. But for
> > > TOAST-able data types like text, the optimization will be skipped and
> > > force a full scan during recovery.
> > >
> >
> > AFAIU, it won't take the optimization path only when we have a TOAST relation
> > but there is no insertion corresponding to it. If so, then we don't need to
> > mention it specifically because there are other similar cases where the
> > optimization won't work, like when during recovery we have to just perform
> > TRUNCATE.
> >
>
> Right, I forgot to add that there should be an update like an insert to the
> TOAST relation for the truncate optimization to work. However, that is only
> limited to TOAST relations with a PLAIN strategy. I have tested with the text
> data type, with inserts before truncate, and it did not enter the
> optimization path.
>

I think you are seeing this because the text datatype allows creating toast storage, but your data is small enough that it doesn't actually get toasted.

> OTOH,
> it worked for a data type like integer.
>

It is not related to any particular datatype; it can happen whenever we don't have any operation on any of the forks after the start of recovery.

> So should I still not include that information?
>

I think we can extend your existing comment like: "Otherwise if the size of a relation fork is not cached, we proceed to a full scan of the whole buffer pool.
This can happen if there is no update to a particular fork during recovery." -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> > AFAIU, it won't take optimization path only when we have TOAST relation but
> > there is no insertion corresponding to it. If so, then we don't need to
> > mention it specifically because there are other similar cases where the
> > optimization won't work like when during recovery we have to just perform
> > TRUNCATE.
> >
>
> Right, I forgot to add that there should be an update like insert to the TOAST
> relation for truncate optimization to work. However, that is only limited to
> TOAST relations with PLAIN strategy. I have tested with text data type, with
> Inserts before truncate, and it did not enter the optimization path. OTOH,
> It worked for data type like integer. So should I still not include that
> information?

What's valuable as a code comment to describe the remaining issue is that the reader can find clues as to whether this is related to the problem he/she has hit, and/or how to solve the issue. I don't think the current comment is so bad in that regard, but it seems better to add:

* The condition of the issue: the table's ancillary storage (index, TOAST table, FSM, VM, etc.) was not updated during recovery. (As an aside, "during recovery" here does not mean "after the last checkpoint" but "from the start of recovery", because the standby experiences many checkpoints (the correct term is restartpoints in case of standby).)

* The cause as a hint to solve the issue: The startup process does not find page modification WAL records. As a result, it doesn't call XLogReadBufferExtended() or the smgrnblocks() called therein, so the relation/fork size is not cached.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
> wrote:
> >
> > Yes, I have tested that optimization works for index relations.
> >
> > I have attached the V34, following the conditions that we use the "cached"
> > flag for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers()
> > for consistency.
> > I added a comment in 0004 on the limitation of the optimization when there
> > are TOAST relations that use a NON-PLAIN strategy. i.e. The optimization
> > works if the data types used are integers, OID, bytea, etc. But for
> > TOAST-able data types like text, the optimization will be skipped and force a
> > full scan during recovery.
> >
>
> AFAIU, it won't take the optimization path only when we have a TOAST relation
> but there is no insertion corresponding to it. If so, then we don't need to
> mention it specifically because there are other similar cases where the
> optimization won't work, like when during recovery we have to just perform
> TRUNCATE.
>

Right, I forgot to add that there should be an update, like an insert, to the TOAST relation for the truncate optimization to work. However, that is only limited to TOAST relations with a PLAIN strategy. I have tested with the text data type, with inserts before truncate, and it did not enter the optimization path. OTOH, it worked for a data type like integer. So should I still not include that information?

Also, I will remove the unnecessary "cached" from the line that Tsunakawa-san mentioned. I will wait for a few more comments before re-uploading what will hopefully be the final version, including the test for truncate.

Regards, Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com wrote: > > Yes, I have tested that optimization works for index relations. > > I have attached the V34, following the conditions that we use "cached" flag > for both DropRelFileNodesBuffers() and DropRelFileNodesBuffers() for > consistency. > I added comment in 0004 the limitation of optimization when there are TOAST > relations that use NON-PLAIN strategy. i.e. The optimization works if the data > types used are integers, OID, bytea, etc. But for TOAST-able data types like > text, > the optimization will be skipped and force a full scan during recovery. > AFAIU, it won't take optimization path only when we have TOAST relation but there is no insertion corresponding to it. If so, then we don't need to mention it specifically because there are other similar cases where the optimization won't work like when during recovery we have to just perform TRUNCATE. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> I added comment in 0004 the limitation of optimization when there are TOAST
> relations that use NON-PLAIN strategy. i.e. The optimization works if the data
> types used are integers, OID, bytea, etc. But for TOAST-able data types like
> text, the optimization will be skipped and force a full scan during recovery.

bytea is a TOAST-able type.

+	/*
+	 * Enter the optimization if the total number of blocks to be
+	 * invalidated for all relations is below the full scan threshold.
+	 */
+	if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

Checking cached here doesn't seem to be necessary, because if cached is false, the control goes to the full scan path as below:

+	if (!cached)
+		goto buffer_full_scan;

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, December 10, 2020 12:27 PM, Amit Kapila wrote:
> On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi
> wrote:
> >
> > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila
> > wrote in
> > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
> > > > Mmm. At least btree doesn't need to call smgrnblocks except at
> > > > expansion, so we cannot get to the optimized path in major cases
> > > > of truncation involving btree (and/or maybe other indexes).
> > > >
> > >
> > > AFAICS, btree insert should call smgrnblocks via
> > > btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks.
> > > Similarly delete should also call smgrnblocks. Can you be bit more
> > > specific related to the btree case you have in mind?
> >
> > Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is
> > called during buffer loading while recovery. So, smgrnblock is called
> > for indexes if any update happens on the heap relation.
> >
>
> Okay, so this means that we can get the benefit of optimization in many cases
> in the Truncate code path as well even if we use 'cached'
> flag? If so, then I would prefer to keep the code consistent for both vacuum
> and truncate recovery code path.

Yes, I have tested that optimization works for index relations.

I have attached the V34, following the conditions that we use the "cached" flag for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() for consistency.

I added a comment in 0004 on the limitation of the optimization when there are TOAST relations that use a NON-PLAIN strategy, i.e. the optimization works if the data types used are integers, OID, bytea, etc., but for TOAST-able data types like text, the optimization will be skipped, forcing a full scan during recovery.
Regards, Kirk Jamison v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch Description: v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch Description: v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch Description: v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch Description: v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Re: [Patch] Optimize dropping of relation buffers using dlist
On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi wrote: > > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila > wrote in > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi > > > Mmm. At least btree doesn't need to call smgrnblocks except at > > > expansion, so we cannot get to the optimized path in major cases of > > > truncation involving btree (and/or maybe other indexes). > > > > > > > AFAICS, btree insert should call smgrnblocks via > > btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks. > > Similarly delete should also call smgrnblocks. Can you be bit more > > specific related to the btree case you have in mind? > > Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is > called during buffer loading while recovery. So, smgrnblock is called > for indexes if any update happens on the heap relation. > Okay, so this means that we can get the benefit of optimization in many cases in the Truncate code path as well even if we use 'cached' flag? If so, then I would prefer to keep the code consistent for both vacuum and truncate recovery code path. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is > called during buffer loading while recovery. So, smgrnblock is called > for indexes if any update happens on the heap relation. I misunderstood that you said there's no problem with the TOAST index because TRUNCATE creates the meta page, resulting in the caching of the page and size of the relation. Anyway, I'm relieved the concern disappeared. Then, I'd like to hear your vote in my previous mail... Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila wrote in
> On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
> wrote:
> >
> > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila
> > wrote in
> > > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com
> > > wrote:
> > > >
> > > > From: Jamison, Kirk/ジャミソン カーク
> > > > > Because one of the rel's cached value was false, it forced the
> > > > > full-scan path for TRUNCATE.
> > > > > Is there a possible workaround for this?
> > > >
> > > > Hmm, the other two relfilenodes are for the TOAST table and index of
> > > > the target table. I think the INSERT didn't access those TOAST
> > > > relfilenodes because the inserted data was stored in the main storage.
> > > > But TRUNCATE always truncates all the three relfilenodes. So, the
> > > > standby had not opened the relfilenode for the TOAST stuff or cached
> > > > its size when replaying the TRUNCATE.
> > > >
> > > > I'm afraid this is more common than we can ignore and accept the slow
> > > > traditional path, but I don't think of a good idea to use the cached
> > > > flag.
> > > >
> > >
> > > I also can't think of a way to use an optimized path for such cases
> > > but I don't agree with your comment on if it is common enough that we
> > > leave this optimization entirely for the truncate path.
> >
> > Mmm. At least btree doesn't need to call smgrnblocks except at
> > expansion, so we cannot get to the optimized path in major cases
> > of truncation involving btree (and/or maybe other indexes).
> >
>
> AFAICS, btree insert should call smgrnblocks via
> btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks.
> Similarly delete should also call smgrnblocks. Can you be bit more
> specific related to the btree case you have in mind?

Oh, sorry. I was wrongly looking at the non-recovery path. smgrnblocks is called during buffer loading during recovery. So, smgrnblocks is called for indexes if any update happens on the heap relation.
regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi wrote: > > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila > wrote in > > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com > > wrote: > > > > > > From: Jamison, Kirk/ジャミソン カーク > > > > Because one of the rel's cached value was false, it forced the > > > > full-scan path for TRUNCATE. > > > > Is there a possible workaround for this? > > > > > > Hmm, the other two relfilenodes are for the TOAST table and index of the > > > target table. I think the INSERT didn't access those TOAST relfilenodes > > > because the inserted data was stored in the main storage. But TRUNCATE > > > always truncates all the three relfilenodes. So, the standby had not > > > opened the relfilenode for the TOAST stuff or cached its size when > > > replaying the TRUNCATE. > > > > > > I'm afraid this is more common than we can ignore and accept the slow > > > traditional path, but I don't think of a good idea to use the cached flag. > > > > > > > I also can't think of a way to use an optimized path for such cases > > but I don't agree with your comment on if it is common enough that we > > leave this optimization entirely for the truncate path. > > Mmm. At least btree doesn't need to call smgrnblocks except at > expansion, so we cannot get to the optimized path in major cases of > truncation involving btree (and/or maybe other indexes). > AFAICS, btree insert should call smgrnblocks via btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks. Similarly delete should also call smgrnblocks. Can you be bit more specific related to the btree case you have in mind? -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wednesday, December 9, 2020 10:58 AM, Tsunakawa, Takayuki wrote: > From: Kyotaro Horiguchi > > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila > > wrote in > > I also can't think of a way to use an optimized path for such cases > > > but I don't agree with your comment on if it is common enough that > > > we leave this optimization entirely for the truncate path. > > > > An ugly way to cope with it would be to let other smgr functions > > manage the cached value, for example, by calling smgrnblocks while > > InRecovery. Or letting smgr remember the maximum block number ever > > accessed. But we cannot fully rely on that since smgr can be closed > > midst of a session and smgr doesn't offer such persistence. In the > > first place smgr doesn't seem to be the place to store such persistent > > information. > > Yeah, considering the future evolution of this patch to operations during > normal running, I don't think that would be a good fit, either. > > Then, the as we're currently targeting just recovery, the options we can take > are below. Which would vote for? My choice would be (3) > (2) > (1). > > > (1) > Use the cached flag in both VACUUM (0003) and TRUNCATE (0004). > This brings the most safety and code consistency. > But this would not benefit from optimization for TRUNCATE in unexpectedly > many cases -- when TOAST storage exists but it's not written, or FSM/VM is > not updated after checkpoint. > > > (2) > Use the cached flag in VACUUM (0003), but use InRecovery instead of the > cached flag in TRUNCATE (0004). > This benefits from the optimization in all cases. > But this lacks code consistency. > You may be afraid of safety if the startup process smgrclose()s the relation > after the shared buffer flushing hits disk full. However, startup process > doesn't smgrclose(), so it should be safe. 
> Just in case the startup process
> smgrclose()s, the worst consequence is PANIC shutdown after repeated
> failure of checkpoints due to lingering orphaned dirty shared buffers. Accept
> it as Thomas-san's devil's suggestion.
>
>
> (3)
> Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
> This benefits from the optimization in all cases.
> The code is consistent and smaller.
> As for the safety, this is the same as (2), but it applies to VACUUM as well.

If we want code consistency, then we'd go with either (1) or (3). And if we want to take the benefits of optimization for both DropRelFileNodeBuffers and DropRelFileNodesAllBuffers, then I'd choose (3). However, if the reviewers and committer want to make use of the "cached" flag, then we can live with the "cached" value in place there, even if it's not common to get the optimization for the TRUNCATE path. So only VACUUM would get the most benefit. My vote is also (3), then (2), and (1).

Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi
> At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila wrote in
> > I also can't think of a way to use an optimized path for such cases
> > but I don't agree with your comment on if it is common enough that we
> > leave this optimization entirely for the truncate path.
>
> An ugly way to cope with it would be to let other smgr functions
> manage the cached value, for example, by calling smgrnblocks while
> InRecovery. Or letting smgr remember the maximum block number ever
> accessed. But we cannot fully rely on that since smgr can be closed
> midst of a session and smgr doesn't offer such persistence. In the
> first place smgr doesn't seem to be the place to store such persistent
> information.

Yeah, considering the future evolution of this patch to operations during normal running, I don't think that would be a good fit, either.

Then, as we're currently targeting just recovery, the options we can take are below. Which would you vote for? My choice would be (3) > (2) > (1).


(1)
Use the cached flag in both VACUUM (0003) and TRUNCATE (0004).
This brings the most safety and code consistency.
But this would not benefit from optimization for TRUNCATE in unexpectedly many cases -- when TOAST storage exists but it's not written, or FSM/VM is not updated after checkpoint.


(2)
Use the cached flag in VACUUM (0003), but use InRecovery instead of the cached flag in TRUNCATE (0004).
This benefits from the optimization in all cases.
But this lacks code consistency.
You may be afraid of safety if the startup process smgrclose()s the relation after the shared buffer flushing hits disk full. However, the startup process doesn't smgrclose(), so it should be safe. Just in case the startup process smgrclose()s, the worst consequence is PANIC shutdown after repeated failure of checkpoints due to lingering orphaned dirty shared buffers. Accept it as Thomas-san's devil's suggestion.


(3)
Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
This benefits from the optimization in all cases. The code is consistent and smaller. As for the safety, this is the same as (2), but it applies to VACUUM as well. Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila wrote in
> On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com
> wrote:
> >
> > From: Jamison, Kirk/ジャミソン カーク
> > > Because one of the rel's cached value was false, it forced the
> > > full-scan path for TRUNCATE.
> > > Is there a possible workaround for this?
> >
> > Hmm, the other two relfilenodes are for the TOAST table and index of the
> > target table. I think the INSERT didn't access those TOAST relfilenodes
> > because the inserted data was stored in the main storage. But TRUNCATE
> > always truncates all the three relfilenodes. So, the standby had not
> > opened the relfilenode for the TOAST stuff or cached its size when
> > replaying the TRUNCATE.
> >
> > I'm afraid this is more common than we can ignore and accept the slow
> > traditional path, but I don't think of a good idea to use the cached flag.
> >
>
> I also can't think of a way to use an optimized path for such cases
> but I don't agree with your comment on if it is common enough that we
> leave this optimization entirely for the truncate path.

Mmm. At least btree doesn't need to call smgrnblocks except at expansion, so we cannot get to the optimized path in major cases of truncation involving btree (and/or maybe other indexes). TOAST relations are not accessed until we insert/update/retrieve the values in them.

An ugly way to cope with it would be to let other smgr functions manage the cached value, for example, by calling smgrnblocks while InRecovery. Or letting smgr remember the maximum block number ever accessed. But we cannot fully rely on that since smgr can be closed in the midst of a session and smgr doesn't offer such persistence. In the first place, smgr doesn't seem to be the place to store such persistent information.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Jamison, Kirk/ジャミソン カーク
> > Because one of the rel's cached value was false, it forced the
> > full-scan path for TRUNCATE.
> > Is there a possible workaround for this?
>
> Hmm, the other two relfilenodes are for the TOAST table and index of the
> target table. I think the INSERT didn't access those TOAST relfilenodes
> because the inserted data was stored in the main storage. But TRUNCATE
> always truncates all the three relfilenodes. So, the standby had not opened
> the relfilenode for the TOAST stuff or cached its size when replaying the
> TRUNCATE.
>
> I'm afraid this is more common than we can ignore and accept the slow
> traditional path, but I don't think of a good idea to use the cached flag.
>

I also can't think of a way to use an optimized path for such cases, but I don't agree with your comment that it is common enough that we should leave this optimization out entirely for the truncate path.

-- 
With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> Because one of the rel's cached value was false, it forced the
> full-scan path for TRUNCATE.
> Is there a possible workaround for this?

Hmm, the other two relfilenodes are for the TOAST table and index of the target table. I think the INSERT didn't access those TOAST relfilenodes because the inserted data was stored in the main storage. But TRUNCATE always truncates all three relfilenodes. So, the standby had not opened the relfilenode for the TOAST stuff or cached its size when replaying the TRUNCATE.

I'm afraid this case is too common for us to just ignore it and accept the slow traditional path, but I can't think of a good way to use the cached flag.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Tuesday, December 8, 2020 2:35 PM, Amit Kapila wrote: > On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi > wrote: > > > > At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila > > wrote in > > > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi > > > wrote: > > > > We drop > > > > buffers for the old relfilenode on truncation anyway. > > > > > > > > What I did is: > > > > > > > > a: Create a physical replication pair. > > > > b: On the master, create a table. (without explicitly starting a > > > > tx) > > > > c: On the master, insert a tuple into the table. > > > > d: On the master truncate the table. > > > > > > > > On the standby, smgrnblocks is called for the old relfilenode of > > > > the table at c, then the same function is called for the same > > > > relfilenode at d and the function takes the cached path. > > > > > > > > > > This is on the lines I have tried for recovery. So, it seems we are > > > in agreement that we can use the 'cached' flag in > > > DropRelFileNodesAllBuffers and it will take the optimized path in > > > many such cases, right? > > > > > > Mmm. There seems to be a misunderstanding.. What I opposed to is > > referring only to InRecovery and ignoring the value of "cached". > > > > Okay, I think it was Kirk-San who proposed to use InRecovery and ignoring > the value of "cached" based on the theory that even if Insert (or other DMLs) > are done before Truncate, it won't use an optimized path and I don't agree > with the same. So, I did a small test to check the same and found that it > should use the optimized path and the same is true for the experiment done > by you. I am not sure why Kirk-San is seeing something different? > > > The remaining issue is we don't get to the optimized path when a > > standby makes the first call to smgrnblocks() when truncating a > > relation. Still we can get to the optimized path as far as any > > update(+insert) or select is performed earlier on the relation so I > > think it doesn't matter so match. 
> >
> > +1.

My question/proposal before was to either use InRecovery, or completely drop the smgrnblocks "cached" flag. But that comes from the results of my investigation below, where I used "cached" in DropRelFileNodesAllBuffers(). The optimization path was skipped because one of the rels' "cached" values was "false".

Test case (shared_buffers = 1GB):
0. Set up physical replication between master and standby.
1. Create 1 table.
2. Insert data (1MB) into the table. 16385 is the relnode for the insert
   (both master and standby).
3. Pause WAL replay on the standby.
4. TRUNCATE the table on the primary. nrels = 3. relNodes 16389, 16388, 16385.
5. Stop the primary.
6. Promote the standby and resume WAL recovery.
   nrels = 3
   1st rel's check for optimization: "cached" is TRUE. relNode = 16389.
   2nd rel's check for optimization: "cached" was returned as FALSE by
   smgrnblocks. relNode = 16388.
   Since one of the rels' "cached" values is FALSE, the optimization check for
   the 3rd relation and the whole optimization itself are skipped. Go to the
   full-scan path in DropRelFileNodesAllBuffers(). Then smgrclose for relNodes
   16389, 16388, 16385.

Because one of the rels' cached values was false, it forced the full-scan path for TRUNCATE. Is there a possible workaround for this?

Regards, Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi wrote: > > At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila > wrote in > > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi > > wrote: > > > We drop > > > buffers for the old relfilenode on truncation anyway. > > > > > > What I did is: > > > > > > a: Create a physical replication pair. > > > b: On the master, create a table. (without explicitly starting a tx) > > > c: On the master, insert a tuple into the table. > > > d: On the master truncate the table. > > > > > > On the standby, smgrnblocks is called for the old relfilenode of the > > > table at c, then the same function is called for the same relfilenode > > > at d and the function takes the cached path. > > > > > > > This is on the lines I have tried for recovery. So, it seems we are in > > agreement that we can use the 'cached' flag in > > DropRelFileNodesAllBuffers and it will take the optimized path in many > > such cases, right? > > > Mmm. There seems to be a misunderstanding.. What I opposed to is > referring only to InRecovery and ignoring the value of "cached". > Okay, I think it was Kirk-San who proposed to use InRecovery and ignoring the value of "cached" based on the theory that even if Insert (or other DMLs) are done before Truncate, it won't use an optimized path and I don't agree with the same. So, I did a small test to check the same and found that it should use the optimized path and the same is true for the experiment done by you. I am not sure why Kirk-San is seeing something different? > The remaining issue is we don't get to the optimized path when a > standby makes the first call to smgrnblocks() when truncating a > relation. Still we can get to the optimized path as far as any > update(+insert) or select is performed earlier on the relation so I > think it doesn't matter so match. > +1. With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila wrote in
> On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi wrote:
> > We drop buffers for the old relfilenode on truncation anyway.
> >
> > What I did is:
> >
> > a: Create a physical replication pair.
> > b: On the master, create a table. (without explicitly starting a tx)
> > c: On the master, insert a tuple into the table.
> > d: On the master, truncate the table.
> >
> > On the standby, smgrnblocks is called for the old relfilenode of the
> > table at c, then the same function is called for the same relfilenode
> > at d and the function takes the cached path.
>
> This is on the lines I have tried for recovery. So, it seems we are in
> agreement that we can use the 'cached' flag in
> DropRelFileNodesAllBuffers and it will take the optimized path in many
> such cases, right?

Mmm. There seems to be a misunderstanding. What I opposed is referring only to InRecovery and ignoring the value of "cached". The remaining issue is that we don't get to the optimized path when a standby makes its first call to smgrnblocks() while truncating a relation. Still, we can get to the optimized path as long as any update(+insert) or select is performed on the relation earlier, so I think it doesn't matter so much. But I'm not sure what others think.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi wrote: > > I'm out of it more than usual.. > > At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila > > wrote in > > > On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com > > > wrote: > > > > > > > > On Friday, December 4, 2020 8:27 PM, Amit Kapila > > > > wrote: > > > > Hi, > > > > I have reported before that it is not always the case that the "cached" > > > > flag of > > > > srnblocks() return true. So when I checked the truncate test case used > > > > in my > > > > patch, it does not enter the optimization path despite doing INSERT > > > > before > > > > truncation of table. > > > > The reason for that is because in TRUNCATE, a new RelFileNode is > > > > assigned > > > > to the relation when creating a new file. In recovery, > > > > XLogReadBufferExtended() > > > > always opens the RelFileNode and calls smgrnblocks() for that > > > > RelFileNode for the > > > > first time. And for recovery processing, different RelFileNodes are > > > > used for the > > > > INSERTs to the table and TRUNCATE to the same table. > > > > > > > > > > Hmm, how is it possible if Insert is done before Truncate? The insert > > > should happen in old RelFileNode only. I have verified by adding a > > > break-in (while (1), so that it stops there) heap_xlog_insert and > > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. > > > How have you verified what you are saying? > > It's irrelvant that the insert happens on the old relfilenode. > I think it is relevant because it will allow the 'blocks' value to be cached. > We drop > buffers for the old relfilenode on truncation anyway. > > What I did is: > > a: Create a physical replication pair. > b: On the master, create a table. (without explicitly starting a tx) > c: On the master, insert a tuple into the table. > d: On the master truncate the table. 
> > On the standby, smgrnblocks is called for the old relfilenode of the > table at c, then the same function is called for the same relfilenode > at d and the function takes the cached path. > This is on the lines I have tried for recovery. So, it seems we are in agreement that we can use the 'cached' flag in DropRelFileNodesAllBuffers and it will take the optimized path in many such cases, right? -- With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
I'm out of it more than usual..

At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi wrote in
> At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila wrote in
> > On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com wrote:
> > >
> > > On Friday, December 4, 2020 8:27 PM, Amit Kapila wrote:
> > > Hi,
> > > I have reported before that it is not always the case that the "cached"
> > > flag of smgrnblocks() returns true. So when I checked the truncate test
> > > case used in my patch, it does not enter the optimization path despite
> > > doing INSERT before truncation of the table.
> > > The reason for that is that in TRUNCATE, a new RelFileNode is assigned
> > > to the relation when creating a new file. In recovery,
> > > XLogReadBufferExtended() always opens the RelFileNode and calls
> > > smgrnblocks() for that RelFileNode for the first time. And for recovery
> > > processing, different RelFileNodes are used for the INSERTs to the table
> > > and TRUNCATE to the same table.
> >
> > Hmm, how is it possible if Insert is done before Truncate? The insert
> > should happen in the old RelFileNode only. I have verified by adding a
> > break-in (while (1), so that it stops there) in heap_xlog_insert and
> > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > How have you verified what you are saying?

It's irrelevant that the insert happens on the old relfilenode. We drop buffers for the old relfilenode on truncation anyway.

What I did is:

a: Create a physical replication pair.
b: On the master, create a table. (without explicitly starting a tx)
c: On the master, insert a tuple into the table.
d: On the master, truncate the table.

On the standby, smgrnblocks is called for the old relfilenode of the table at c, then the same function is called for the same relfilenode at d and the function takes the cached path.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 8, 2020 at 6:23 AM Kyotaro Horiguchi wrote:
>
> At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi wrote in
> > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila wrote in
> > > Hmm, how is it possible if Insert is done before Truncate? The insert
> > > should happen in old RelFileNode only. I have verified by adding a
> > > break-in (while (1), so that it stops there) heap_xlog_insert and
> > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > > How have you verified what you are saying?
> >
> > You might be thinking of an in-transaction sequence of
> > Insert-truncate. What *I* mentioned before is truncation of a relation
> > that smgrnblocks() has already been called for. The most common way
> > to make it happen was INSERTs *before* the truncating transaction
> > starts.

What I have tried is Insert and Truncate in separate transactions like below:

postgres=# insert into mytbl values(1);
INSERT 0 1
postgres=# truncate mytbl;
TRUNCATE TABLE

After the above, I manually killed the server, and then during recovery we have called heap_xlog_insert() and DropRelFileNodesAllBuffers(), and at both places the RelFileNode is the same; I don't see any reason for it to be different.

> > It may be a SELECT on a hot-standby. Sorry for the confusing
> > expression.
>
> And, to make sure, it is a bit off from the point of the discussion as
> I noted. I just meant that the proposition that "smgrnblocks() always
> returns false for "cached" when it is called in
> DropRelFileNodesAllBuffers()" doesn't always hold.

Right, I feel in some cases the 'cached' flag won't be true, for example if we had done a Checkpoint after the Insert in the above case (say when the only WAL to replay during recovery is that of the Truncate), but I think that should be fine. What do you think?

--
With Regards,
Amit Kapila.
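The "cached" contract being debated here can be modeled in a few lines. This is a toy sketch with hypothetical names (ToySMgr, toy_smgrnblocks), not the real smgr code: the first call for a relation during recovery has to measure the file (lseek is simulated below) and reports cached=false, while later calls return the remembered value with cached=true. That is why, when the only WAL replayed for a relation is the truncate itself, the drop sees cached=false.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a relation's smgr state.  nblocks_on_disk plays the role
 * of what lseek(SEEK_END) would report; the other two fields model the
 * cached size.  All names are illustrative, not PostgreSQL's.
 */
typedef struct ToySMgr
{
    long    nblocks_on_disk;
    long    cached_nblocks;
    bool    have_cached;
} ToySMgr;

/*
 * Returns the relation size.  *cached tells whether the value was
 * already cached before this call (i.e. no "lseek" was needed), which
 * mirrors the semantics of the "cached" out-parameter discussed above.
 */
static long
toy_smgrnblocks(ToySMgr *reln, bool *cached)
{
    if (reln->have_cached)
    {
        *cached = true;
        return reln->cached_nblocks;
    }
    reln->cached_nblocks = reln->nblocks_on_disk;   /* simulated lseek */
    reln->have_cached = true;
    *cached = false;
    return reln->cached_nblocks;
}
```

Replaying an insert first primes the cache, so the later truncate's call sees cached=true; with nothing replayed beforehand (the checkpoint case above), the truncate's own call is the first one and sees cached=false.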
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi wrote in
> At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila wrote in
> > Hmm, how is it possible if Insert is done before Truncate? The insert
> > should happen in old RelFileNode only. I have verified by adding a
> > break-in (while (1), so that it stops there) heap_xlog_insert and
> > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > How have you verified what you are saying?
>
> You might be thinking of an in-transaction sequence of
> Insert-truncate. What *I* mentioned before is truncation of a relation
> that smgrnblocks() has already been called for. The most common way
> to make it happen was INSERTs *before* the truncating transaction
> starts. It may be a SELECT on a hot-standby. Sorry for the confusing
> expression.

And, to make sure, it is a bit off from the point of the discussion as I noted. I just meant that the proposition that "smgrnblocks() always returns false for "cached" when it is called in DropRelFileNodesAllBuffers()" doesn't always hold.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila wrote in
> On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com wrote:
> >
> > On Friday, December 4, 2020 8:27 PM, Amit Kapila wrote:
> > Hi,
> > I have reported before that it is not always the case that the "cached"
> > flag of smgrnblocks() returns true. So when I checked the truncate test
> > case used in my patch, it does not enter the optimization path despite
> > doing INSERT before truncation of the table.
> > The reason for that is that in TRUNCATE, a new RelFileNode is assigned
> > to the relation when creating a new file. In recovery,
> > XLogReadBufferExtended() always opens the RelFileNode and calls
> > smgrnblocks() for that RelFileNode for the first time. And for recovery
> > processing, different RelFileNodes are used for the INSERTs to the table
> > and TRUNCATE to the same table.
>
> Hmm, how is it possible if Insert is done before Truncate? The insert
> should happen in old RelFileNode only. I have verified by adding a
> break-in (while (1), so that it stops there) heap_xlog_insert and
> DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> How have you verified what you are saying?

You might be thinking of an in-transaction sequence of Insert-truncate. What *I* mentioned before is truncation of a relation that smgrnblocks() has already been called for. The most common way to make that happen was INSERTs *before* the truncating transaction starts. It may be a SELECT on a hot standby. Sorry for the confusing expression.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com wrote: > > On Friday, December 4, 2020 8:27 PM, Amit Kapila > wrote: > > On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi > > wrote: > > > > > > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" > > > wrote in > > > > > From: Kyotaro Horiguchi Hello, Kirk. > > > > > Thank you for the new version. > > > > > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > > > I'm updating the patches addressing those. > > > > > > > > > + if (!smgrexists(rels[i], j)) > > > > > + continue; > > > > > + > > > > > + /* Get the number of blocks for a relation's fork > > > > > */ > > > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > > > NULL); > > > > > > > > > > If we see a fork which its size is not cached we must give up this > > > > > optimization for all target relations. > > > > > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and > > > > use InRecovery when deciding for optimization because of the following > > reasons: > > > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to > > > > relation page contents. So in DropRelFileNodeBuffers(), > > > > XLogReadBufferExtended() is called during VACUUM replay because > > VACUUM changes the page content. > > > > OTOH, TRUNCATE doesn't change the relation content, it just > > > > truncates relation pages without changing the page contents. So > > > > XLogReadBufferExtended() is not called, and the "cached" flag will > > > > always return false. I tested with "cached" flags before, and this > > > > > > A bit different from the point, but if some tuples have been inserted > > > to the truncated table, XLogReadBufferExtended() is called for the > > > table and the length is cached. > > > > > > > always return false, at least in DropRelFileNodesAllBuffers. Due to > > > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers(). 
> > > > However, I think we can still rely on smgrnblocks to get the file > > > > size as long as we're InRecovery. That cached nblocks is still > > > > guaranteed > > to be the maximum in the shared buffer. > > > > Thoughts? > > > > > > That means that we always think as if smgrnblocks returns "cached" (or > > > "safe") value during recovery, which is out of our current consensus. > > > If we go on that side, we don't need to consult the "cached" returned > > > from smgrnblocks at all and it's enough to see only InRecovery. > > > > > > I got confused.. > > > > > > We are relying on the "fact" that the first lseek() call of a > > > (startup) process tells the truth. We added an assertion so that we > > > make sure that the cached value won't be cleared during recovery. A > > > possible remaining danger would be closing of an smgr object of a live > > > relation just after a file extension failure. I think we are thinking > > > that that doesn't happen during recovery. Although it seems to me > > > true, I'm not confident. > > > > > > > Yeah, I also think it might not be worth depending upon whether smgr close > > has been done before or not. I feel the current idea of using 'cached' > > parameter is relatively solid and we should rely on that. > > Also, which means that in DropRelFileNodesAllBuffers() we should rely on > > the same, I think doing things differently in this regard will lead to > > confusion. I > > agree in some cases we might not get benefits but it is more important to be > > correct and keep the code consistent to avoid introducing bugs now or in the > > future. > > > Hi, > I have reported before that it is not always the case that the "cached" flag > of > srnblocks() return true. So when I checked the truncate test case used in my > patch, it does not enter the optimization path despite doing INSERT before > truncation of table. > The reason for that is because in TRUNCATE, a new RelFileNode is assigned > to the relation when creating a new file. 
In recovery, > XLogReadBufferExtended() > always opens the RelFileNode and calls smgrnblocks() for that RelFileNode for > the > first time. And for recovery processing, different RelFileNodes are used for > the > INSERTs to the table and TRUNCATE to the same table. > Hmm, how is it possible if Insert is done before Truncate? The insert should happen in old RelFileNode only. I have verified by adding a break-in (while (1), so that it stops there) heap_xlog_insert and DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. How have you verified what you are saying? -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Friday, December 4, 2020 8:27 PM, Amit Kapila wrote: > On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi > wrote: > > > > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" > > wrote in > > > > From: Kyotaro Horiguchi Hello, Kirk. > > > > Thank you for the new version. > > > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > > I'm updating the patches addressing those. > > > > > > > + if (!smgrexists(rels[i], j)) > > > > + continue; > > > > + > > > > + /* Get the number of blocks for a relation's fork */ > > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > > NULL); > > > > > > > > If we see a fork which its size is not cached we must give up this > > > > optimization for all target relations. > > > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and > > > use InRecovery when deciding for optimization because of the following > reasons: > > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to > > > relation page contents. So in DropRelFileNodeBuffers(), > > > XLogReadBufferExtended() is called during VACUUM replay because > VACUUM changes the page content. > > > OTOH, TRUNCATE doesn't change the relation content, it just > > > truncates relation pages without changing the page contents. So > > > XLogReadBufferExtended() is not called, and the "cached" flag will > > > always return false. I tested with "cached" flags before, and this > > > > A bit different from the point, but if some tuples have been inserted > > to the truncated table, XLogReadBufferExtended() is called for the > > table and the length is cached. > > > > > always return false, at least in DropRelFileNodesAllBuffers. Due to > > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers(). > > > However, I think we can still rely on smgrnblocks to get the file > > > size as long as we're InRecovery. That cached nblocks is still guaranteed > to be the maximum in the shared buffer. > > > Thoughts? 
> > That means that we always think as if smgrnblocks returns the "cached" (or
> > "safe") value during recovery, which is out of our current consensus.
> > If we go on that side, we don't need to consult the "cached" returned
> > from smgrnblocks at all and it's enough to see only InRecovery.
> >
> > I got confused..
> >
> > We are relying on the "fact" that the first lseek() call of a
> > (startup) process tells the truth. We added an assertion so that we
> > make sure that the cached value won't be cleared during recovery. A
> > possible remaining danger would be closing of an smgr object of a live
> > relation just after a file extension failure. I think we are thinking
> > that that doesn't happen during recovery. Although it seems to me
> > true, I'm not confident.
>
> Yeah, I also think it might not be worth depending upon whether smgr close
> has been done before or not. I feel the current idea of using the 'cached'
> parameter is relatively solid and we should rely on that.
> Also, that means that in DropRelFileNodesAllBuffers() we should rely on
> the same; I think doing things differently in this regard will lead to
> confusion. I agree in some cases we might not get benefits, but it is more
> important to be correct and keep the code consistent to avoid introducing
> bugs now or in the future.

Hi,
I have reported before that it is not always the case that the "cached" flag of smgrnblocks() returns true. So when I checked the truncate test case used in my patch, it does not enter the optimization path despite doing INSERT before truncation of the table.
The reason is that in TRUNCATE, a new RelFileNode is assigned to the relation when creating a new file. In recovery, XLogReadBufferExtended() always opens the RelFileNode and calls smgrnblocks() for that RelFileNode for the first time. And for recovery processing, different RelFileNodes are used for the INSERTs to the table and the TRUNCATE of the same table.
As we cannot use the "cached" flag for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() based on the above, I am thinking that if we want consistency, correctness, and to still make use of the optimization, we can completely drop the "cached" flag parameter in smgrnblocks and use InRecovery. Tsunakawa-san mentioned in [1] that it is safe because smgrclose is not called by the startup process in recovery, and shared-inval messages are not sent to the startup process.
Otherwise, we use the current patch form as it is: "cached" in DropRelFileNodeBuffers() and InRecovery for DropRelFileNodesAllBuffers(). However, that does not seem to be what is wanted in this thread.
Thoughts?

Regards,
Kirk Jamison

[1] https://www.postgresql.org/message-id/TYAPR01MB2990B42570A5FAC349EE983AFEF40%40TYAPR01MB2990.jpnprd01.prod.outlook.com
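The two alternatives Kirk lays out can be written down as predicates. This is a simplified sketch, not the patch: the function names are invented, and "threshold" stands in for BUF_DROP_FULL_SCAN_THRESHOLD. The two guards differ exactly in the contested case from the earlier test: in recovery, with at least one relation size not cached.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Option A (hypothetical): drop the "cached" flag entirely and gate the
 * optimization on being in recovery plus the block-count threshold.
 */
static bool
optimize_by_inrecovery(bool in_recovery, bool all_sizes_cached,
                       long total_blocks, long threshold)
{
    (void) all_sizes_cached;    /* deliberately ignored in this option */
    return in_recovery && total_blocks < threshold;
}

/*
 * Option B (hypothetical): additionally require that every relation's
 * size came back cached from smgrnblocks().
 */
static bool
optimize_by_cached(bool in_recovery, bool all_sizes_cached,
                   long total_blocks, long threshold)
{
    return in_recovery && all_sizes_cached && total_blocks < threshold;
}
```

When a size is not cached (second argument false), option A still takes the optimized path while option B falls back to the full scan, which is the trade-off the thread is weighing.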
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi wrote: > > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" > wrote in > > > From: Kyotaro Horiguchi > > > Hello, Kirk. Thank you for the new version. > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > I'm updating the patches addressing those. > > > > > + if (!smgrexists(rels[i], j)) > > > + continue; > > > + > > > + /* Get the number of blocks for a relation's fork */ > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > NULL); > > > > > > If we see a fork which its size is not cached we must give up this > > > optimization > > > for all target relations. > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and use > > InRecovery > > when deciding for optimization because of the following reasons: > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation > > page > > contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called > > during VACUUM replay because VACUUM changes the page content. > > OTOH, TRUNCATE doesn't change the relation content, it just truncates > > relation pages > > without changing the page contents. So XLogReadBufferExtended() is not > > called, and > > the "cached" flag will always return false. I tested with "cached" flags > > before, and this > > A bit different from the point, but if some tuples have been inserted > to the truncated table, XLogReadBufferExtended() is called for the > table and the length is cached. > > > always return false, at least in DropRelFileNodesAllBuffers. Due to this, > > we cannot use > > the cached flag in DropRelFileNodesAllBuffers(). However, I think we can > > still rely on > > smgrnblocks to get the file size as long as we're InRecovery. That cached > > nblocks is still > > guaranteed to be the maximum in the shared buffer. > > Thoughts? 
> > That means that we always think as if smgrnblocks returns "cached" (or > "safe") value during recovery, which is out of our current > consensus. If we go on that side, we don't need to consult the > "cached" returned from smgrnblocks at all and it's enough to see only > InRecovery. > > I got confused.. > > We are relying on the "fact" that the first lseek() call of a > (startup) process tells the truth. We added an assertion so that we > make sure that the cached value won't be cleared during recovery. A > possible remaining danger would be closing of an smgr object of a live > relation just after a file extension failure. I think we are thinking > that that doesn't happen during recovery. Although it seems to me > true, I'm not confident. > Yeah, I also think it might not be worth depending upon whether smgr close has been done before or not. I feel the current idea of using 'cached' parameter is relatively solid and we should rely on that. Also, which means that in DropRelFileNodesAllBuffers() we should rely on the same, I think doing things differently in this regard will lead to confusion. I agree in some cases we might not get benefits but it is more important to be correct and keep the code consistent to avoid introducing bugs now or in the future. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Friday, December 4, 2020 12:42 PM, Tang, Haiying wrote:
> Hello, Kirk
>
> Thanks for providing the new patches.
> I did the recovery performance test on them, the results look good. I'd like to
> share them with you and everyone else.
> (I also record VACUUM and TRUNCATE execution time on master/primary in
> case you want to have a look.)

Hi, Tang. Thank you very much for verifying the performance using the latest set of patches. Although it's not supposed to affect the non-recovery path (execution on primary), it's good to see those results too.

> 1. VACUUM and Failover test results (average of 15 times)
>
> [VACUUM] --- execution time on master/primary
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> ------------------------------------------------------------------------
> 128M                  9.440         9.483    0%
> 10G                  74.689        76.219    2%
> 20G                 152.538       138.292   -9%
>
> [Failover] --- execution time on standby
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> ------------------------------------------------------------------------
> 128M                  3.629         2.961  -18%
> 10G                  82.443         2.627  -97%
> 20G                 171.388         2.607  -98%
>
> 2. TRUNCATE and Failover test results (average of 15 times)
>
> [TRUNCATE] --- execution time on master/primary
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> ------------------------------------------------------------------------
> 128M                 49.271        49.867    1%
> 10G                 172.437       175.197    2%
> 20G                 279.658       278.752    0%
>
> [Failover] --- execution time on standby
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> ------------------------------------------------------------------------
> 128M                  4.877         3.989  -18%
> 10G                  92.680         3.975  -96%
> 20G                 182.035         3.962  -98%
>
> [Machine spec]
> CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
> Memory: 64G
> OS: CentOS 8
>
> [Failover test data]
> Total table size: 700M
> Table: 1 tables (1000 rows per table)
>
> If you have question on my test, please let me know.

Looks great. That was helpful to see whether there were any performance differences from the previous versions' results. But I am glad it turned out great too.

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
At Thu, 3 Dec 2020 07:18:16 +, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Jamison, Kirk/ジャミソン カーク
> > Apologies for the delay, but attached are the updated versions to simplify
> > the patches.
>
> Looks good to me. Thanks to Horiguchi-san and Andres-san, the code became
> more compact and easier to read. I've marked this ready for committer.
>
> To the committer:
> I don't think it's necessary to refer to COMMIT/ROLLBACK PREPARED in the
> following part of the 0003 commit message. They surely call
> DropRelFileNodesAllBuffers(), but COMMIT/ROLLBACK also call it.
>
> the full scan threshold. This improves the DropRelationFiles()
> performance when the TRUNCATE command truncated off any of the empty
> pages at the end of relation, and when dropping relation buffers if a
> commit/rollback transaction has been prepared in FinishPreparedTransaction().

I think whether we can use this optimization only by looking at InRecovery is still in doubt. Or, if we can decide that on that criterion, 0003 can also be simplified using the same assumption.

Separate from the maybe-remaining discussion, I have a comment on the revised code in 0004.

+        * equal to the full scan threshold.
+        */
+       if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
+       {
+               pfree(block);
+               goto buffer_full_scan;
+       }

I don't particularly hate goto statements, but we can easily avoid this one by reversing the condition here. You might be worried about the length of the line calling "FindAndDropRelFileNodeBuffers", but the indentation can be lowered by inverting the condition on BlockNumberIsValid.

!|      if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
 |      {
 |              for (i = 0; i < n; i++)
 |              {
 |                      /*
 |                       * If block to drop is valid, drop the buffers of the fork.
 |                       * Zero the firstDelBlock because all buffers will be
 |                       * dropped anyway.
 |                       */
 |                      for (j = 0; j <= MAX_FORKNUM; j++)
 |                      {
!|                              if (!BlockNumberIsValid(block[i][j]))
!|                                      continue;
 |
 |                              FindAndDropRelFileNodeBuffers(smgr_reln[i]->smgr_rnode.node,
 |                                                            j, block[i][j], 0);
 |                      }
 |              }
 |              pfree(block);
 |              return;
 |      }
 |
 |      pfree(block);

Or we can separate the calculation part and the execution part by introducing a flag "do_fullscan".

 |      /*
 |       * We enter the optimization iff we are in recovery. Otherwise,
 |       * we proceed to full scan of the whole buffer pool.
 |       */
 |      if (InRecovery)
 |      {
...
!|              if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
!|                      do_fullscan = false;
!|      }
!|
!|      if (!do_fullscan)
!|      {
 |              for (i = 0; i < n; i++)
 |              {
 |                      /*
 |                       * If block to drop is valid, drop the buffers of the fork.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
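Horiguchi-san's point generalizes: an early goto to a fallback label can usually be replaced by inverting the condition, with identical behaviour. Here is a stand-alone sketch of the transformation; the helpers, threshold, and counters are hypothetical and only exist to make both shapes observable, they are not the patch code.

```c
#include <assert.h>

#define THRESHOLD 8     /* stand-in for BUF_DROP_FULL_SCAN_THRESHOLD */

/* Counters so we can observe which path each variant took. */
static int dropped_by_lookup;
static int dropped_by_fullscan;

static void drop_one(long block)  { if (block >= 0) dropped_by_lookup++; }
static void full_scan(int n)      { dropped_by_fullscan += n; }

/* Variant 1: the goto-based bail-out, mirroring the original 0004 shape. */
static void
drop_with_goto(const long *blocks, int n, long total)
{
    if (total >= THRESHOLD)
        goto buffer_full_scan;
    for (int i = 0; i < n; i++)
        drop_one(blocks[i]);
    return;

buffer_full_scan:
    full_scan(n);
}

/* Variant 2: same behaviour, condition inverted, no goto. */
static void
drop_without_goto(const long *blocks, int n, long total)
{
    if (total < THRESHOLD)
    {
        for (int i = 0; i < n; i++)
            drop_one(blocks[i]);
        return;
    }
    full_scan(n);
}
```

Both variants take the lookup path for small totals and the full scan otherwise; the inverted form simply trades the label for one extra indentation level, which is the trade-off discussed above.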
Re: [Patch] Optimize dropping of relation buffers using dlist
Thanks for the new version. This contains only replies. I'll send some further comments in another mail later. At Thu, 3 Dec 2020 03:49:27 +, "k.jami...@fujitsu.com" wrote in > On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote: > > Hello, Kirk. Thank you for the new version. > > Apologies for the delay, but attached are the updated versions to simplify > the patches. > The changes reflected most of your comments/suggestions. > > Summary of changes in the latest versions. > 1. Updated the function description of DropRelFileNodeBuffers in 0003. > 2. Updated the commit logs of 0003 and 0004. > 3, FindAndDropRelFileNodeBuffers is now called for each relation fork, > instead of for all involved forks. > 4. Removed the unnecessary palloc() and subscripts like forks[][], >firstDelBlock[], nforks, as advised by Horiguchi-san. The memory >allocation for block[][] was also simplified. >So 0004 became simpler and more readable. ... > > > a reliable size of nblocks for supplied relation's fork at that time, > > > and it's safe because DropRelFileNodeBuffers() relies on the behavior > > > that cached nblocks will not be invalidated by file extension during > > > recovery. Otherwise, or if not in recovery, proceed to sequential > > > search of the whole buffer pool. > > > > This sentence seems involving confusion. It reads as if "we can rely on it > > because we're relying on it". And "the cached value won't be invalidated" > > doesn't explain the reason precisely. The reason I think is that the cached > > value is guaranteed to be the maximum page we have in shared buffer at least > > while recovery, and that guarantee is holded by not asking fseek once we > > cached the value. > > Fixed the commit log of 0003. Thanks! ... 
> > + nforks = palloc(sizeof(int) * n); > > + forks = palloc(sizeof(ForkNumber *) * n); > > + blocks = palloc(sizeof(BlockNumber *) * n); > > + firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM > > + 1)); > > + for (i = 0; i < n; i++) > > + { > > + forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + > > 1)); > > + blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM > > + 1)); > > + } > > > > We can allocate the whole array at once like this. > > > > BlockNumber (*blocks)[MAX_FORKNUM+1] = > > (BlockNumber (*)[MAX_FORKNUM+1]) > > palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1)) > > Thank you for suggesting to reduce the lines for the 2d dynamic memory alloc. > I followed this way in 0004, but it's my first time to see it written this > way. > I am very glad it works, though is it okay to write it this way since I > cannot find > a similar code of declaring and allocating 2D arrays like this in Postgres > source code? Actually it would be somewhat novel for a certain portion of people, but it is fundamentally the same with function pointers. Hard to make it from scratch, but I suppose not so hard to read:) int (*func_char_to_int)(char x) = some_func; FWIW isn.c has the following part: > static bool > check_table(const char *(*TABLE)[2], const unsigned TABLE_index[10][2]) > > + nBlocksToInvalidate += blocks[i][numForks]; > > + > > + forks[i][numForks++] = j; > > > > We can signal to the later code the absense of a fork by setting > > InvalidBlockNumber to blocks. Thus forks[], nforks and numForks can be > > removed. > > Followed it in 0004. Looks fine to me, thanks. > > + /* Zero the array of blocks because these will all be dropped anyway > > */ > > + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * > > (MAX_FORKNUM + > > +1)); > > > > We don't need to prepare nforks, forks and firstDelBlocks for all relations > > before looping over relations. In other words, we can fill in the arrays > > for a > > relation at every iteration of relations. 
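The pointer-to-array allocation being discussed can be tried outside PostgreSQL. In this sketch, malloc stands in for palloc and the MAX_FORKNUM value is an illustrative stand-in: one allocation provides n rows of (MAX_FORKNUM + 1) entries, and because the pointer's type is "pointer to array", blocks[i][j] indexes it like a true 2-D array.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_FORKNUM 3           /* illustrative stand-in value */
typedef unsigned BlockNumber;

/*
 * Allocates the whole 2-D block array in one call (the pattern quoted
 * above, with malloc in place of palloc), fills it with recognizable
 * values, and returns blocks[i][j] to show that the indexing works.
 */
static BlockNumber
demo_block_array(int n, int i, int j)
{
    BlockNumber (*blocks)[MAX_FORKNUM + 1] =
        malloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
    BlockNumber result;

    for (int r = 0; r < n; r++)
        for (int c = 0; c <= MAX_FORKNUM; c++)
            blocks[r][c] = (BlockNumber) (r * 100 + c);

    result = blocks[i][j];
    free(blocks);
    return result;
}
```

As Horiguchi-san notes, the declarator reads the same way a function-pointer declaration does: `blocks` is a pointer to arrays of MAX_FORKNUM + 1 BlockNumbers, so row arithmetic is done by the compiler rather than through a separate array of row pointers.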
> > Followed your advice. Although I now drop the buffers per fork, which now > removes forks[][], nforks, firstDelBlocks[]. That's fine for me. > > +* We enter the optimization iff we are in recovery and the number of > > +blocks to > > > > This comment ticks out of 80 columns. (I'm not sure whether that convention > > is still valid..) > > Fixed. > > > + if (InRecovery && nBlocksToInvalidate < > > BUF_DROP_FULL_SCAN_THRESHOLD) > > > > We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't do > > that. > > > As for DropRelFileNodesAllBuffers use case, I used InRecovery > so that the optimization still works. > Horiguchi-san also wrote in another mail: > > A bit different from the point, but if some tuples have been inserted to the > > truncated table, XLogReadBufferExtended() is called for the table and the > > length is cached. > I was wrong in my previous claim that the "cached" value always return false. > When I checked the recovery test log from recovery tap test, there was only > one example when "cached" became true (script below) and entered
RE: [Patch] Optimize dropping of relation buffers using dlist
Hello, Kirk

Thanks for providing the new patches. I did the recovery performance test on them; the results look good. I'd like to share them with you and everyone else. (I also recorded the VACUUM and TRUNCATE execution times on master/primary in case you want to have a look.)

1. VACUUM and Failover test results (average of 15 runs)

[VACUUM] --- execution time on master/primary
shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
----------------------------------------------------------------------------
128M                   9.440          9.483     0%
10G                   74.689         76.219     2%
20G                  152.538        138.292    -9%

[Failover] --- execution time on standby
shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
----------------------------------------------------------------------------
128M                   3.629          2.961   -18%
10G                   82.443          2.627   -97%
20G                  171.388          2.607   -98%

2. TRUNCATE and Failover test results (average of 15 runs)

[TRUNCATE] --- execution time on master/primary
shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
----------------------------------------------------------------------------
128M                  49.271         49.867     1%
10G                  172.437        175.197     2%
20G                  279.658        278.752     0%

[Failover] --- execution time on standby
shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
----------------------------------------------------------------------------
128M                   4.877          3.989   -18%
10G                   92.680          3.975   -96%
20G                  182.035          3.962   -98%

[Machine spec]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 64G
OS: CentOS 8

[Failover test data]
Total table size: 700M
Table: 1 tables (1000 rows per table)

If you have questions about my test, please let me know.

Regards,
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> Apologies for the delay, but attached are the updated versions to simplify the
> patches.

Looks good to me. Thanks to Horiguchi-san and Andres-san, the code became even more compact and easier to read. I've marked this ready for committer.

To the committer: I don't think it's necessary to refer to COMMIT/ROLLBACK PREPARED in the following part of the 0003 commit message. They surely call DropRelFileNodesAllBuffers(), but COMMIT/ROLLBACK also call it.

    the full scan threshold. This improves the DropRelationFiles()
    performance when the TRUNCATE command truncated off any of the empty
    pages at the end of relation, and when dropping relation buffers if a
    commit/rollback transaction has been prepared in FinishPreparedTransaction().

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote:
> Hello, Kirk. Thank you for the new version.

Apologies for the delay, but attached are the updated versions to simplify the patches. The changes reflect most of your comments/suggestions. Summary of changes in the latest versions:

1. Updated the function description of DropRelFileNodeBuffers in 0003.
2. Updated the commit logs of 0003 and 0004.
3. FindAndDropRelFileNodeBuffers is now called for each relation fork, instead of for all involved forks.
4. Removed the unnecessary palloc() and subscripts like forks[][], firstDelBlock[], nforks, as advised by Horiguchi-san. The memory allocation for block[][] was also simplified, so 0004 became simpler and more readable.

> At Thu, 26 Nov 2020 03:04:10 +, "k.jami...@fujitsu.com" wrote in
> > On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > > From: Andres Freund
> > > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > > which 3/4 doesn't address and 4/4 doesn't mention.
> > > >
> > > > 4/4 seems to address DropRelationFiles(), but only talks about
> > > > TRUNCATE?
> > >
> > > Yes. DropRelationFiles() is used in the following two paths:
> > >
> > > [Replay of TRUNCATE during recovery]
> > > xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall() ->
> > > DropRelFileNodesAllBuffers()
> > >
> > > [COMMIT/ROLLBACK PREPARED]
> > > FinishPreparedTransaction() -> DropRelationFiles() -> smgrdounlinkall()
> > > -> DropRelFileNodesAllBuffers()
> >
> > Yes. The concern is that it was not clear in the function descriptions
> > and commit logs what the optimizations for DropRelFileNodeBuffers()
> > and DropRelFileNodesAllBuffers() are for. So I revised only the function
> > description of DropRelFileNodeBuffers() and the commit logs of the
> > 0003-0004 patches. Please check if the brief descriptions would suffice.
> I read the commit message of 3/4. (Though this is not involved literally in
> the final commit.)
>
> > While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum
> > or autovacuum are replayed, the buffers are dropped when the sizes of all
> > involved forks of a relation are already "cached". We can get
>
> This sentence seems to be missing "dropped by (or using) what".
>
> > a reliable size of nblocks for supplied relation's fork at that time,
> > and it's safe because DropRelFileNodeBuffers() relies on the behavior
> > that cached nblocks will not be invalidated by file extension during
> > recovery. Otherwise, or if not in recovery, proceed to sequential
> > search of the whole buffer pool.
>
> This sentence is confusing. It reads as if "we can rely on it
> because we're relying on it". And "the cached value won't be invalidated"
> doesn't explain the reason precisely. The reason, I think, is that the cached
> value is guaranteed to be the maximum page we have in shared buffers at least
> while in recovery, and that guarantee is held by not asking fseek once we
> have cached the value.

Fixed the commit log of 0003.

> > > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > > DropRelFileNodesAllBuffers() can be used for many relations at
> > > > once, this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> > > > lookups a lot of times, once for each of nnodes relations?
> > >
> > > So, the threshold value should be compared with the total number of
> > > blocks of all target relations, not each relation. You seem to be right,
> > > got it.
> >
> > Fixed this in 0004 patch. Now we compare the total number of
> > buffers-to-be-invalidated for ALL relations to the
> > BUF_DROP_FULL_SCAN_THRESHOLD.
> I didn't see the previous version, but the row of additional palloc/pfree's in
> this version looks uneasy.

Fixed this too.

> int i,
> + j,
> + *nforks,
> n = 0;
>
> Perhaps I think we don't define variables of different types at once.
> (I'm not sure about defining multiple variables at once.)

Fixed this too.

> @@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
>   DropRelFileNodeAllLocalBuffers(rnodes[i].node);
>   }
>   else
> + {
> + rels[n] = smgr_reln[i];
>   nodes[n++] = rnodes[i].node;
> + }
>   }
>
> We don't need to remember nodes and rnodes here since rnodes[n] is
> rels[n]->smgr_rnode here. Or we don't even need to store rels since we can
> scan the smgr_reln later again.
>
> nodes is needed in the full-scan path but it is enough to collect it after
> finding that we do full-scan.

I
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi
> We are relying on the "fact" that the first lseek() call of a
> (startup) process tells the truth. We added an assertion so that we
> make sure that the cached value won't be cleared during recovery. A
> possible remaining danger would be the closing of an smgr object of a live
> relation just after a file extension failure. I think we are assuming
> that doesn't happen during recovery. Although it seems true to me,
> I'm not confident.
>
> If that's true, we don't even need to look at the "cached" flag at all
> and can always rely on the returned value from smgrnblocks()
> during recovery. Otherwise, we need to avoid the danger situation.

Hmm, I've gotten to think that smgrnblocks() doesn't need the cached parameter, too. DropRel*Buffers() can just check InRecovery. Regarding the only concern about smgrclose() by the startup process, I was afraid of cache invalidation by CacheInvalidateSmgr(), but the startup process doesn't receive shared inval messages, so it doesn't call smgrclose*() due to shared cache invalidation.

[InitRecoveryTransactionEnvironment()]
    /*
     * Initialize shared invalidation management for Startup process, being
     * careful to register ourselves as a sendOnly process so we don't need to
     * read messages, nor will we get signaled when the queue starts filling
     * up.
     */
    SharedInvalBackendInit(true);

Kirk-san,
Can you check to see if smgrclose() and its friends are not called during recovery using the regression test?

Regards
Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
Hi! I found that this patch is marked RFC in the commitfest application. I quickly checked whether it's really ready for commit. It seems there are still unaddressed review notes, so I'm going to switch it to WFA.

--
Regards,
Alexander Korotkov
Re: [Patch] Optimize dropping of relation buffers using dlist
At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" wrote in
> > From: Kyotaro Horiguchi
> > Hello, Kirk. Thank you for the new version.
>
> Hi, Horiguchi-san. Thank you for your very helpful feedback.
> I'm updating the patches addressing those.
>
> > + if (!smgrexists(rels[i], j))
> > + continue;
> > +
> > + /* Get the number of blocks for a relation's fork */
> > + blocks[i][numForks] = smgrnblocks(rels[i], j, NULL);
> >
> > If we see a fork whose size is not cached we must give up this optimization
> > for all target relations.
>
> I did not use the "cached" flag in DropRelFileNodesAllBuffers and used
> InRecovery when deciding for optimization because of the following reasons:
> XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation page
> contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called
> during VACUUM replay because VACUUM changes the page content.
> OTOH, TRUNCATE doesn't change the relation content, it just truncates
> relation pages without changing the page contents. So XLogReadBufferExtended()
> is not called, and the "cached" flag will always return false. I tested with
> "cached" flags before, and this

A bit different from the point, but if some tuples have been inserted to the truncated table, XLogReadBufferExtended() is called for the table and the length is cached.

> always returns false, at least in DropRelFileNodesAllBuffers. Due to this, we
> cannot use the cached flag in DropRelFileNodesAllBuffers(). However, I think
> we can still rely on smgrnblocks to get the file size as long as we're
> InRecovery. That cached nblocks is still guaranteed to be the maximum in the
> shared buffer.
> Thoughts?

That means that we always think as if smgrnblocks returns a "cached" (or "safe") value during recovery, which is out of our current consensus. If we go on that side, we don't need to consult the "cached" flag returned from smgrnblocks at all and it's enough to see only InRecovery.
I got confused.. We are relying on the "fact" that the first lseek() call of a (startup) process tells the truth. We added an assertion so that we make sure that the cached value won't be cleared during recovery. A possible remaining danger would be the closing of an smgr object of a live relation just after a file extension failure. I think we are assuming that doesn't happen during recovery. Although it seems true to me, I'm not confident.

If that's true, we don't even need to look at the "cached" flag at all and can always rely on the returned value from smgrnblocks() during recovery. Otherwise, we need to avoid the danger situation.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
> From: Kyotaro Horiguchi
> Hello, Kirk. Thank you for the new version.

Hi, Horiguchi-san. Thank you for your very helpful feedback. I'm updating the patches addressing those.

> + if (!smgrexists(rels[i], j))
> + continue;
> +
> + /* Get the number of blocks for a relation's fork */
> + blocks[i][numForks] = smgrnblocks(rels[i], j, NULL);
>
> If we see a fork whose size is not cached we must give up this optimization
> for all target relations.

I did not use the "cached" flag in DropRelFileNodesAllBuffers and used InRecovery when deciding for optimization because of the following reasons: XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation page contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called during VACUUM replay because VACUUM changes the page content. OTOH, TRUNCATE doesn't change the relation content, it just truncates relation pages without changing the page contents. So XLogReadBufferExtended() is not called, and the "cached" flag will always return false. I tested with "cached" flags before, and this always returns false, at least in DropRelFileNodesAllBuffers. Due to this, we cannot use the cached flag in DropRelFileNodesAllBuffers(). However, I think we can still rely on smgrnblocks to get the file size as long as we're InRecovery. That cached nblocks is still guaranteed to be the maximum in the shared buffer. Thoughts?

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
At Thu, 26 Nov 2020 16:18:55 +0900 (JST), Kyotaro Horiguchi wrote in
> + /* Zero the array of blocks because these will all be dropped anyway */
> + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
>
> We don't need to prepare nforks, forks and firstDelBlocks for all
> relations before looping over relations. In other words, we can fill
> in the arrays for a relation at every iteration of relations.

Or we could even call FindAndDropRelFileNodeBuffers() for each fork. It doesn't matter from a performance perspective whether the function loops over forks or is called once per fork.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
Hello, Kirk. Thank you for the new version.

At Thu, 26 Nov 2020 03:04:10 +, "k.jami...@fujitsu.com" wrote in
> On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > From: Andres Freund
> > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > which 3/4 doesn't address and 4/4 doesn't mention.
> > >
> > > 4/4 seems to address DropRelationFiles(), but only talks about
> > > TRUNCATE?
> >
> > Yes. DropRelationFiles() is used in the following two paths:
> >
> > [Replay of TRUNCATE during recovery]
> > xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall() ->
> > DropRelFileNodesAllBuffers()
> >
> > [COMMIT/ROLLBACK PREPARED]
> > FinishPreparedTransaction() -> DropRelationFiles() -> smgrdounlinkall()
> > -> DropRelFileNodesAllBuffers()
>
> Yes. The concern is that it was not clear in the function descriptions and
> commit logs what the optimizations for DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() are for. So I revised only the function
> description of DropRelFileNodeBuffers() and the commit logs of the 0003-0004
> patches. Please check if the brief descriptions would suffice.

I read the commit message of 3/4. (Though this is not involved literally in the final commit.)

> While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum
> or autovacuum are replayed, the buffers are dropped when the sizes
> of all involved forks of a relation are already "cached". We can get

This sentence seems to be missing "dropped by (or using) what".

> a reliable size of nblocks for supplied relation's fork at that time,
> and it's safe because DropRelFileNodeBuffers() relies on the behavior
> that cached nblocks will not be invalidated by file extension during
> recovery. Otherwise, or if not in recovery, proceed to sequential
> search of the whole buffer pool.

This sentence is confusing.
It reads as if "we can rely on it because we're relying on it". And "the cached value won't be invalidated" doesn't explain the reason precisely. The reason, I think, is that the cached value is guaranteed to be the maximum page we have in shared buffers at least while in recovery, and that guarantee is held by not asking fseek once we have cached the value.

> > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1 lookups a lot
> > > of times, once for each of nnodes relations?
> >
> > So, the threshold value should be compared with the total number of blocks
> > of all target relations, not each relation. You seem to be right, got it.
>
> Fixed this in 0004 patch. Now we compare the total number of
> buffers-to-be-invalidated for ALL relations to the
> BUF_DROP_FULL_SCAN_THRESHOLD.

I didn't see the previous version, but the row of additional palloc/pfree's in this version looks uneasy.

    int i,
    + j,
    + *nforks,
    n = 0;

Perhaps I think we don't define variables of different types at once. (I'm not sure about defining multiple variables at once.)

    @@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      DropRelFileNodeAllLocalBuffers(rnodes[i].node);
      }
      else
    + {
    + rels[n] = smgr_reln[i];
      nodes[n++] = rnodes[i].node;
    + }
    }

We don't need to remember nodes and rnodes here since rnodes[n] is rels[n]->smgr_rnode here. Or we don't even need to store rels since we can scan smgr_reln again later.

nodes is needed in the full-scan path, but it is enough to collect it after finding that we do a full scan.
    /*
    @@ -3120,6 +3138,68 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      if (n == 0)
      {
        pfree(nodes);
    +   pfree(rels);
    +   pfree(rnodes);
    +   return;
    + }
    +
    + nforks = palloc(sizeof(int) * n);
    + forks = palloc(sizeof(ForkNumber *) * n);
    + blocks = palloc(sizeof(BlockNumber *) * n);
    + firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
    + for (i = 0; i < n; i++)
    + {
    +   forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + 1));
    +   blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM + 1));
    + }

We can allocate the whole array at once like this.

    BlockNumber (*blocks)[MAX_FORKNUM+1] =
        (BlockNumber (*)[MAX_FORKNUM+1])
        palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1))

The elements of forks[][] and blocks[][]