Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Mar 12, 2021 at 12:07 PM tsunakawa.ta...@fujitsu.com wrote: > > From: Kyotaro Horiguchi > > About the patch, it would be better to change the type of > > BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current > > value > > doesn't harm. > > OK, attached, to be prepared for the distant future when NBuffers becomes > 64-bit. > Thanks for the patch. Pushed! -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
> From: Tsunakawa, Takayuki/綱川 貴之 > From: Kyotaro Horiguchi > > About the patch, it would be better to change the type of > > BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current > value > > doesn't harm. > > OK, attached, to be prepared for the distant future when NBuffers becomes > 64-bit. Thank you, Tsunakawa-san, for sending the quick fix. (I failed to notice it on my thread.) Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > About the patch, it would be better to change the type of > BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current > value > doesn't harm. OK, attached, to be prepared for the distant future when NBuffers becomes 64-bit. Regards Takayuki Tsunakawa v2-0001-Fix-overflow-when-counting-the-number-of-buffers-.patch
Re: [Patch] Optimize dropping of relation buffers using dlist
At Fri, 12 Mar 2021 05:26:02 +0000, "tsunakawa.ta...@fujitsu.com" wrote in > From: Thomas Munro > > > uint64 > > > > +1 > > Thank you, the patch is attached (we tend to forget how large our world is... > 64-bit) We're sorry to cause you trouble. BUF_DROP_FULL_SCAN_THRESHOLD cannot be larger than an int since NBuffers is an int, but nBlocksToInvalidate being uint32 looks somewhat too tight. So +1 for changing it to uint64. We need to fill the whole block[file][fork] array in DropRelFileNodesAllBuffers, so we cannot bail out of the counting loop. We could bail out in DropRelFileNodeBuffers, but that doesn't seem so effective. So I vote for uint64 and not bailing out. About the patch, it would be better to change the type of BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current value doesn't harm. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Thomas Munro > > uint64 > > +1 Thank you, the patch is attached (we tend to forget how large our world is... 64-bit) We're sorry to cause you trouble. Regards Takayuki Tsunakawa v1-0001-Fix-overflow-when-counting-the-number-of-buffers-.patch
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Thomas Munro > On Fri, Mar 12, 2021 at 5:20 PM Amit Kapila wrote: > > uint64 > > +1 +1 I'll send a patch later. Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Mar 12, 2021 at 5:20 PM Amit Kapila wrote: > uint64 +1
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Mar 12, 2021 at 4:58 AM Thomas Munro wrote: > > While rebasing CF #2933 (which drops the _cached stuff and makes this > optimisation always available, woo), I happened to notice that we're > summing the size of many relations and forks into a variable > nBlocksToInvalidate of type BlockNumber. That could overflow. > I also think so. I think we have two ways to address that: (a) check immediately after each time we add blocks to nBlocksToInvalidate to see if it crosses the threshold value BUF_DROP_FULL_SCAN_THRESHOLD and if so, then just break the loop; (b) change the variable type to uint64. Any better ideas? -- With Regards, Amit Kapila.
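Both remedies can be sketched in isolation. In this standalone sketch, the typedef and the helper names (sum_blocks32, sum_blocks64, exceeds_threshold) are illustrative stand-ins, not the actual bufmgr.c code: option (b) widens the accumulator, while option (a) stops counting once the threshold is crossed, since the exact total no longer matters at that point.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the PostgreSQL typedef. */
typedef uint32_t BlockNumber;

/* The buggy pattern: the running total is itself a BlockNumber,
 * so it silently wraps modulo 2^32. */
static BlockNumber sum_blocks32(const BlockNumber *sizes, int n)
{
    BlockNumber total = 0;
    for (int i = 0; i < n; i++)
        total += sizes[i];
    return total;
}

/* Option (b): widen the accumulator to uint64 so that summing the
 * same per-fork sizes cannot wrap for any realistic relation count. */
static uint64_t sum_blocks64(const BlockNumber *sizes, int n)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++)
        total += sizes[i];
    return total;
}

/* Option (a): bail out of the counting loop as soon as the threshold
 * is crossed -- past that point a full scan happens anyway. */
static int exceeds_threshold(const BlockNumber *sizes, int n, uint64_t threshold)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++)
    {
        total += sizes[i];
        if (total >= threshold)
            return 1;
    }
    return 0;
}
```

With 8kB blocks, a uint32 block count wraps once the summed sizes exceed 2^32 blocks (about 32TB of relation data across the dropped relations), which is why the thread settled on the uint64 accumulator.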
Re: [Patch] Optimize dropping of relation buffers using dlist
While rebasing CF #2933 (which drops the _cached stuff and makes this optimisation always available, woo), I happened to notice that we're summing the size of many relations and forks into a variable nBlocksToInvalidate of type BlockNumber. That could overflow.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wed, January 13, 2021 2:15 PM (JST), Amit Kapila wrote: > On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi > wrote: > > > > At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila > > wrote in > > > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi > > > wrote: > > > > > > > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" > wrote in: > > > > > > Thanks for the detailed tests. NBuffers/32 seems like an > > > > > > appropriate value for the threshold based on these results. I > > > > > > would like to slightly modify part of the commit message in > > > > > > the first patch as below [1], otherwise, I am fine with the > > > > > > changes. Unless you or anyone else has any more comments, I am > > > > > > planning to push the 0001 and 0002 sometime next week. > > > > > > > > > > > > [1] > > > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so > > > > > > that scanning of the whole buffer pool can be avoided when the > > > > > > number of blocks to be truncated in a relation is below a > > > > > > certain threshold. For such cases, we find the buffers by doing > lookups in BufMapping table. > > > > > > This improves the performance by more than 100 times in many > > > > > > cases when several small tables (tested with 1000 relations) > > > > > > are truncated and where the server is configured with a large > > > > > > value of shared buffers (greater than 100GB)." > > > > > > > > > > Thank you for taking a look at the results of the tests. And > > > > > it's also consistent with the results from Tang too. > > > > > The commit message LGTM. > > > > +1. > > > > > > I have pushed the 0001. > > Thank you for commiting this. > > Pushed 0002 as well. > Thank you very much for committing those two patches, and for everyone here who contributed in simplifying the approaches, code reviews, testing, etc. I compiled with --enable-coverage and checked whether the newly-added code and updated parts were covered by tests.
Yes, the lines were hit, including the updated lines of DropRelFileNodeBuffers(), DropRelFileNodesAllBuffers(), smgrdounlinkall(), and smgrnblocks(). The newly added APIs were covered too: FindAndDropRelFileNodeBuffers() and smgrnblocks_cached(). However, the parts where UnlockBufHdr(bufHdr, buf_state) is called were not hit. But I noticed the same holds for previously existing functions in bufmgr.c. Thank you very much again. Regards, Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi wrote: > > At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila > wrote in > > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi > > wrote: > > > > > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" > > > wrote in: > > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > > > > > value for the threshold based on these results. I would like to > > > > > slightly modify part of the commit message in the first patch as below > > > > > [1], otherwise, I am fine with the changes. Unless you or anyone else > > > > > has any more comments, I am planning to push the 0001 and 0002 > > > > > sometime next week. > > > > > > > > > > [1] > > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so that > > > > > scanning of the whole buffer pool can be avoided when the number of > > > > > blocks to be truncated in a relation is below a certain threshold. For > > > > > such cases, we find the buffers by doing lookups in BufMapping table. > > > > > This improves the performance by more than 100 times in many cases > > > > > when several small tables (tested with 1000 relations) are truncated > > > > > and where the server is configured with a large value of shared > > > > > buffers (greater than 100GB)." > > > > > > > > Thank you for taking a look at the results of the tests. And it's also > > > > consistent with the results from Tang too. > > > > The commit message LGTM. > > > > > > +1. > > > > > > > I have pushed the 0001. > > Thank you for commiting this. > Pushed 0002 as well. -- With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila wrote in > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi > wrote: > > > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" > > wrote in: > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > > > > value for the threshold based on these results. I would like to > > > > slightly modify part of the commit message in the first patch as below > > > > [1], otherwise, I am fine with the changes. Unless you or anyone else > > > > has any more comments, I am planning to push the 0001 and 0002 > > > > sometime next week. > > > > > > > > [1] > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so that > > > > scanning of the whole buffer pool can be avoided when the number of > > > > blocks to be truncated in a relation is below a certain threshold. For > > > > such cases, we find the buffers by doing lookups in BufMapping table. > > > > This improves the performance by more than 100 times in many cases > > > > when several small tables (tested with 1000 relations) are truncated > > > > and where the server is configured with a large value of shared > > > > buffers (greater than 100GB)." > > > > > > Thank you for taking a look at the results of the tests. And it's also > > > consistent with the results from Tang too. > > > The commit message LGTM. > > > > +1. > > > > I have pushed the 0001. Thank you for committing this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi wrote: > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" > wrote in: > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > > > value for the threshold based on these results. I would like to > > > slightly modify part of the commit message in the first patch as below > > > [1], otherwise, I am fine with the changes. Unless you or anyone else > > > has any more comments, I am planning to push the 0001 and 0002 > > > sometime next week. > > > > > > [1] > > > "The recovery path of DropRelFileNodeBuffers() is optimized so that > > > scanning of the whole buffer pool can be avoided when the number of > > > blocks to be truncated in a relation is below a certain threshold. For > > > such cases, we find the buffers by doing lookups in BufMapping table. > > > This improves the performance by more than 100 times in many cases > > > when several small tables (tested with 1000 relations) are truncated > > > and where the server is configured with a large value of shared > > > buffers (greater than 100GB)." > > > > Thank you for taking a look at the results of the tests. And it's also > > consistent with the results from Tang too. > > The commit message LGTM. > > +1. > I have pushed the 0001. -- With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" wrote in > On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote: > > > > On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com > > wrote: > > > > > > [Results for VACUUM on single relation] > > > Average of 5 runs. > > > > > > 1. % REGRESSION > > > % Regression: (patched - master)/master > > > > > > | rel_size | 128MB | 1GB| 20GB | 100GB| > > > |--||||--| > > > | NB/512 | 0.000% | 0.000% | 0.000% | -32.680% | > > > | NB/256 | 0.000% | 0.000% | 0.000% | 0.000% | > > > | NB/128 | 0.000% | 0.000% | 0.000% | -16.502% | > > > | NB/64| 0.000% | 0.000% | 0.000% | -9.841% | > > > | NB/32| 0.000% | 0.000% | 0.000% | -6.219% | > > > | NB/16| 0.000% | 0.000% | 0.000% | 3.323% | > > > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178% | > > > > > > For 100GB shared_buffers, we can observe regression > > > beyond NBuffers/32. So with this, we can conclude > > > that NBuffers/32 is the right threshold. > > > For NBuffers/16 and beyond, the patched performs > > > worse than master. In other words, the cost of for finding > > > to be invalidated buffers gets higher in the optimized path > > > than the traditional path. > > > > > > So in attached V39 patches, I have updated the threshold > > > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32. > > > > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > > value for the threshold based on these results. I would like to > > slightly modify part of the commit message in the first patch as below > > [1], otherwise, I am fine with the changes. Unless you or anyone else > > has any more comments, I am planning to push the 0001 and 0002 > > sometime next week. > > > > [1] > > "The recovery path of DropRelFileNodeBuffers() is optimized so that > > scanning of the whole buffer pool can be avoided when the number of > > blocks to be truncated in a relation is below a certain threshold. For > > such cases, we find the buffers by doing lookups in BufMapping table. 
> > This improves the performance by more than 100 times in many cases > > when several small tables (tested with 1000 relations) are truncated > > and where the server is configured with a large value of shared > > buffers (greater than 100GB)." > > Thank you for taking a look at the results of the tests. And it's also > consistent with the results from Tang too. > The commit message LGTM. +1. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote: > > On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com > wrote: > > > > [Results for VACUUM on single relation] > > Average of 5 runs. > > > > 1. % REGRESSION > > % Regression: (patched - master)/master > > > > | rel_size | 128MB | 1GB| 20GB | 100GB| > > |--||||--| > > | NB/512 | 0.000% | 0.000% | 0.000% | -32.680% | > > | NB/256 | 0.000% | 0.000% | 0.000% | 0.000% | > > | NB/128 | 0.000% | 0.000% | 0.000% | -16.502% | > > | NB/64| 0.000% | 0.000% | 0.000% | -9.841% | > > | NB/32| 0.000% | 0.000% | 0.000% | -6.219% | > > | NB/16| 0.000% | 0.000% | 0.000% | 3.323% | > > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178% | > > > > For 100GB shared_buffers, we can observe regression > > beyond NBuffers/32. So with this, we can conclude > > that NBuffers/32 is the right threshold. > > For NBuffers/16 and beyond, the patched performs > > worse than master. In other words, the cost of for finding > > to be invalidated buffers gets higher in the optimized path > > than the traditional path. > > > > So in attached V39 patches, I have updated the threshold > > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32. > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate > value for the threshold based on these results. I would like to > slightly modify part of the commit message in the first patch as below > [1], otherwise, I am fine with the changes. Unless you or anyone else > has any more comments, I am planning to push the 0001 and 0002 > sometime next week. > > [1] > "The recovery path of DropRelFileNodeBuffers() is optimized so that > scanning of the whole buffer pool can be avoided when the number of > blocks to be truncated in a relation is below a certain threshold. For > such cases, we find the buffers by doing lookups in BufMapping table. 
> This improves the performance by more than 100 times in many cases > when several small tables (tested with 1000 relations) are truncated > and where the server is configured with a large value of shared > buffers (greater than 100GB)." Thank you for taking a look at the results of the tests. And it's also consistent with the results from Tang too. The commit message LGTM. Regards, Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com wrote: > > [Results for VACUUM on single relation] > Average of 5 runs. > > 1. % REGRESSION > % Regression: (patched - master)/master > > | rel_size | 128MB | 1GB| 20GB | 100GB| > |--||||--| > | NB/512 | 0.000% | 0.000% | 0.000% | -32.680% | > | NB/256 | 0.000% | 0.000% | 0.000% | 0.000% | > | NB/128 | 0.000% | 0.000% | 0.000% | -16.502% | > | NB/64| 0.000% | 0.000% | 0.000% | -9.841% | > | NB/32| 0.000% | 0.000% | 0.000% | -6.219% | > | NB/16| 0.000% | 0.000% | 0.000% | 3.323% | > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178% | > > For 100GB shared_buffers, we can observe regression > beyond NBuffers/32. So with this, we can conclude > that NBuffers/32 is the right threshold. > For NBuffers/16 and beyond, the patched performs > worse than master. In other words, the cost of for finding > to be invalidated buffers gets higher in the optimized path > than the traditional path. > > So in attached V39 patches, I have updated the threshold > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32. > Thanks for the detailed tests. NBuffers/32 seems like an appropriate value for the threshold based on these results. I would like to slightly modify part of the commit message in the first patch as below [1], otherwise, I am fine with the changes. Unless you or anyone else has any more comments, I am planning to push the 0001 and 0002 sometime next week. [1] "The recovery path of DropRelFileNodeBuffers() is optimized so that scanning of the whole buffer pool can be avoided when the number of blocks to be truncated in a relation is below a certain threshold. For such cases, we find the buffers by doing lookups in BufMapping table. This improves the performance by more than 100 times in many cases when several small tables (tested with 1000 relations) are truncated and where the server is configured with a large value of shared buffers (greater than 100GB)." -- With Regards, Amit Kapila.
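As a rough illustration of why a threshold-gated choice between the two paths makes sense: the traditional path touches every buffer header (cost proportional to NBuffers), while the optimized path probes the BufMapping table once per (block, fork). The toy model below sketches the committed NBuffers/32 heuristic; the function names are mine, not PostgreSQL's, and real costs also involve partition-lock contention, which is why the cutoff was settled empirically rather than analytically.

```c
#include <assert.h>
#include <stdint.h>

/* Full scan: inspect every buffer header once. */
static uint64_t cost_full_scan(uint64_t nbuffers)
{
    return nbuffers;
}

/* Optimized path: one BufMapping hash probe per (block, fork). */
static uint64_t cost_lookup(uint64_t nblocks, uint64_t nforks)
{
    return nblocks * nforks;
}

/* The committed heuristic: take the lookup path only when the relation
 * size is reliably known (cached, true during recovery) and the block
 * count is small relative to shared buffers (below NBuffers/32). */
static int choose_lookup_path(int cached, uint64_t nblocks, uint64_t nbuffers)
{
    return cached && nblocks < nbuffers / 32;
}
```

For example, truncating a 100-block relation with 4 forks costs on the order of 400 probes on the lookup path, versus scanning all 1,310,720 headers of a 10GB buffer pool on the traditional path.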
RE: [Patch] Optimize dropping of relation buffers using dlist
>I'd like to take a look at them and redo some of the tests using my machine. I'll
>send my test results in a separate email after this.

I did the same tests with Kirk's scripts using the latest patch on my own machine. The results look pretty good and similar to Kirk's. Average of 5 runs.

[VACUUM failover test for 1000 relations]
Unit is second, %reg = (patched - master)/master

| s_b    | Master | Patched | %reg    |
|--------|--------|---------|---------|
| 128 MB | 0.408  | 0.308   | -24.44% |
| 1 GB   | 0.809  | 0.308   | -61.94% |
| 20 GB  | 12.529 | 0.308   | -97.54% |
| 100 GB | 59.310 | 0.369   | -99.38% |

[TRUNCATE failover test for 1000 relations]
Unit is second, %reg = (patched - master)/master

| s_b    | Master | Patched | %reg    |
|--------|--------|---------|---------|
| 128 MB | 0.287  | 0.207   | -27.91% |
| 1 GB   | 0.688  | 0.208   | -69.84% |
| 20 GB  | 12.449 | 0.208   | -98.33% |
| 100 GB | 61.800 | 0.207   | -99.66% |

Besides, I did the test for the threshold value again. (I rechecked my test process and found that I had forgotten to check the data synchronization state on the standby, which may have introduced some noise into my earlier results.) The following results show we can't get further optimization beyond NBuffers/32, just like Kirk's test results, so I agree with Kirk on the threshold value.
%regression:

| rel_size | 128MB | 1GB | 20GB | 100GB |
|----------|-------|-----|------|-------|
| NB/512   | 0%    | 0%  | 0%   | -48%  |
| NB/256   | 0%    | 0%  | 0%   | -33%  |
| NB/128   | 0%    | 0%  | 0%   | -9%   |
| NB/64    | 0%    | 0%  | 0%   | -5%   |
| NB/32    | 0%    | 0%  | -4%  | -3%   |
| NB/16    | 0%    | 0%  | -4%  | 1%    |
| NB/8     | 1%    | 0%  | 1%   | 3%    |

Optimization details (unit: second):

Patched:
| shared_buffers | NBuffers/512 | NBuffers/256 | NBuffers/128 | NBuffers/64 | NBuffers/32 | NBuffers/16 | NBuffers/8 |
|----------------|--------------|--------------|--------------|-------------|-------------|-------------|------------|
| 128M  | 0.107 | 0.107 | 0.107 | 0.106 | 0.107 | 0.107 | 0.107 |
| 1G    | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 |
| 20G   | 0.107 | 0.108 | 0.207 | 0.307 | 0.442 | 0.876 | 1.577 |
| 100G  | 0.208 | 0.308 | 0.559 | 1.060 | 1.961 | 4.567 | 7.922 |

Master:
| shared_buffers | NBuffers/512 | NBuffers/256 | NBuffers/128 | NBuffers/64 | NBuffers/32 | NBuffers/16 | NBuffers/8 |
|----------------|--------------|--------------|--------------|-------------|-------------|-------------|------------|
| 128M  | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.106 |
| 1G    | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 |
| 20G   | 0.107 | 0.107 | 0.208 | 0.308 | 0.457 | 0.910 | 1.560 |
| 100G  | 0.308 | 0.409 | 0.608 | 1.110 | 2.011 | 4.516 | 7.721 |

[Specs]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 128G
OS: CentOS 8

Any questions about my tests are welcome.

Regards,
Tang
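The %reg figures in these tables are straightforward to recompute with a one-line helper (regression_pct is an illustrative name of my own, using the %reg = (patched - master)/master definition stated above; negative means the patched server is faster):

```c
#include <assert.h>

/* Percent regression as defined in the tables above:
 * negative values mean the patched build beats master. */
static double regression_pct(double patched, double master)
{
    return (patched - master) / master * 100.0;
}
```

Plugging in the 128MB VACUUM row (patched 0.308s, master 0.408s) yields roughly -24.5%, matching the -24.44% reported from the unrounded measurements.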
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Kirk, >And if you want to test, I have already indicated the detailed steps including >the scripts I used. Have fun testing! Thank you for sharing the test steps and scripts. I'd like to take a look at them and redo some of the tests using my machine. I'll send my test results in a separate email after this. Regards, Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wed, January 6, 2021 7:04 PM (JST), I wrote:
> I will resume the test similar to Tang, because she also executed the original
> failover test which I have been doing before.
> To avoid confusion and to check if the results from mine and Tang are
> consistent, I also did the recovery/failover test for VACUUM on single
> relation, which I will send in a separate email after this.

A. Test to find the right THRESHOLD

So below are the procedures and results of the VACUUM recovery performance test on a single relation. I followed the advice below and applied the supplementary patch Test-for-threshold.patch on top of V39. This ensures that we always enter the optimized path; we then use the threshold as the relation size.

> >One idea could be to remove "nBlocksToInvalidate <
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> >always use optimized path for the tests. Then use the relation size
> >as NBuffers/128, NBuffers/256, NBuffers/512 for different values of
> >shared buffers as 128MB, 1GB, 20GB, 100GB.

Each relation size is NBuffers/XXX, so I used the attached "rel.sh" script to test from NBuffers/512 up to NBuffers/8 relation size per shared_buffers. I did not go beyond 8 because it took too much time, and I could already observe significant results by that point.

[Vacuum Recovery Performance on Single Relation]
1. Set up synchronous streaming replication. I used the configuration written at the bottom of this email.
2. [Primary] Create 1 table. (rel.sh create)
3. [Primary] Insert data of NBuffers/XXX size. Make sure to use the correct size for the set shared_buffers by commenting out the right size in "insert" of the rel.sh script. (rel.sh insert)
4. [Primary] Delete table. (rel.sh delete)
5. [Standby] Optional: To double-check that the DELETE is reflected on the standby, SELECT count(*) FROM tableXXX; and make sure it returns 0.
6. [Standby] Pause WAL replay.
(rel.sh pause) (This script will execute SELECT pg_wal_replay_pause();)
7. [Primary] VACUUM the single relation. (rel.sh vacuum)
8. [Primary] After the vacuum finishes, stop the server. (rel.sh stop) (The script will execute pg_ctl stop -D $PGDATA -w -mi)
9. [Standby] Resume WAL replay and promote the standby. (rel.sh resume) It basically prints a timestamp when resuming WAL replay, and prints another timestamp when the promotion is done. Compute the time difference.

[Results for VACUUM on single relation]
Average of 5 runs.

1. % REGRESSION
% Regression: (patched - master)/master

| rel_size | 128MB  | 1GB    | 20GB   | 100GB    |
|----------|--------|--------|--------|----------|
| NB/512   | 0.000% | 0.000% | 0.000% | -32.680% |
| NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   |
| NB/128   | 0.000% | 0.000% | 0.000% | -16.502% |
| NB/64    | 0.000% | 0.000% | 0.000% | -9.841%  |
| NB/32    | 0.000% | 0.000% | 0.000% | -6.219%  |
| NB/16    | 0.000% | 0.000% | 0.000% | 3.323%   |
| NB/8     | 0.000% | 0.000% | 0.000% | 8.178%   |

For 100GB shared_buffers, we can observe regression beyond NBuffers/32. So with this, we can conclude that NBuffers/32 is the right threshold. For NBuffers/16 and beyond, the patched version performs worse than master. In other words, the cost of finding the buffers to be invalidated gets higher in the optimized path than in the traditional path.

So in the attached V39 patches, I have updated the threshold BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.

2. [PATCHED]
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB |
|----------|-------|-------|-------|-------|
| NB/512   | 0.106 | 0.106 | 0.106 | 0.206 |
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/128   | 0.106 | 0.106 | 0.206 | 0.506 |
| NB/64    | 0.106 | 0.106 | 0.306 | 0.907 |
| NB/32    | 0.106 | 0.106 | 0.406 | 1.508 |
| NB/16    | 0.106 | 0.106 | 0.706 | 3.109 |
| NB/8     | 0.106 | 0.106 | 1.307 | 6.614 |

3.
MASTER
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB |
|----------|-------|-------|-------|-------|
| NB/512   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/128   | 0.106 | 0.106 | 0.206 | 0.606 |
| NB/64    | 0.106 | 0.106 | 0.306 | 1.006 |
| NB/32    | 0.106 | 0.106 | 0.406 | 1.608 |
| NB/16    | 0.106 | 0.106 | 0.706 | 3.009 |
| NB/8     | 0.106 | 0.106 | 1.307 | 6.114 |

I used the following configurations:

[postgresql.conf]
shared_buffers = 100GB #20GB,1GB,128MB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 10000
max_wal_size = 20GB
# For streaming replication from primary. Don't uncomment on Standby.
synchronous_commit = remote_write
synchronous_standby_names = 'walreceiver'
# For Standby. Don't uncomment on Primary.
# hot_standby = on
#primary_conninfo = 'host=... user=... port=... application_name=walreceiver'

--
B. Regression Test using the NBuffers/32 Threshold (V39 Patches)

For this
RE: [Patch] Optimize dropping of relation buffers using dlist
On Sunday, January 3, 2021 10:35 PM (JST), Amit Kapila wrote: > On Sat, Jan 2, 2021 at 7:47 PM k.jami...@fujitsu.com > wrote: > > > > Happy new year. The V38 LGTM. > > Apologies for a bit of delay on posting the test results, but since > > it's the start of commitfest, here it goes and the results were interesting. > > > > I executed a VACUUM test using the same approach that Tsunakawa-san > > did in [1], but only this time, the total time that DropRelFileNodeBuffers() > took. > > > > Please specify the exact steps like did you deleted all the rows from a table > or > some of it or none before performing Vacuum? How did you measure this > time, did you removed the cached check? It would be better if you share the > scripts and or the exact steps so that the same can be used by others to > reproduce. Basically, I used the TimestampDifference function in DropRelFileNodeBuffers(). I also executed DELETE before VACUUM. I also removed the "nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD" check and used the threshold as the relation size. > Hmm, in the tests done by Tang, the results indicate that in some cases the > patched version is slower at even NBuffers/32, so not sure if we can go to > values shown by you unless she is doing something wrong. I think the > difference in results could be because both of you are using different > techniques to measure the timings, so it might be better if both of you can > share scripts or exact steps used to measure the time and other can use the > same technique and see if we are getting consistent results. Right, since we want consistent results, please disregard the approach that I did. I will resume the test similar to Tang, because she also executed the original failover test which I have been doing before. To avoid confusion and to check if the results from mine and Tang are consistent, I also did the recovery/failover test for VACUUM on single relation, which I will send in a separate email after this. Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

Sorry for my late reply. Here are my answers to your earlier questions.

>BTW, it is not clear why the advantage for single table is not as big as
>multiple tables with the Truncate command

I guess it's the number of table blocks that caused this difference. For a single table, I tested with the block count at the threshold. For multiple tables, I tested with a block count (one, dozens, or hundreds per table) far below the threshold. The closer the table's block count is to the threshold, the smaller the advantage. I tested the 3 situations below with 50 tables when shared_buffers = 20G / 100G.

1. For multiple tables with one, dozens, or hundreds of blocks (far below the threshold) per table, we got a significant improvement, like [1].
2. For multiple tables with half the threshold blocks per table, the advantage becomes smaller, like [2].
3. For multiple tables with the threshold blocks per table, the advantage becomes smaller still, like [3].

[1] 247 blocks per table
| s_b   | master | patched | %reg ((patched-master)/patched) |
|-------|--------|---------|---------------------------------|
| 20GB  | 1.109  | 0.108   | -927%  |
| 100GB | 3.113  | 0.108   | -2782% |

[2] NBuffers/256/2 blocks per table
| s_b   | master | patched | %reg |
|-------|--------|---------|------|
| 20GB  | 2.012  | 1.210   | -66% |
| 100GB | 10.226 | 6.4     | -60% |

[3] NBuffers/256 blocks per table
| s_b   | master | patched | %reg |
|-------|--------|---------|------|
| 20GB  | 3.868  | 2.412   | -60% |
| 100GB | 14.977 | 10.591  | -41% |

>Can you share your exact test steps for any one of the tests? Also, did you
>change autovacumm = off for these tests?

Yes, I configured a streaming replication environment as Kirk did before.
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min

Test steps (e.g. shared_buffers = 20G, NBuffers/512: table blocks = 20*1024*1024/8/512 = 5120, table size (kB) = 20*1024*1024/512 = 40960 kB):
1. (Master) create table test(id int, v_ch varchar, v_ch1 varchar);
2. (Master) Insert about 40MB of data into the table.
3. (Master) delete from table (all rows of the table)
4. (Standby) To test with failover, pause the WAL replay on the standby server. SELECT pg_wal_replay_pause();
5. (Master) VACUUM;
6. (Master) Stop the primary server. pg_ctl stop -D $PGDATA -w
7.
(Standby) Resume WAL replay and promote the standby. (Get the recovery time from this step.) Regards, Tang
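The sizing arithmetic in the steps above (20GB shared_buffers, NBuffers/512 giving 5120 blocks = 40960 kB) generalizes as follows. This is a small sketch assuming the default 8kB block size, with helper names of my own choosing:

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ_KB 8   /* default PostgreSQL block size, in kB */

/* Number of shared buffers for a given shared_buffers setting in kB. */
static uint64_t nbuffers_for(uint64_t shared_buffers_kb)
{
    return shared_buffers_kb / BLCKSZ_KB;
}

/* Table size in blocks for a target fraction NBuffers/divisor. */
static uint64_t target_blocks(uint64_t shared_buffers_kb, uint64_t divisor)
{
    return nbuffers_for(shared_buffers_kb) / divisor;
}

/* The same target expressed in kB, as used in the test steps. */
static uint64_t target_kb(uint64_t shared_buffers_kb, uint64_t divisor)
{
    return target_blocks(shared_buffers_kb, divisor) * BLCKSZ_KB;
}
```

For 20GB shared_buffers this gives 2,621,440 buffers, so a NBuffers/512 test table is 5120 blocks, i.e. 40960 kB, matching the worked example in the steps.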
Re: [Patch] Optimize dropping of relation buffers using dlist
On Sat, Jan 2, 2021 at 7:47 PM k.jami...@fujitsu.com wrote: > > Happy new year. The V38 LGTM. > Apologies for a bit of delay on posting the test results, but since it's the > start of commitfest, here it goes and the results were interesting. > > I executed a VACUUM test using the same approach that Tsunakawa-san did in > [1], > but only this time, the total time that DropRelFileNodeBuffers() took. > Please specify the exact steps like did you deleted all the rows from a table or some of it or none before performing Vacuum? How did you measure this time, did you removed the cached check? It would be better if you share the scripts and or the exact steps so that the same can be used by others to reproduce. > I used only a single relation, tried with various sizes using the values of > threshold: > NBuffers/512..NBuffers/1, as advised by Amit. > > Example of relation sizes for NBuffers/512. > 100GB shared_buffers: 200 MB > 20GB shared_buffers: 40 MB > 1GB shared_buffers:2 MB > 128MB shared_buffers: 0.25 MB > .. > > Although the above table shows that NBuffers/2 would be the > threshold, I know that the cost would vary depending on the machine > specs. I think I can suggest the threshold and pick one from among > NBuffers/2, NBuffers/4 or NBuffers/8, because their values are closer > to the InvalidatedBuffers. > Hmm, in the tests done by Tang, the results indicate that in some cases the patched version is slower at even NBuffers/32, so not sure if we can go to values shown by you unless she is doing something wrong. I think the difference in results could be because both of you are using different techniques to measure the timings, so it might be better if both of you can share scripts or exact steps used to measure the time and other can use the same technique and see if we are getting consistent results. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wednesday, December 30, 2020 8:58 PM, Amit Kapila wrote: > On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying > wrote: > > > > Hi Amit, > > > > In last > > > mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e625718 > 2 > > 564b6%40G08CNEXMBPEKD05.g08.fujitsu.local), > > I've sent you the performance test results(run only 1 time) on single table. > Here is my the retested results(average by 15 times) which I think is more > accurate. > > > > In terms of 20G and 100G, the optimization on 100G is linear, but 20G is > nonlinear(also include test results on shared buffers of 50G/60G), so it's a > little difficult to decide the threshold from the two for me. > > If just consider 100G, I think NBuffers/32 is the optimized max relation > > size. > But I don't know how to judge for 20G. If you have any suggestion, kindly let > me know. > > > > Considering these results NBuffers/64 seems a good threshold as beyond > that there is no big advantage. BTW, it is not clear why the advantage for > single table is not as big as multiple tables with the Truncate command. Can > you share your exact test steps for any one of the tests? > Also, did you change autovacumm = off for these tests, if not then the results > might not be reliable because before you run the test via Vacuum command > autovacuum would have done that work? Happy new year. The V38 LGTM. Apologies for a bit of delay on posting the test results, but since it's the start of commitfest, here it goes and the results were interesting. I executed a VACUUM test using the same approach that Tsunakawa-san did in [1], but only this time, I measured the total time that DropRelFileNodeBuffers() took. I used only a single relation, tried with various sizes using the values of threshold: NBuffers/512..NBuffers/1, as advised by Amit. Example of relation sizes for NBuffers/512.
100GB shared_buffers: 200 MB
20GB shared_buffers:  40 MB
1GB shared_buffers:   2 MB
128MB shared_buffers: 0.25 MB

The regression, which means the patch performs worse than master, only happens
for relation size NBuffers/2 and below for all shared_buffers. The fastest
performance on a single relation was using the relation size NBuffers/512.

[VACUUM Recovery Performance on Single Relation]
Legend: P_XXX (Patch, NBuffers/XXX relation size),
        M_XXX (Master, NBuffers/XXX relation size)
Unit: seconds

| Rel Size | 100 GB s_b | 20 GB s_b | 1 GB s_b | 128 MB s_b |
|----------|------------|-----------|----------|------------|
| P_512    | 0.012594   | 0.001989  | 0.81     | 0.12       |
| M_512    | 0.208757   | 0.046212  | 0.002013 | 0.000295   |
| P_256    | 0.026311   | 0.004416  | 0.000129 | 0.21       |
| M_256    | 0.241017   | 0.047234  | 0.002363 | 0.000298   |
| P_128    | 0.044684   | 0.009784  | 0.000290 | 0.42       |
| M_128    | 0.253588   | 0.047952  | 0.002454 | 0.000319   |
| P_64     | 0.065806   | 0.017444  | 0.000521 | 0.75       |
| M_64     | 0.268311   | 0.050361  | 0.002730 | 0.000339   |
| P_32     | 0.121441   | 0.033431  | 0.001646 | 0.000112   |
| M_32     | 0.285254   | 0.061486  | 0.003640 | 0.000364   |
| P_16     | 0.255503   | 0.065492  | 0.001663 | 0.000144   |
| M_16     | 0.377013   | 0.081613  | 0.003731 | 0.000372   |
| P_8      | 0.560616   | 0.109509  | 0.005954 | 0.000465   |
| M_8      | 0.692596   | 0.112178  | 0.006667 | 0.000553   |
| P_4      | 1.109437   | 0.162924  | 0.011229 | 0.000861   |
| M_4      | 1.162125   | 0.178764  | 0.011635 | 0.000935   |
| P_2      | 2.202231   | 0.317832  | 0.020783 | 0.002646   |
| M_2      | 1.583959   | 0.306269  | 0.015705 | 0.002021   |
| P_1      | 3.080032   | 0.632747  | 0.032183 | 0.002660   |
| M_1      | 2.705485   | 0.543970  | 0.030658 | 0.001941   |

%reg = (Patched - Master) / Patched

| %reg_relsize | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b |
|--------------|------------|------------|------------|------------|
| %reg_512     | -1557.587% | -2223.006% | -2385.185% | -2354.167% |
| %reg_256     | -816.041%  | -969.691%  | -1731.783% | -1319.048% |
| %reg_128     | -467.514%  | -390.123%  | -747.008%  | -658.333%  |
| %reg_64      | -307.727%  | -188.704%  | -423.992%  | -352.000%  |
| %reg_32      | -134.891%  | -83.920%   | -121.097%  | -225.970%  |
| %reg_16      | -47.557%   | -24.614%   | -124.279%  | -157.390%  |
| %reg_8       | -23.542%   | -2.437%    | -11.967%   | -19.010%   |
| %reg_4       | -4.749%    | -9.722%    | -3.608%    | -8.595%    |
| %reg_2       | 28.075%    | 3.638%     | 24.436%    | 23.615%    |
| %reg_1       | 12.160%    | 14.030%    | 4.739%     | 27.010%    |

Since our goal is to get the approximate threshold where the cost of finding
the to-be-invalidated buffers gets higher in the optimized path than in the
traditional path:

A. Traditional Path
1. For each buffer in shared_buffers, compare the relfilenode.
2. LockBufHdr()
3. Compare block number, InvalidateBuffers() if
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying wrote:
>
> Hi Amit,
>
> In last mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
> I've sent you the performance test results(run only 1 time) on single table.
> Here are my retested results (averaged over 15 runs), which I think are more
> accurate.
>
> In terms of 20G and 100G, the optimization on 100G is linear, but 20G is
> nonlinear(also include test results on shared buffers of 50G/60G), so it's a
> little difficult to decide the threshold from the two for me.
> If just considering 100G, I think NBuffers/32 is the optimized max relation
> size. But I don't know how to judge for 20G. If you have any suggestion,
> kindly let me know.
>

Considering these results, NBuffers/64 seems a good threshold, as beyond that
there is no big advantage. BTW, it is not clear why the advantage for a single
table is not as big as for multiple tables with the Truncate command. Can you
share your exact test steps for any one of the tests?
Also, did you change autovacuum = off for these tests? If not, the results
might not be reliable, because before you run the test via the Vacuum command,
autovacuum would have done that work.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

In last mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
I've sent you the performance test results(run only 1 time) on single table.
Here are my retested results (averaged over 15 runs), which I think are more
accurate.

In terms of 20G and 100G, the optimization on 100G is linear, but 20G is
nonlinear(also include test results on shared buffers of 50G/60G), so it's a
little difficult to decide the threshold from the two for me.
If just considering 100G, I think NBuffers/32 is the optimized max relation
size. But I don't know how to judge for 20G. If you have any suggestion,
kindly let me know.

#%reg                128M   1G     20G    100G
----------------------------------------------
%reg(NBuffers/512)   0%     -1%    -5%    -26%
%reg(NBuffers/256)   0%     0%     5%     -20%
%reg(NBuffers/128)   -1%    -1%    -10%   -16%
%reg(NBuffers/64)    -1%    0%     0%     -8%
%reg(NBuffers/32)    0%     0%     -2%    -4%
%reg(NBuffers/16)    0%     0%     -6%    4%
%reg(NBuffers/8)     1%     0%     2%     -2%
%reg(NBuffers/4)     0%     0%     2%     2%

Optimization details(unit: second):

patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.107         0.107        0.107        0.107        0.108       0.208
1G              0.107         0.108         0.107         0.108        0.208        0.208        0.308       0.409
20G             0.199         0.299         0.317         0.408        0.591        0.900        1.561       2.866
100G            0.318         0.381         0.645         0.992        1.913        3.640        6.615       13.389

master(HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.108         0.108        0.107        0.107        0.107       0.208
1G              0.108         0.108         0.108         0.108        0.208        0.207        0.308       0.409
20G             0.208         0.283         0.350         0.408        0.601        0.955        1.529       2.806
100G            0.400         0.459         0.751         1.068        1.984        3.506        6.735       13.101

Regards,
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

>I think one table with a varying amount of data is sufficient for the vacuum
>test.
>I think with more number of tables there is a greater chance of variation.
>We have previously used multiple tables in one of the tests because of
>the Truncate operation (which uses DropRelFileNodesAllBuffers that
>takes multiple relations as input) and that is not true for Vacuum operation
>which I suppose you are testing here.

I retested performance on a single table several times; the table size varies
with the BUF_DROP_FULL_SCAN_THRESHOLD for different shared buffers.

When shared_buffers is below 20G, there were no significant changes between
master(HEAD) and patched.
And according to the results compared between 20G and 100G, we can get
optimization up to NBuffers/128, but there is no benefit from NBuffers/256.
I've tested many times, and most times the same results came out; I don't know
why. But if I used 5 tables (each table sized at BUF_DROP_FULL_SCAN_THRESHOLD),
then we can get benefit from it (NBuffers/256).

Here are my test results for a single table. If you have any question or
suggestion, kindly let me know.
%reg = (patched - master(HEAD)) / patched

Optimized percentage:
shared_buffers  %reg(NBuffers/512)  %reg(NBuffers/256)  %reg(NBuffers/128)  %reg(NBuffers/64)  %reg(NBuffers/32)  %reg(NBuffers/16)  %reg(NBuffers/8)  %reg(NBuffers/4)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
128M            0%                  0%                  -1%                 0%                 1%                 0%                 0%                0%
1G              -1%                 0%                  -1%                 0%                 0%                 0%                 0%                0%
20G             0%                  0%                  -33%                0%                 0%                 -13%               0%                0%
100G            -32%                0%                  -49%                0%                 10%                30%                0%                6%

Result details(unit: second):

patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.107         0.107        0.108        0.107        0.108       0.208
1G              0.107         0.107         0.107         0.108        0.208        0.208        0.308       0.409
20G             0.208         0.308         0.308         0.409        0.609        0.808        1.511       2.713
100G            0.309         0.408         0.609         1.010        2.011        5.017        6.620       13.931

master(HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
-----------------------------------------------------------------------------------------------------------------------
128M            0.107         0.107         0.108         0.107        0.107        0.107        0.108       0.208
1G              0.108         0.107         0.108         0.108        0.208        0.207        0.308       0.409
20G             0.208         0.309         0.409         0.409        0.609        0.910        1.511       2.712
100G            0.408         0.408         0.909         1.010        1.811        3.515        6.619       13.032

Regards,
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

>I think one table with a varying amount of data is sufficient for the vacuum
>test.
>I think with more number of tables there is a greater chance of variation.
>We have previously used multiple tables in one of the tests because of the
>Truncate operation (which uses DropRelFileNodesAllBuffers that takes multiple
>relations as input) and that is not true for Vacuum operation which I suppose
>you are testing here.

Thanks for your advice and kind explanation. I'll continue the threshold test
with one single table.

Regards,
Tang
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Dec 25, 2020 at 9:28 AM Tang, Haiying wrote: > > Hi Amit, > > >But how can we conclude NBuffers/128 is the maximum relation size? > >Because the maximum size would be where the performance is worse than > >the master, no? I guess we need to try by NBuffers/64, NBuffers/32, > > till we get the threshold where master performs better. > > You are right, we should keep on testing until no optimization. > > >I think we should find a better way to display these numbers because in > >cases like where master takes 537.978s and patch takes 3.815s > > Yeah, I think we can change the %reg formula from (patched- master)/ master > to (patched- master)/ patched. > > >Table size should be more than 8k to get all this data because 8k means > >just one block. I guess either it is a typo or some other mistake. > > 8k here is the relation size, not data size. > For example, when I tested recovery performance of 400M relation size, I used > 51200 tables(8k per table). > Please let me know if you think this is not appropriate. > I think one table with a varying amount of data is sufficient for the vacuum test. I think with more number of tables there is a greater chance of variation. We have previously used multiple tables in one of the tests because of the Truncate operation (which uses DropRelFileNodesAllBuffers that takes multiple relations as input) and that is not true for Vacuum operation which I suppose you are testing here. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit,

>But how can we conclude NBuffers/128 is the maximum relation size?
>Because the maximum size would be where the performance is worse than
>the master, no? I guess we need to try by NBuffers/64, NBuffers/32,
>till we get the threshold where master performs better.

You are right, we should keep on testing until there is no optimization.

>I think we should find a better way to display these numbers because in
>cases like where master takes 537.978s and patch takes 3.815s

Yeah, I think we can change the %reg formula from (patched - master) / master
to (patched - master) / patched.

>Table size should be more than 8k to get all this data because 8k means
>just one block. I guess either it is a typo or some other mistake.

8k here is the relation size, not the data size.
For example, when I tested recovery performance of 400M relation size, I used
51200 tables (8k per table).
Please let me know if you think this is not appropriate.

Regards,
Tang

-----Original Message-----
From: Amit Kapila
Sent: Thursday, December 24, 2020 9:11 PM
To: Tang, Haiying/唐 海英
Cc: Tsunakawa, Takayuki/綱川 貴之 ; Jamison, Kirk/ジャミソン カーク ; Kyotaro Horiguchi ; Andres Freund ; Tom Lane ; Thomas Munro ; Robert Haas ; Tomas Vondra ; pgsql-hackers
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist

On Thu, Dec 24, 2020 at 2:31 PM Tang, Haiying wrote:
>
> Hi Amit, Kirk
>
> >One idea could be to remove "nBlocksToInvalidate <
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> >always use optimized path for the tests. Then use the relation size
> >as NBuffers/128, NBuffers/256, NBuffers/512 for different values of
> >shared buffers as 128MB, 1GB, 20GB, 100GB.
>
> I followed your idea to remove the check and use different relation sizes
> for different shared buffers as 128M, 1G, 20G, 50G (my environment can't
> support 100G, so I chose 50G).
> According to the results, all three thresholds can get optimized, even
> NBuffers/128 when shared_buffers > 128M.
> IMHO, I think NBuffers/128 is the maximum relation size we can get
> optimization in the three thresholds. Please let me know if I made something
> wrong.
>

But how can we conclude NBuffers/128 is the maximum relation size? Because the
maximum size would be where the performance is worse than the master, no? I
guess we need to try NBuffers/64, NBuffers/32, ... till we get the threshold
where master performs better.

> Recovery after vacuum test results as below 'Optimized percentage' and
> 'Optimization details(unit: second)' show:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
> NBuffers/128
> %reg: means (patched(512) - master(512)) / master(512)
>
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -----------------------------------------------
> 128M            0%         -1%        -1%
> 1G              -65%       -49%       -62%
> 20G             -98%       -98%       -98%
> 50G             -99%       -99%       -99%
>
> Optimization details(unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -----------------------------------------------------------------------------------------------
> 128M            0.108        0.108         0.109        0.108         0.109        0.108
> 1G              0.310        0.107         0.410        0.208         0.811        0.309
> 20G             94.493       1.511         188.777      3.014         380.633      6.020
> 50G             537.978      3.815         867.453      7.524         1559.076     15.541
>

I think we should find a better way to display these numbers because in cases
like where master takes 537.978s and patch takes 3.815s, it is clear that the
patch has reduced the time by more than 100 times, whereas in your table it
shows 99%.

> Test prepare:
> Below is the test table amount for different shared buffers. Each table
> size is 8k,

Table size should be more than 8k to get all this data because 8k means just
one block. I guess either it is a typo or some other mistake.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Kirk,

>Perhaps there is a confusing part in the presented table where you indicated
>master(512), master(256), master(128).
>Because the master is not supposed to use the BUF_DROP_FULL_SCAN_THRESHOLD and
>just execute the existing default full scan of NBuffers.
>Or I may have misunderstood something?

Sorry for the confusion; I didn't make it clear. I didn't use
BUF_DROP_FULL_SCAN_THRESHOLD for master. master(512) means the test table
amount in master is the same as for patched(512), and likewise for master(256)
and master(128). I meant to mark 512/256/128 to distinguish the results in
master for the three thresholds (applied in the patches).

Regards,
Tang
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 23, 2020 at 6:27 PM k.jami...@fujitsu.com wrote: > > > It compiles. Passes the regression tests too. > Your feedbacks are definitely welcome. > Thanks, the patches look good to me now. I have slightly edited the patches for comments, commit messages, and removed the duplicate code/check in smgrnblocks. I have changed the order of patches (moved Assert related patch to last because as mentioned earlier, I am not sure if we want to commit it.). We might still have to change the scan threshold value based on your and Tang-San's results. -- With Regards, Amit Kapila. v38-0001-Optimize-DropRelFileNodeBuffers-for-recovery.patch Description: Binary data v38-0002-Optimize-DropRelFileNodesAllBuffers-for-recovery.patch Description: Binary data v38-0003-Prevent-invalidating-blocks-in-smgrextend-during.patch Description: Binary data
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thu, December 24, 2020 6:02 PM JST, Tang, Haiying wrote: > Hi Amit, Kirk > > >One idea could be to remove "nBlocksToInvalidate < > >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached && > >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it > always > >use optimized path for the tests. Then use the relation size as > >NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared > >buffers as 128MB, 1GB, 20GB, 100GB. > > I followed your idea to remove check and use different relation size for > different shared buffers as 128M,1G,20G,50G(my environment can't support > 100G, so I choose 50G). > According to results, all three thresholds can get optimized, even > NBuffers/128 when shared_buffers > 128M. > IMHO, I think NBuffers/128 is the maximum relation size we can get > optimization in the three thresholds, Please let me know if I made something > wrong. Hello Tang, Thank you very much again for testing. Perhaps there is a confusing part in the presented table where you indicated master(512), master(256), master(128). Because the master is not supposed to use the BUF_DROP_FULL_SCAN_THRESHOLD and just execute the existing default full scan of NBuffers. Or I may have misunderstood something? 
> Recovery after vacuum test results as below 'Optimized percentage' and
> 'Optimization details(unit: second)' show:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
> NBuffers/128
> %reg: means (patched(512) - master(512)) / master(512)
>
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -----------------------------------------------
> 128M            0%         -1%        -1%
> 1G              -65%       -49%       -62%
> 20G             -98%       -98%       -98%
> 50G             -99%       -99%       -99%
>
> Optimization details(unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -----------------------------------------------------------------------------------------------
> 128M            0.108        0.108         0.109        0.108         0.109        0.108
> 1G              0.310        0.107         0.410        0.208         0.811        0.309
> 20G             94.493       1.511         188.777      3.014         380.633      6.020
> 50G             537.978      3.815         867.453      7.524         1559.076     15.541
>
> Test prepare:
> Below is the test table amount for different shared buffers. Each table size
> is 8k, so I use table amount = NBuffers/(512 or 256 or 128):
> shared_buffers  NBuffers  NBuffers/512  NBuffers/256  NBuffers/128
> ------------------------------------------------------------------
> 128M            16384     32            64            128
> 1G              131072    256           512           1024
> 20G             2621440   5120          10240         20480
> 50G             6553600   12800         25600         51200
>
> Besides, I also did single table performance test.
> Still, NBuffers/128 is the max relation size for which we can get
> optimization.
>
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -----------------------------------------------
> 128M            0%         0%         -1%
> 1G              0%         1%         0%
> 20G             0%         -24%       -25%
> 50G             0%         -24%       -20%
>
> Optimization details(unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -----------------------------------------------------------------------------------------------
> 128M            0.107        0.107         0.108        0.108         0.108        0.107
> 1G              0.108        0.108         0.107        0.108         0.108        0.108
> 20G             0.208        0.208         0.409        0.309         0.409        0.308
> 50G             0.309        0.308         0.408        0.309         0.509        0.408

I will also post results from my machine in the next email. Adding what Amit
mentioned, we should also test for NBuffers/64, etc. until we determine which
threshold performs worse than master.

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Thu, Dec 24, 2020 at 2:31 PM Tang, Haiying wrote: > > Hi Amit, Kirk > > >One idea could be to remove "nBlocksToInvalidate < > >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached && > >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always > >use optimized path for the tests. Then use the relation size as > >NBuffers/128, NBuffers/256, NBuffers/512 for different values of > >shared buffers as 128MB, 1GB, 20GB, 100GB. > > I followed your idea to remove check and use different relation size for > different shared buffers as 128M,1G,20G,50G(my environment can't support > 100G, so I choose 50G). > According to results, all three thresholds can get optimized, even > NBuffers/128 when shared_buffers > 128M. > IMHO, I think NBuffers/128 is the maximum relation size we can get > optimization in the three thresholds, Please let me know if I made something > wrong. > But how can we conclude NBuffers/128 is the maximum relation size? Because the maximum size would be where the performance is worse than the master, no? I guess we need to try by NBuffers/64, NBuffers/32, till we get the threshold where master performs better. 
> Recovery after vacuum test results as below 'Optimized percentage' and
> 'Optimization details(unit: second)' show:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
> NBuffers/128
> %reg: means (patched(512) - master(512)) / master(512)
>
> Optimized percentage:
> shared_buffers  %reg(512)  %reg(256)  %reg(128)
> -----------------------------------------------
> 128M            0%         -1%        -1%
> 1G              -65%       -49%       -62%
> 20G             -98%       -98%       -98%
> 50G             -99%       -99%       -99%
>
> Optimization details(unit: second):
> shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
> -----------------------------------------------------------------------------------------------
> 128M            0.108        0.108         0.109        0.108         0.109        0.108
> 1G              0.310        0.107         0.410        0.208         0.811        0.309
> 20G             94.493       1.511         188.777      3.014         380.633      6.020
> 50G             537.978      3.815         867.453      7.524         1559.076     15.541
>

I think we should find a better way to display these numbers because in cases
like where master takes 537.978s and patch takes 3.815s, it is clear that the
patch has reduced the time by more than 100 times, whereas in your table it
shows 99%.

> Test prepare:
> Below is the test table amount for different shared buffers. Each table size
> is 8k,

Table size should be more than 8k to get all this data because 8k means just
one block. I guess either it is a typo or some other mistake.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Amit, Kirk

>One idea could be to remove "nBlocksToInvalidate <
>BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
>nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
>use optimized path for the tests. Then use the relation size as
>NBuffers/128, NBuffers/256, NBuffers/512 for different values of
>shared buffers as 128MB, 1GB, 20GB, 100GB.

I followed your idea to remove the check and use different relation sizes for
different shared buffers as 128M, 1G, 20G, 50G (my environment can't support
100G, so I chose 50G).
According to the results, all three thresholds can get optimized, even
NBuffers/128 when shared_buffers > 128M.
IMHO, I think NBuffers/128 is the maximum relation size we can get
optimization in the three thresholds. Please let me know if I made something
wrong.

Recovery after vacuum test results as below 'Optimized percentage' and
'Optimization details(unit: second)' show:
(512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
NBuffers/128
%reg: means (patched(512) - master(512)) / master(512)

Optimized percentage:
shared_buffers  %reg(512)  %reg(256)  %reg(128)
-----------------------------------------------
128M            0%         -1%        -1%
1G              -65%       -49%       -62%
20G             -98%       -98%       -98%
50G             -99%       -99%       -99%

Optimization details(unit: second):
shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
-----------------------------------------------------------------------------------------------
128M            0.108        0.108         0.109        0.108         0.109        0.108
1G              0.310        0.107         0.410        0.208         0.811        0.309
20G             94.493       1.511         188.777      3.014         380.633      6.020
50G             537.978      3.815         867.453      7.524         1559.076     15.541

Test prepare:
Below is the test table amount for different shared buffers. Each table size
is 8k, so I use table amount = NBuffers/(512 or 256 or 128):
shared_buffers  NBuffers  NBuffers/512  NBuffers/256  NBuffers/128
------------------------------------------------------------------
128M            16384     32            64            128
1G              131072    256           512           1024
20G             2621440   5120          10240         20480
50G             6553600   12800         25600         51200

Besides, I also did a single-table performance test.
Still, NBuffers/128 is the max relation size for which we can get
optimization.

Optimized percentage:
shared_buffers  %reg(512)  %reg(256)  %reg(128)
-----------------------------------------------
128M            0%         0%         -1%
1G              0%         1%         0%
20G             0%         -24%       -25%
50G             0%         -24%       -20%

Optimization details(unit: second):
shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
-----------------------------------------------------------------------------------------------
128M            0.107        0.107         0.108        0.108         0.108        0.107
1G              0.108        0.108         0.107        0.108         0.108        0.108
20G             0.208        0.208         0.409        0.309         0.409        0.308
50G             0.309        0.308         0.408        0.309         0.509        0.408

Any question on my test results is welcome.

Regards,
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> It compiles. Passes the regression tests too.
> Your feedbacks are definitely welcome.

The code looks correct and has become further compact. Remains ready for
committer.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wed, December 23, 2020 5:57 PM (GMT+9), Amit Kapila wrote:
> >
> > At Wed, 23 Dec 2020 04:22:19 +0000, "tsunakawa.ta...@fujitsu.com"
> > wrote in
> > > From: Amit Kapila
> > > > + /* Get the number of blocks for a relation's fork */
> > > > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > > > +
> > > > + if (!cached)
> > > > + goto buffer_full_scan;
> > > >
> > > > Why do we need to use goto here? We can simply break from the loop
> > > > and then check if (cached && nBlocksToInvalidate <
> > > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto
> > > > if possible without much complexity.
> > >
> > > That's because two for loops are nested -- breaking there only exits
> > > the inner loop. (I thought the same as you at first... And I
> > > understand some people have an allergy to goto; I think modest use of
> > > goto makes the code readable.)
> >
> > I don't strongly oppose gotos, but in this case the outer loop can
> > break on the same condition as the inner loop, since cached is true
> > whenever the inner loop runs to the end. It is needed to initialize
> > the variable cached with true, instead of false, though.
> >
>
> +1. I think it is better to avoid goto here as it can be done without
> introducing any complexity or making code any less readable.

I also do not mind, so I have removed the goto and followed the advice of all
reviewers. It works fine in the latest attached patch 0003.

Attached herewith are the sets of patches. 0002 & 0003 have the following
changes:

1. I have removed the modifications in smgrnblocks(). So the modifications of
other functions that use smgrnblocks() in the previous patch versions were
also reverted.

2. Introduced a new API smgrnblocks_cached() instead, which returns either a
cached size for the specified fork or InvalidBlockNumber. Since
InvalidBlockNumber is used, I think it is logical not to use the additional
boolean parameter "cached" in the function, as it would be redundant.
In 0003, I only used "cached" as a Boolean variable to do the trick of not
using goto. This function is called both in DropRelFileNodeBuffers() and
DropRelFileNodesAllBuffers().

3. Modified some minor comments from the patch and commit logs.

It compiles. Passes the regression tests too.
Your feedback is definitely welcome.

Regards,
Kirk Jamison

v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description: v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description: v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch
v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description: v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 23, 2020 at 10:42 AM Kyotaro Horiguchi wrote:
>
> At Wed, 23 Dec 2020 04:22:19 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> > From: Amit Kapila
> > > + /* Get the number of blocks for a relation's fork */
> > > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > > +
> > > + if (!cached)
> > > + goto buffer_full_scan;
> > >
> > > Why do we need to use goto here? We can simply break from the loop and
> > > then check if (cached && nBlocksToInvalidate <
> > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > > possible without much complexity.
> >
> > That's because two for loops are nested -- breaking there only exits the
> > inner loop. (I thought the same as you at first... And I understand some
> > people have an allergy to goto; I think modest use of goto makes the code
> > readable.)
>
> I don't strongly oppose gotos, but in this case the outer loop can
> break on the same condition as the inner loop, since cached is true
> whenever the inner loop runs to the end. It is needed to initialize
> the variable cached with true, instead of false, though.
>

+1. I think it is better to avoid goto here as it can be done without
introducing any complexity or making code any less readable.

--
With Regards,
Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 23, 2020 at 1:07 PM k.jami...@fujitsu.com wrote:
>
> On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
>
> > In this code, I am slightly worried about the additional cost of each time
> > checking smgrexists. Consider a case where there are many relations and only
> > one or few of them have not cached the information, in such a case we will
> > pay the cost of smgrexists for many relations without even going to the
> > optimized path. Can we avoid that in some way or at least reduce its usage
> > to only when it is required? One idea could be that we first check if the
> > nblocks information is cached and if so then we don't need to call
> > smgrnblocks, otherwise, check if it exists. For this, we need an API like
> > smgrnblocks_cached, something we discussed earlier but preferred the
> > current API. Do you have any better ideas?
>
> Right. I understand the point that let's say there are 100 relations, and
> the first 99 non-local relations happen to enter the optimization path, but
> the last one does not; calling smgrexists() would be too costly and a waste
> of time in that case.
> The only solution I could think of, as you mentioned, is to reintroduce the
> new API which we discussed before: smgrnblocks_cached().
> It's possible that we call smgrexists() only if smgrnblocks_cached() returns
> InvalidBlockNumber.
> So if everyone agrees, we can add that API smgrnblocks_cached() which will
> include the "cached" flag parameter, and remove the additional flag
> modifications from smgrnblocks(). In this case, both DropRelFileNodeBuffers()
> and DropRelFileNodesAllBuffers() will use the new API.
>

Yeah, let's do it that way unless anyone has a better idea to suggest.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila wrote:
> > Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
>
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> ======================================================================
> 1.
> DropRelFileNodesAllBuffers()
> {
> ..
> +buffer_full_scan:
> + pfree(block);
> + nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
> + for (i = 0; i < n; i++)
> +     nodes[i] = smgr_reln[i]->smgr_rnode.node;
> +
> ..
> }
>
> How is it correct to assign nodes array directly from smgr_reln? There is no
> one-to-one correspondence. If you see the code before the patch, the passed
> array can have a mix of temp and non-temp relation information.

You are right. I mistakenly removed the array node that should have been
allocated for non-local relations. So I fixed that by doing:

    SMgrRelation *rels;

    rels = palloc(sizeof(SMgrRelation) * nnodes); /* non-local relations */

    /* If it's a local relation, it's localbuf.c's problem. */
    for (i = 0; i < nnodes; i++)
    {
        if (RelFileNodeBackendIsTemp(smgr_reln[i]->smgr_rnode))
        {
            if (smgr_reln[i]->smgr_rnode.backend == MyBackendId)
                DropRelFileNodeAllLocalBuffers(smgr_reln[i]->smgr_rnode.node);
        }
        else
            rels[n++] = smgr_reln[i];
    }
    ...
    if (n == 0)
    {
        pfree(rels);
        return;
    }
    ...
    /* traditional path: */
    pfree(block);
    nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
    for (i = 0; i < n; i++)
        nodes[i] = rels[i]->smgr_rnode.node;

> 2.
> + for (i = 0; i < n; i++)
> {
> - pfree(nodes);
> + for (j = 0; j <= MAX_FORKNUM; j++)
> + {
> + /*
> +  * Assign InvalidBlockNumber to a block if a relation
> +  * fork does not exist, so that we can skip it later
> +  * when dropping the relation buffers.
> +  */
> + if (!smgrexists(smgr_reln[i], j))
> + {
> + block[i][j] = InvalidBlockNumber;
> + continue;
> + }
> +
> + /* Get the number of blocks for a relation's fork */
> + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
>
> Similar to above, how can we assume smgr_reln array has all non-local
> relations? Have we tried the case with a mix of temp and non-temp relations?

Similar to the above reply.

> In this code, I am slightly worried about the additional cost of each time
> checking smgrexists. Consider a case where there are many relations and only
> one or few of them have not cached the information, in such a case we will
> pay the cost of smgrexists for many relations without even going to the
> optimized path. Can we avoid that in some way or at least reduce its usage to
> only when it is required? One idea could be that we first check if the
> nblocks information is cached and if so then we don't need to call
> smgrnblocks, otherwise, check if it exists. For this, we need an API like
> smgrnblocks_cached, something we discussed earlier but preferred the
> current API. Do you have any better ideas?

Right. I understand the point that let's say there are 100 relations, and
the first 99 non-local relations happen to enter the optimization path, but
the last one does not; calling smgrexists() would be too costly and a waste of
time in that case.
The only solution I could think of, as you mentioned, is to reintroduce the
new API which we discussed before: smgrnblocks_cached().
It's possible that we call smgrexists() only if smgrnblocks_cached() returns
InvalidBlockNumber.
So if everyone agrees, we can add that API smgrnblocks_cached() which will
include the "cached" flag parameter, and remove the additional flag
modifications from smgrnblocks(). In this case, both DropRelFileNodeBuffers()
and DropRelFileNodesAllBuffers() will use the new API.

Thoughts?

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
Hi,
It is possible to come out of the nested loop without goto.

+ bool cached = true;
...
+ * to that fork during recovery.
+ */
+ for (i = 0; i < n && cached; i++)
...
+ if (!cached)
+     break;

Here I changed the initial value for cached to true so that we enter the
outer loop. In place of the original goto, we break out of the inner loop
and exit the outer loop.

Cheers

On Tue, Dec 22, 2020 at 8:22 PM tsunakawa.ta...@fujitsu.com <
tsunakawa.ta...@fujitsu.com> wrote:
> From: Amit Kapila
> > + /* Get the number of blocks for a relation's fork */
> > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > +
> > + if (!cached)
> > + goto buffer_full_scan;
> >
> > Why do we need to use goto here? We can simply break from the loop and
> > then check if (cached && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > possible without much complexity.
>
> That's because two for loops are nested -- breaking there only exits the
> inner loop. (I thought the same as you at first... And I understand some
> people have an allergy to goto; I think modest use of goto makes the code
> readable.)
>
> Regards
> Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Wed, 23 Dec 2020 04:22:19 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Amit Kapila
> > + /* Get the number of blocks for a relation's fork */
> > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > +
> > + if (!cached)
> > + goto buffer_full_scan;
> >
> > Why do we need to use goto here? We can simply break from the loop and
> > then check if (cached && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > possible without much complexity.
>
> That's because two for loops are nested -- breaking there only exits the
> inner loop. (I thought the same as you at first... And I understand some
> people have an allergy to goto; I think modest use of goto makes the code
> readable.)

I don't strongly oppose goto's, but in this case the outer loop can break
on the same condition as the inner loop, since cached is true whenever the
inner loop runs to the end. It is needed to initialize the variable cached
with true, instead of false, though. The same pattern is seen in the tree.

Regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila
> + /* Get the number of blocks for a relation's fork */
> + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> +
> + if (!cached)
> + goto buffer_full_scan;
>
> Why do we need to use goto here? We can simply break from the loop and
> then check if (cached && nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> possible without much complexity.

That's because two for loops are nested -- breaking there only exits the
inner loop. (I thought the same as you at first... And I understand some
people have an allergy to goto; I think modest use of goto makes the code
readable.)

Regards
Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 5:41 PM Amit Kapila wrote:
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila wrote:
> >
> > Apart from tests, do let me know if you are happy with the changes in
> > the patch? Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
>
> In this code, I am slightly worried about the additional cost of each
> time checking smgrexists. Consider a case where there are many
> relations and only one or a few of them have not cached the information;
> in such a case we will pay the cost of smgrexists for many relations
> without even going to the optimized path. Can we avoid that in some
> way or at least reduce its usage to only when it is required? One idea
> could be that we first check if the nblocks information is cached and
> if so then we don't need to call smgrnblocks, otherwise, check if it
> exists. For this, we need an API like smgrnblocks_cached, something we
> discussed earlier but preferred the current API. Do you have any
> better ideas?
>

One more idea, which is not better than what I mentioned above, is that we
completely avoid calling smgrexists and rely on smgrnblocks. It will throw
an error in case the particular fork doesn't exist and we can use
try .. catch to handle it. I just mentioned it as it came across my mind
but I don't think it is better than the previous one.

One more thing about the patch:
+ /* Get the number of blocks for a relation's fork */
+ block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
+
+ if (!cached)
+ goto buffer_full_scan;

Why do we need to use goto here? We can simply break from the loop and
then check if (cached && nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
possible without much complexity.

--
With Regards,
Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 23, 2020 at 6:30 AM k.jami...@fujitsu.com wrote: > > On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote: > > > Apart from tests, do let me know if you are happy with the changes in the > > patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization patch. > > Thank you, Amit. > That looks more neat, combining the previous patches 0002-0003, so I am +1 > with the changes because of the clearer explanations for the threshold and > optimization path in DropRelFileNodeBuffers. Thanks for cleaning my patch > sets. > Hope we don't forget the 0001 patch's assertion in smgrextend() to ensure > that we > do it safely too and that we are not InRecovery. > I think the 0001 is mostly for test purposes but we will see once the main patches are ready. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote:
> Attached, please find the updated patch with the following modifications:
> (a) updated comments at various places, especially to tell why this is a
> safe optimization, (b) merged the patch for extending the smgrnblocks and
> vacuum optimization patch, (c) made minor cosmetic changes and ran
> pgindent, and (d) updated the commit message. BTW, this optimization will
> help not only vacuum but also truncate when it is done in the same
> transaction in which the relation is created. I would like to see certain
> tests to ensure that the value we choose for
> BUF_DROP_FULL_SCAN_THRESHOLD is correct. I see that some testing has
> been done earlier [1] for this threshold but I am still not able to
> conclude. The criteria to find the right threshold should be what is the
> maximum size of relation to be truncated above which we don't get benefit
> with this optimization.
>
> One idea could be to remove the "nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD" part of the check "if (cached &&
> nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
> uses the optimized path for the tests. Then use the relation size as
> NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
> buffers as 128MB, 1GB, 20GB, 100GB.

Alright. I will also repeat the tests with the different threshold
settings; thank you for the tip.

> Apart from tests, do let me know if you are happy with the changes in the
> patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization
> patch.

Thank you, Amit. That looks neater, combining the previous patches
0002-0003, so I am +1 with the changes because of the clearer explanations
for the threshold and optimization path in DropRelFileNodeBuffers. Thanks
for cleaning up my patch sets. I hope we don't forget the 0001 patch's
assertion in smgrextend(), to ensure that we do it safely too and that we
are not InRecovery.

> [1] -
> https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila wrote:
>
> Apart from tests, do let me know if you are happy with the changes in
> the patch? Next, I'll look into DropRelFileNodesAllBuffers()
> optimization patch.
>

Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]

1.
DropRelFileNodesAllBuffers()
{
..
+buffer_full_scan:
+ pfree(block);
+ nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
+ for (i = 0; i < n; i++)
+     nodes[i] = smgr_reln[i]->smgr_rnode.node;
+
..
}

How is it correct to assign the nodes array directly from smgr_reln? There
is no one-to-one correspondence. If you see the code before the patch, the
passed array can have a mix of temp and non-temp relation information.

2.
+ for (i = 0; i < n; i++)
{
- pfree(nodes);
+ for (j = 0; j <= MAX_FORKNUM; j++)
+ {
+ /*
+ * Assign InvalidBlockNumber to a block if a relation
+ * fork does not exist, so that we can skip it later
+ * when dropping the relation buffers.
+ */
+ if (!smgrexists(smgr_reln[i], j))
+ {
+ block[i][j] = InvalidBlockNumber;
+ continue;
+ }
+
+ /* Get the number of blocks for a relation's fork */
+ block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);

Similar to above, how can we assume the smgr_reln array has all non-local
relations? Have we tried the case with a mix of temp and non-temp
relations?

In this code, I am slightly worried about the additional cost of each time
checking smgrexists. Consider a case where there are many relations and
only one or a few of them have not cached the information; in such a case
we will pay the cost of smgrexists for many relations without even going
to the optimized path. Can we avoid that in some way or at least reduce
its usage to only when it is required? One idea could be that we first
check if the nblocks information is cached and if so then we don't need to
call smgrnblocks, otherwise, check if it exists. For this, we need an API
like smgrnblocks_cached, something we discussed earlier but preferred the
current API. Do you have any better ideas?

[1] - https://www.postgresql.org/message-id/OSBPR01MB2341882F416A282C3F7D769DEFC70%40OSBPR01MB2341.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 8:30 AM Kyotaro Horiguchi wrote:
>
> At Tue, 22 Dec 2020 02:48:22 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> > From: Amit Kapila
> > > Why would all client backends wait for AccessExclusive lock on this
> > > relation? Say, a client needs a buffer for some other relation and
> > > that might evict this buffer after we release the lock on the
> > > partition. In StrategyGetBuffer, it is important to either have a pin
> > > on the buffer or the buffer header itself must be locked to avoid
> > > getting picked as victim buffer. Am I missing something?
> >
> > Ouch, right. (The year-end business must be making me crazy...)
> >
> > So, there are two choices here:
> >
> > 1) The current patch.
> > 2) Acquire the buffer header spinlock before releasing the buffer
> > mapping lwlock, and eliminate the buffer tag comparison as follows:
> >
> > BufTableLookup();
> > LockBufHdr();
> > LWLockRelease();
> > InvalidateBuffer();
> >
> > I think both are okay. If I must choose either, I kind of prefer 1),
> > because LWLockRelease() could take a longer time to wake up other
> > processes waiting on the lwlock, which is not very good to do while
> > holding a spinlock.
>
> I like, as said before, the current patch.
>

Attached, please find the updated patch with the following modifications:
(a) updated comments at various places, especially to tell why this is a
safe optimization, (b) merged the patch for extending the smgrnblocks and
vacuum optimization patch, (c) made minor cosmetic changes and ran
pgindent, and (d) updated the commit message. BTW, this optimization will
help not only vacuum but also truncate when it is done in the same
transaction in which the relation is created. I would like to see certain
tests to ensure that the value we choose for BUF_DROP_FULL_SCAN_THRESHOLD
is correct. I see that some testing has been done earlier [1] for this
threshold but I am still not able to conclude. The criteria to find the
right threshold should be what is the maximum size of relation to be
truncated above which we don't get benefit with this optimization.

One idea could be to remove the "nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD" part of the check "if (cached &&
nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
uses the optimized path for the tests. Then use the relation size as
NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
buffers as 128MB, 1GB, 20GB, 100GB.

Apart from tests, do let me know if you are happy with the changes in the
patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization
patch.

[1] - https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.

v36-0001-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description: Binary data
RE: [Patch] Optimize dropping of relation buffers using dlist
On Monday, December 21, 2020 10:25 PM, Amit Kapila wrote: > I have started doing minor edits to the patch especially planning to write a > theory why is this optimization safe and here is what I can come up with: > "To > remove all the pages of the specified relation forks from the buffer pool, we > need to scan the entire buffer pool but we can optimize it by finding the > buffers from BufMapping table provided we know the exact size of each fork > of the relation. The exact size is required to ensure that we don't leave any > buffer for the relation being dropped as otherwise the background writer or > checkpointer can lead to a PANIC error while flushing buffers corresponding > to files that don't exist. > > To know the exact size, we rely on the size cached for each fork by us during > recovery which limits the optimization to recovery and on standbys but we > can easily extend it once we have shared cache for relation size. > > In recovery, we cache the value returned by the first lseek(SEEK_END) and > the future writes keeps the cached value up-to-date. See smgrextend. It is > possible that the value of the first lseek is smaller than the actual number > of > existing blocks in the file due to buggy Linux kernels that might not have > accounted for the recent write. But that should be fine because there must > not be any buffers after that file size. > > XXX We would make the extra lseek call for the unoptimized paths but that is > okay because we do it just for the first fork and we anyway have to scan the > entire buffer pool the cost of which is so high that the extra lseek call > won't > make any visible difference. However, we can use InRecovery flag to avoid the > additional cost but that doesn't seem worth it." > > Thoughts? +1 Thank you very much for expanding the comments to carefully explain the reason on why the optimization is safe. 
I was also struggling to explain it completely but your description also covers the possibility of extending the optimization in the future once we have shared cache for rel size. So I like this addition. (Also, it seems that we have concluded to retain the locking mechanism of the existing patch based from the recent email exchanges. Both the traditional path and the optimized path do the rechecking. So there seems to be no problem, I'm definitely fine with it.) Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > Mmm. If that is true, doesn't the unoptimized path also need the > rechecking? Yes, the traditional processing does the recheck after acquiring the buffer header spinlock. Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 22 Dec 2020 02:48:22 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Amit Kapila
> > Why would all client backends wait for AccessExclusive lock on this
> > relation? Say, a client needs a buffer for some other relation and
> > that might evict this buffer after we release the lock on the
> > partition. In StrategyGetBuffer, it is important to either have a pin
> > on the buffer or the buffer header itself must be locked to avoid
> > getting picked as victim buffer. Am I missing something?
>
> Ouch, right. (The year-end business must be making me crazy...)
>
> So, there are two choices here:
>
> 1) The current patch.
> 2) Acquire the buffer header spinlock before releasing the buffer mapping
> lwlock, and eliminate the buffer tag comparison as follows:
>
> BufTableLookup();
> LockBufHdr();
> LWLockRelease();
> InvalidateBuffer();
>
> I think both are okay. If I must choose either, I kind of prefer 1),
> because LWLockRelease() could take a longer time to wake up other
> processes waiting on the lwlock, which is not very good to do while
> holding a spinlock.

I like, as said before, the current patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 8:12 AM Kyotaro Horiguchi wrote: > > At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila > wrote in > > > Why would all client backends wait for AccessExclusive lock on this > > relation? Say, a client needs a buffer for some other relation and > > that might evict this buffer after we release the lock on the > > partition. In StrategyGetBuffer, it is important to either have a pin > > on the buffer or the buffer header itself must be locked to avoid > > getting picked as victim buffer. Am I missing something? > > I think exactly like that. If we acquire the bufHdr lock before > releasing the partition lock, that steal doesn't happen but it doesn't > seem good as a locking protocol. > Right, so let's keep the code as it is but I feel it is better to add some comments explaining the rationale behind this code. -- With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 8:18 AM tsunakawa.ta...@fujitsu.com wrote: > > From: Amit Kapila > > Why would all client backends wait for AccessExclusive lock on this > > relation? Say, a client needs a buffer for some other relation and > > that might evict this buffer after we release the lock on the > > partition. In StrategyGetBuffer, it is important to either have a pin > > on the buffer or the buffer header itself must be locked to avoid > > getting picked as victim buffer. Am I missing something? > > Ouch, right. (The year-end business must be making me crazy...) > > So, there are two choices here: > > 1) The current patch. > 2) Acquire the buffer header spinlock before releasing the buffer mapping > lwlock, and eliminate the buffer tag comparison as follows: > > BufTableLookup(); > LockBufHdr(); > LWLockRelease(); > InvalidateBuffer(); > > I think both are okay. If I must choose either, I kind of prefer 1), because > LWLockRelease() could take longer time to wake up other processes waiting on > the lwlock, which is not very good to do while holding a spinlock. > > I also prefer (1). I will add some comments about the locking protocol in the next version of the patch. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila
> Why would all client backends wait for AccessExclusive lock on this
> relation? Say, a client needs a buffer for some other relation and
> that might evict this buffer after we release the lock on the
> partition. In StrategyGetBuffer, it is important to either have a pin
> on the buffer or the buffer header itself must be locked to avoid
> getting picked as victim buffer. Am I missing something?

Ouch, right. (The year-end business must be making me crazy...)

So, there are two choices here:

1) The current patch.
2) Acquire the buffer header spinlock before releasing the buffer mapping
lwlock, and eliminate the buffer tag comparison as follows:

BufTableLookup();
LockBufHdr();
LWLockRelease();
InvalidateBuffer();

I think both are okay. If I must choose either, I kind of prefer 1),
because LWLockRelease() could take a longer time to wake up other processes
waiting on the lwlock, which is not very good to do while holding a
spinlock.

Regards
Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila wrote in
> On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.ta...@fujitsu.com wrote:
> >
> > From: Amit Kapila
> > > This answers the second part of the question but what about the first
> > > part (We hold a buffer partition lock, and have done a lookup in the
> > > mapping table. Why are we then rechecking the
> > > relfilenode/fork/blocknum?)
> > >
> > > I think we don't need such a check, rather we can have an Assert
> > > corresponding to that if-condition in the patch. I understand it is
> > > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > > of the code.
> >
> > Hmm, you're right. I thought someone else could steal the found buffer
> > and use it for another block because the buffer mapping lwlock is
> > released without pinning the buffer or acquiring the buffer header
> > spinlock.
> >
> > However, in this case (replay of TRUNCATE during recovery), nobody
> > steals the buffer: bgwriter or checkpointer doesn't use a buffer for a
> > new block, and the client backend waits for AccessExclusive lock.
> >

I understood that you are thinking that the rechecking is useless.

> Why would all client backends wait for AccessExclusive lock on this
> relation? Say, a client needs a buffer for some other relation and
> that might evict this buffer after we release the lock on the
> partition. In StrategyGetBuffer, it is important to either have a pin
> on the buffer or the buffer header itself must be locked to avoid
> getting picked as victim buffer. Am I missing something?

I think exactly like that. If we acquire the bufHdr lock before
releasing the partition lock, that steal doesn't happen but it doesn't
seem good as a locking protocol.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 22 Dec 2020 01:42:55 +0000, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Amit Kapila
> > This answers the second part of the question but what about the first
> > part (We hold a buffer partition lock, and have done a lookup in the
> > mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum?)
> >
> > I think we don't need such a check, rather we can have an Assert
> > corresponding to that if-condition in the patch. I understand it is
> > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > of the code.
>
> Hmm, you're right. I thought someone else could steal the found
> buffer and use it for another block because the buffer mapping
> lwlock is released without pinning the buffer or acquiring the
> buffer header spinlock. However, in this case (replay of TRUNCATE
> during recovery), nobody steals the buffer: bgwriter or checkpointer
> doesn't use a buffer for a new block, and the client backend waits
> for AccessExclusive lock.

Mmm. If that is true, doesn't the unoptimized path also need the
rechecking? The AEL doesn't work for a buffer block. No new block can be
allocated for the relation, but still BufferAlloc can steal the block for
other relations since the AEL doesn't work for each buffer block. Am I
still missing something?

> > I have started doing minor edits to the patch especially planning to
> > write a theory why is this optimization safe and here is what I can
> > come up with:
>
> Thank you, that's fluent and easier to understand.

+1

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Amit Kapila
> > This answers the second part of the question but what about the first
> > part (We hold a buffer partition lock, and have done a lookup in the
> > mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum?)
> >
> > I think we don't need such a check, rather we can have an Assert
> > corresponding to that if-condition in the patch. I understand it is
> > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > of the code.
>
> Hmm, you're right. I thought someone else could steal the found buffer and
> use it for another block because the buffer mapping lwlock is released
> without pinning the buffer or acquiring the buffer header spinlock.
>

Okay, I see your point.

> However, in this case (replay of TRUNCATE during recovery), nobody steals
> the buffer: bgwriter or checkpointer doesn't use a buffer for a new block,
> and the client backend waits for AccessExclusive lock.
>

Why would all client backends wait for AccessExclusive lock on this
relation? Say, a client needs a buffer for some other relation and
that might evict this buffer after we release the lock on the
partition. In StrategyGetBuffer, it is important to either have a pin
on the buffer or the buffer header itself must be locked to avoid
getting picked as victim buffer. Am I missing something?

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila
> This answers the second part of the question but what about the first
> part (We hold a buffer partition lock, and have done a lookup in the
> mapping table. Why are we then rechecking the
> relfilenode/fork/blocknum?)
>
> I think we don't need such a check, rather we can have an Assert
> corresponding to that if-condition in the patch. I understand it is
> safe to compare relfilenode/fork/blocknum but it might confuse readers
> of the code.

Hmm, you're right. I thought someone else could steal the found buffer and
use it for another block because the buffer mapping lwlock is released
without pinning the buffer or acquiring the buffer header spinlock.
However, in this case (replay of TRUNCATE during recovery), nobody steals
the buffer: bgwriter or checkpointer doesn't use a buffer for a new block,
and the client backend waits for AccessExclusive lock.

> I have started doing minor edits to the patch especially planning to
> write a theory why is this optimization safe and here is what I can
> come up with:

Thank you, that's fluent and easier to understand.

Regards
Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
On Thu, Nov 19, 2020 at 12:37 PM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Andres Freund
> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum, int nforks,
> > +     BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
> > ...
> > +     /* Check that it is in the buffer pool. If not, do nothing. */
> > +     LWLockAcquire(bufPartitionLock, LW_SHARED);
> > +     buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> > +     bufHdr = GetBufferDescriptor(buf_id);
> > +
> > +     buf_state = LockBufHdr(bufHdr);
> > +
> > +     if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > +         bufHdr->tag.forkNum == forkNum[i] &&
> > +         bufHdr->tag.blockNum >= firstDelBlock[i])
> > +         InvalidateBuffer(bufHdr); /* releases spinlock */
> > +     else
> > +         UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition lock,
> > and have done a lookup in the mapping table. Why are we then rechecking
> > the relfilenode/fork/blocknum? And why are we doing so holding the buffer
> > header lock, which is essentially a spinlock, so should only ever be held
> > for very short portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc.,
> > but there the situation is different: We haven't done a buffer mapping
> > lookup, and we don't hold a partition lock!
>
> That's because the buffer partition lock is released immediately after the
> hash table has been looked up. As an aside, InvalidateBuffer() requires
> the caller to hold the buffer header spinlock and doesn't hold the buffer
> partition lock.
>

This answers the second part of the question but what about the first part
(We hold a buffer partition lock, and have done a lookup in the mapping
table. Why are we then rechecking the relfilenode/fork/blocknum?)

I think we don't need such a check, rather we can have an Assert
corresponding to that if-condition in the patch. I understand it is safe
to compare relfilenode/fork/blocknum but it might confuse readers of the
code.

I have started doing minor edits to the patch especially planning to write
a theory why is this optimization safe and here is what I can come up
with:

"To remove all the pages of the specified relation forks from the buffer
pool, we need to scan the entire buffer pool but we can optimize it by
finding the buffers from BufMapping table provided we know the exact size
of each fork of the relation. The exact size is required to ensure that we
don't leave any buffer for the relation being dropped as otherwise the
background writer or checkpointer can lead to a PANIC error while flushing
buffers corresponding to files that don't exist.

To know the exact size, we rely on the size cached for each fork by us
during recovery which limits the optimization to recovery and on standbys
but we can easily extend it once we have shared cache for relation size.

In recovery, we cache the value returned by the first lseek(SEEK_END) and
the future writes keep the cached value up-to-date. See smgrextend. It is
possible that the value of the first lseek is smaller than the actual
number of existing blocks in the file due to buggy Linux kernels that
might not have accounted for the recent write. But that should be fine
because there must not be any buffers after that file size.

XXX We would make the extra lseek call for the unoptimized paths but that
is okay because we do it just for the first fork and we anyway have to
scan the entire buffer pool the cost of which is so high that the extra
lseek call won't make any visible difference. However, we can use the
InRecovery flag to avoid the additional cost but that doesn't seem worth
it."

Thoughts?

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
Hello Kirk,

I noticed you have pushed a new version of your patch which has some
changes on TRUNCATE on TOAST relation. Although you've done a performance
test for your changed part, I'd like to do a double check for your patch
(hope you don't mind). Below are the updated recovery performance test
results for your new patch. All seems good.

*TOAST relation with PLAIN strategy like integer:

1. Recovery after VACUUM test results (average of 15 times)

shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
--------------------------------------------------------------------------
128M            2.111        1.604         -24%
10G             57.135       1.878         -97%
20G             167.122      1.932         -99%

2. Recovery after TRUNCATE test results (average of 15 times)

shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
--------------------------------------------------------------------------
128M            2.326        1.718         -26%
10G             82.397       1.738         -98%
20G             169.275      1.718         -99%

*TOAST relation with NON-PLAIN strategy like text/varchar:

1. Recovery after VACUUM test results (average of 15 times)

shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
--------------------------------------------------------------------------
128M            3.174        2.493         -21%
10G             72.716       2.246         -97%
20G             163.660      2.474         -98%

2. Recovery after TRUNCATE test results (average of 15 times):
Although it looks like there are some improvements after the patch was
applied, I think that's because of the average calculation. TRUNCATE
results should be similar between master and patched because they all do
a full scan.

shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/patched)
--------------------------------------------------------------------------
128M            4.978        4.958         0%
10G             97.048       88.751        -9%
20G             183.230      173.226       -5%

[Machine spec]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 128G
OS: CentOS 8

[Failover test data]
Total table Size: 600M
Table: 1 tables (1000 rows per table)

[Configure in postgresql.conf]
autovacuum = off
wal_level = replica
max_wal_senders = 5
max_locks_per_transaction = 1

If you have any questions on my test results, please let me know.

Regards
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク > Attached are the final updated patches. Looks good, and the patch remains ready for committer. (Personally, I wanted the code comment to touch upon the TOAST and FSM/VM for the reader, because we couldn't think of those possibilities and took some time to find why the optimization path wasn't taken.) Regards Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Friday, December 11, 2020 10:27 AM, Amit Kapila wrote:
> On Fri, Dec 11, 2020 at 5:54 AM k.jami...@fujitsu.com
> wrote:
> > So should I still not include that information?
> >
>
> I think we can extend your existing comment like: "Otherwise if the size of a
> relation fork is not cached, we proceed to a full scan of the whole buffer pool.
> This can happen if there is no update to a particular fork during recovery."

Attached are the final updated patches. I followed this advice and updated the source code comment a little bit. There are no changes from the previous version except that and the removal of the unnecessary "cached" condition which Tsunakawa-san mentioned.

Below are also the updated recovery performance test results for TRUNCATE. (1000 tables, 1MB per table, results measured in seconds)

| s_b   | Master | Patched | % Reg   |
|-------|--------|---------|---------|
| 128MB | 0.406  | 0.406   | 0%      |
| 512MB | 0.506  | 0.406   | -25%    |
| 1GB   | 0.806  | 0.406   | -99%    |
| 20GB  | 15.224 | 0.406   | -3650%  |
| 100GB | 81.506 | 0.406   | -19975% |

Because of the size of the relation, it is expected to enter the full scan for the 128MB shared_buffers setting. And there was no regression. Similar to previous test results, the recovery time was constant for all shared_buffers settings with the patches applied.

Regards, Kirk Jamison

v35-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description: v35-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
v35-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description: v35-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
v35-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v35-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description: v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
RE: [Patch] Optimize dropping of relation buffers using dlist
From: tsunakawa.ta...@fujitsu.com > What's valuable as a code comment to describe the remaining issue is that the You can attach XXX or FIXME in front of the issue description for easier search. (XXX appears to be used much more often in Postgres.) Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Dec 11, 2020 at 5:54 AM k.jami...@fujitsu.com wrote:
>
> On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> > On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
> > wrote:
> > >
> > > Yes, I have tested that optimization works for index relations.
> > >
> > > I have attached the V34, following the conditions that we use the "cached"
> > > flag for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers()
> > > for consistency.
> > > I added a comment in 0004 on the limitation of the optimization when there
> > > are TOAST relations that use a NON-PLAIN strategy. i.e. The optimization
> > > works if the data types used are integers, OID, bytea, etc. But for
> > > TOAST-able data types like text, the optimization will be skipped and
> > > force a full scan during recovery.
> > >
> >
> > AFAIU, it won't take the optimization path only when we have a TOAST relation
> > but there is no insertion corresponding to it. If so, then we don't need to
> > mention it specifically because there are other similar cases where the
> > optimization won't work, like when during recovery we have to just perform
> > TRUNCATE.
> >
>
> Right, I forgot to add that there should be an update like an insert to the
> TOAST relation for the truncate optimization to work. However, that is only
> limited to TOAST relations with a PLAIN strategy. I have tested with the text
> data type, with inserts before truncate, and it did not enter the
> optimization path.
>

I think you are seeing this because the text datatype allows creating toast storage, but your data is small enough that it doesn't actually get toasted.

> OTOH,
> it worked for a data type like integer.
>

It is not related to any particular datatype; it can happen whenever we don't have any operation on any of the forks after the start of recovery.

> So should I still not include that information?
>

I think we can extend your existing comment like: "Otherwise if the size of a relation fork is not cached, we proceed to a full scan of the whole buffer pool.
This can happen if there is no update to a particular fork during recovery." -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> > AFAIU, it won't take optimization path only when we have TOAST relation but
> > there is no insertion corresponding to it. If so, then we don't need to
> > mention it specifically because there are other similar cases where the
> > optimization won't work like when during recovery we have to just perform
> > TRUNCATE.
> >
>
> Right, I forgot to add that there should be an update like insert to the TOAST
> relation for truncate optimization to work. However, that is only limited to
> TOAST relations with PLAIN strategy. I have tested with text data type, with
> Inserts before truncate, and it did not enter the optimization path. OTOH,
> It worked for data type like integer. So should I still not include that
> information?

What's valuable as a code comment to describe the remaining issue is that the reader can find clues as to whether this is related to the problem he/she has hit, and/or how to solve the issue. I don't think the current comment is so bad in that regard, but it seems better to add:

* The condition of the issue: the table's ancillary storage (index, TOAST table, FSM, VM, etc.) was not updated during recovery. (As an aside, "during recovery" here does not mean "after the last checkpoint" but "from the start of recovery", because the standby experiences many checkpoints (the correct term is restartpoints in case of standby).)

* The cause as a hint to solve the issue: The startup process does not find page modification WAL records. As a result, it doesn't call XLogReadBufferExtended() or the smgrnblocks() called therein, so the relation/fork size is not cached.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
> wrote:
> >
> > Yes, I have tested that optimization works for index relations.
> >
> > I have attached the V34, following the conditions that we use the "cached"
> > flag for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers()
> > for consistency.
> > I added a comment in 0004 on the limitation of the optimization when there
> > are TOAST relations that use a NON-PLAIN strategy. i.e. The optimization
> > works if the data types used are integers, OID, bytea, etc. But for
> > TOAST-able data types like text, the optimization will be skipped and force a
> > full scan during recovery.
> >
>
> AFAIU, it won't take the optimization path only when we have a TOAST relation
> but there is no insertion corresponding to it. If so, then we don't need to
> mention it specifically because there are other similar cases where the
> optimization won't work, like when during recovery we have to just perform
> TRUNCATE.
>

Right, I forgot to add that there should be an update, like an insert, to the TOAST relation for the truncate optimization to work. However, that is only limited to TOAST relations with a PLAIN strategy. I have tested with the text data type, with inserts before truncate, and it did not enter the optimization path. OTOH, it worked for a data type like integer. So should I still not include that information?

Also, I will remove the unnecessary "cached" from the line that Tsunakawa-san mentioned. I will wait for a few more comments before re-uploading what will hopefully be the final version, including the test for truncate.

Regards, Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com wrote: > > Yes, I have tested that optimization works for index relations. > > I have attached the V34, following the conditions that we use "cached" flag > for both DropRelFileNodesBuffers() and DropRelFileNodesBuffers() for > consistency. > I added comment in 0004 the limitation of optimization when there are TOAST > relations that use NON-PLAIN strategy. i.e. The optimization works if the data > types used are integers, OID, bytea, etc. But for TOAST-able data types like > text, > the optimization will be skipped and force a full scan during recovery. > AFAIU, it won't take optimization path only when we have TOAST relation but there is no insertion corresponding to it. If so, then we don't need to mention it specifically because there are other similar cases where the optimization won't work like when during recovery we have to just perform TRUNCATE. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> I added comment in 0004 the limitation of optimization when there are TOAST
> relations that use NON-PLAIN strategy. i.e. The optimization works if the data
> types used are integers, OID, bytea, etc. But for TOAST-able data types like
> text, the optimization will be skipped and force a full scan during recovery.

bytea is a TOAST-able type.

+	/*
+	 * Enter the optimization if the total number of blocks to be
+	 * invalidated for all relations is below the full scan threshold.
+	 */
+	if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

Checking cached here doesn't seem to be necessary, because if cached is false, the control goes to the full scan path as below:

+	if (!cached)
+		goto buffer_full_scan;

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, December 10, 2020 12:27 PM, Amit Kapila wrote:
> On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi
> wrote:
> >
> > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila
> > wrote in
> > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
> > > > Mmm. At least btree doesn't need to call smgrnblocks except at
> > > > expansion, so we cannot get to the optimized path in major cases
> > > > of truncation involving btree (and/or maybe other indexes).
> > > >
> > >
> > > AFAICS, btree insert should call smgrnblocks via
> > > btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks.
> > > Similarly delete should also call smgrnblocks. Can you be bit more
> > > specific related to the btree case you have in mind?
> >
> > Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is
> > called during buffer loading while recovery. So, smgrnblock is called
> > for indexes if any update happens on the heap relation.
> >
>
> Okay, so this means that we can get the benefit of optimization in many cases
> in the Truncate code path as well even if we use 'cached'
> flag? If so, then I would prefer to keep the code consistent for both vacuum
> and truncate recovery code path.

Yes, I have tested that optimization works for index relations.

I have attached the V34, following the conditions that we use the "cached" flag for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() for consistency.

I added a comment in 0004 on the limitation of the optimization when there are TOAST relations that use a NON-PLAIN strategy, i.e. the optimization works if the data types used are integers, OID, bytea, etc., but for TOAST-able data types like text, the optimization will be skipped, forcing a full scan during recovery.
Regards, Kirk Jamison v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch Description: v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch Description: v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch Description: v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch Description: v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Re: [Patch] Optimize dropping of relation buffers using dlist
On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi wrote: > > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila > wrote in > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi > > > Mmm. At least btree doesn't need to call smgrnblocks except at > > > expansion, so we cannot get to the optimized path in major cases of > > > truncation involving btree (and/or maybe other indexes). > > > > > > > AFAICS, btree insert should call smgrnblocks via > > btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks. > > Similarly delete should also call smgrnblocks. Can you be bit more > > specific related to the btree case you have in mind? > > Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is > called during buffer loading while recovery. So, smgrnblock is called > for indexes if any update happens on the heap relation. > Okay, so this means that we can get the benefit of optimization in many cases in the Truncate code path as well even if we use 'cached' flag? If so, then I would prefer to keep the code consistent for both vacuum and truncate recovery code path. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > Oh, sorry. I wrongly looked to non-recovery path. smgrnblocks is > called during buffer loading while recovery. So, smgrnblock is called > for indexes if any update happens on the heap relation. I misunderstood that you said there's no problem with the TOAST index because TRUNCATE creates the meta page, resulting in the caching of the page and size of the relation. Anyway, I'm relieved the concern disappeared. Then, I'd like to hear your vote in my previous mail... Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila wrote in
> On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
> wrote:
> >
> > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila
> > wrote in
> > > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com
> > > wrote:
> > > >
> > > > From: Jamison, Kirk/ジャミソン カーク
> > > > > Because one of the rel's cached value was false, it forced the
> > > > > full-scan path for TRUNCATE.
> > > > > Is there a possible workaround for this?
> > > >
> > > > Hmm, the other two relfilenodes are for the TOAST table and index of
> > > > the target table. I think the INSERT didn't access those TOAST
> > > > relfilenodes because the inserted data was stored in the main storage.
> > > > But TRUNCATE always truncates all the three relfilenodes. So, the
> > > > standby had not opened the relfilenode for the TOAST stuff or cached
> > > > its size when replaying the TRUNCATE.
> > > >
> > > > I'm afraid this is more common than we can ignore and accept the slow
> > > > traditional path, but I don't think of a good idea to use the cached
> > > > flag.
> > > >
> > >
> > > I also can't think of a way to use an optimized path for such cases
> > > but I don't agree with your comment on if it is common enough that we
> > > leave this optimization entirely for the truncate path.
> >
> > Mmm. At least btree doesn't need to call smgrnblocks except at
> > expansion, so we cannot get to the optimized path in major cases
> > of truncation involving btree (and/or maybe other indexes).
> >
>
> AFAICS, btree insert should call smgrnblocks via
> btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks.
> Similarly delete should also call smgrnblocks. Can you be bit more
> specific related to the btree case you have in mind?

Oh, sorry. I was wrongly looking at the non-recovery path. smgrnblocks is called during buffer loading during recovery. So, smgrnblocks is called for indexes if any update happens on the heap relation.
regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi wrote: > > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila > wrote in > > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com > > wrote: > > > > > > From: Jamison, Kirk/ジャミソン カーク > > > > Because one of the rel's cached value was false, it forced the > > > > full-scan path for TRUNCATE. > > > > Is there a possible workaround for this? > > > > > > Hmm, the other two relfilenodes are for the TOAST table and index of the > > > target table. I think the INSERT didn't access those TOAST relfilenodes > > > because the inserted data was stored in the main storage. But TRUNCATE > > > always truncates all the three relfilenodes. So, the standby had not > > > opened the relfilenode for the TOAST stuff or cached its size when > > > replaying the TRUNCATE. > > > > > > I'm afraid this is more common than we can ignore and accept the slow > > > traditional path, but I don't think of a good idea to use the cached flag. > > > > > > > I also can't think of a way to use an optimized path for such cases > > but I don't agree with your comment on if it is common enough that we > > leave this optimization entirely for the truncate path. > > Mmm. At least btree doesn't need to call smgrnblocks except at > expansion, so we cannot get to the optimized path in major cases of > truncation involving btree (and/or maybe other indexes). > AFAICS, btree insert should call smgrnblocks via btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks. Similarly delete should also call smgrnblocks. Can you be bit more specific related to the btree case you have in mind? -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wednesday, December 9, 2020 10:58 AM, Tsunakawa, Takayuki wrote: > From: Kyotaro Horiguchi > > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila > > wrote in > > I also can't think of a way to use an optimized path for such cases > > > but I don't agree with your comment on if it is common enough that > > > we leave this optimization entirely for the truncate path. > > > > An ugly way to cope with it would be to let other smgr functions > > manage the cached value, for example, by calling smgrnblocks while > > InRecovery. Or letting smgr remember the maximum block number ever > > accessed. But we cannot fully rely on that since smgr can be closed > > midst of a session and smgr doesn't offer such persistence. In the > > first place smgr doesn't seem to be the place to store such persistent > > information. > > Yeah, considering the future evolution of this patch to operations during > normal running, I don't think that would be a good fit, either. > > Then, the as we're currently targeting just recovery, the options we can take > are below. Which would vote for? My choice would be (3) > (2) > (1). > > > (1) > Use the cached flag in both VACUUM (0003) and TRUNCATE (0004). > This brings the most safety and code consistency. > But this would not benefit from optimization for TRUNCATE in unexpectedly > many cases -- when TOAST storage exists but it's not written, or FSM/VM is > not updated after checkpoint. > > > (2) > Use the cached flag in VACUUM (0003), but use InRecovery instead of the > cached flag in TRUNCATE (0004). > This benefits from the optimization in all cases. > But this lacks code consistency. > You may be afraid of safety if the startup process smgrclose()s the relation > after the shared buffer flushing hits disk full. However, startup process > doesn't smgrclose(), so it should be safe. 
> Just in case the startup process
> smgrclose()s, the worst consequence is PANIC shutdown after repeated
> failure of checkpoints due to lingering orphaned dirty shared buffers. Accept
> it as Thomas-san's devil's suggestion.
>
>
> (3)
> Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
> This benefits from the optimization in all cases.
> The code is consistent and smaller.
> As for the safety, this is the same as (2), but it applies to VACUUM as well.

If we want code consistency, then we'd go with either (1) or (3). And if we want to take the benefits of optimization for both DropRelFileNodeBuffers and DropRelFileNodesAllBuffers, then I'd choose (3). However, if the reviewers and committer want to make use of the "cached" flag, then we can live with the "cached" value in place there, even if it's not common to get the optimization for the TRUNCATE path. So only VACUUM would get the most benefit. My vote is also (3), then (2), and (1).

Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi
> At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila wrote in
> > I also can't think of a way to use an optimized path for such cases
> > but I don't agree with your comment on if it is common enough that we
> > leave this optimization entirely for the truncate path.
>
> An ugly way to cope with it would be to let other smgr functions
> manage the cached value, for example, by calling smgrnblocks while
> InRecovery. Or letting smgr remember the maximum block number ever
> accessed. But we cannot fully rely on that since smgr can be closed
> midst of a session and smgr doesn't offer such persistence. In the
> first place smgr doesn't seem to be the place to store such persistent
> information.

Yeah, considering the future evolution of this patch to operations during normal running, I don't think that would be a good fit, either.

Then, as we're currently targeting just recovery, the options we can take are below. Which would you vote for? My choice would be (3) > (2) > (1).


(1)
Use the cached flag in both VACUUM (0003) and TRUNCATE (0004).
This brings the most safety and code consistency.
But this would not benefit from optimization for TRUNCATE in unexpectedly many cases -- when TOAST storage exists but it's not written, or FSM/VM is not updated after checkpoint.


(2)
Use the cached flag in VACUUM (0003), but use InRecovery instead of the cached flag in TRUNCATE (0004).
This benefits from the optimization in all cases.
But this lacks code consistency.
You may be afraid of safety if the startup process smgrclose()s the relation after the shared buffer flushing hits disk full. However, the startup process doesn't smgrclose(), so it should be safe. Just in case the startup process smgrclose()s, the worst consequence is PANIC shutdown after repeated failure of checkpoints due to lingering orphaned dirty shared buffers. Accept it as Thomas-san's devil's suggestion.


(3)
Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
This benefits from the optimization in all cases. The code is consistent and smaller. As for the safety, this is the same as (2), but it applies to VACUUM as well. Regards Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila wrote in
> On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com
> wrote:
> >
> > From: Jamison, Kirk/ジャミソン カーク
> > > Because one of the rel's cached value was false, it forced the
> > > full-scan path for TRUNCATE.
> > > Is there a possible workaround for this?
> >
> > Hmm, the other two relfilenodes are for the TOAST table and index of the
> > target table. I think the INSERT didn't access those TOAST relfilenodes
> > because the inserted data was stored in the main storage. But TRUNCATE
> > always truncates all the three relfilenodes. So, the standby had not
> > opened the relfilenode for the TOAST stuff or cached its size when
> > replaying the TRUNCATE.
> >
> > I'm afraid this is more common than we can ignore and accept the slow
> > traditional path, but I don't think of a good idea to use the cached flag.
> >
>
> I also can't think of a way to use an optimized path for such cases
> but I don't agree with your comment on if it is common enough that we
> leave this optimization entirely for the truncate path.

Mmm. At least btree doesn't need to call smgrnblocks except at expansion, so we cannot get to the optimized path in major cases of truncation involving btree (and/or maybe other indexes). TOAST relations are not accessed until we insert/update/retrieve the values in them.

An ugly way to cope with it would be to let other smgr functions manage the cached value, for example, by calling smgrnblocks while InRecovery. Or letting smgr remember the maximum block number ever accessed. But we cannot fully rely on that since smgr can be closed in the midst of a session and smgr doesn't offer such persistence. In the first place, smgr doesn't seem to be the place to store such persistent information.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Jamison, Kirk/ジャミソン カーク
> > Because one of the rel's cached value was false, it forced the
> > full-scan path for TRUNCATE.
> > Is there a possible workaround for this?
>
> Hmm, the other two relfilenodes are for the TOAST table and index of the
> target table. I think the INSERT didn't access those TOAST relfilenodes
> because the inserted data was stored in the main storage. But TRUNCATE
> always truncates all the three relfilenodes. So, the standby had not opened
> the relfilenode for the TOAST stuff or cached its size when replaying the
> TRUNCATE.
>
> I'm afraid this is more common than we can ignore and accept the slow
> traditional path, but I don't think of a good idea to use the cached flag.
>

I also can't think of a way to use an optimized path for such cases, but I don't agree with your comment that it is common enough that we should leave this optimization out entirely for the truncate path.

-- 
With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> Because one of the rel's cached value was false, it forced the
> full-scan path for TRUNCATE.
> Is there a possible workaround for this?

Hmm, the other two relfilenodes are for the TOAST table and index of the target table. I think the INSERT didn't access those TOAST relfilenodes because the inserted data was stored in the main storage. But TRUNCATE always truncates all three relfilenodes. So, the standby had not opened the relfilenode for the TOAST stuff or cached its size when replaying the TRUNCATE.

I'm afraid this case is too common for us to just ignore it and accept the slow traditional path, but I can't think of a good way to use the cached flag.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Tuesday, December 8, 2020 2:35 PM, Amit Kapila wrote: > On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi > wrote: > > > > At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila > > wrote in > > > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi > > > wrote: > > > > We drop > > > > buffers for the old relfilenode on truncation anyway. > > > > > > > > What I did is: > > > > > > > > a: Create a physical replication pair. > > > > b: On the master, create a table. (without explicitly starting a > > > > tx) > > > > c: On the master, insert a tuple into the table. > > > > d: On the master truncate the table. > > > > > > > > On the standby, smgrnblocks is called for the old relfilenode of > > > > the table at c, then the same function is called for the same > > > > relfilenode at d and the function takes the cached path. > > > > > > > > > > This is on the lines I have tried for recovery. So, it seems we are > > > in agreement that we can use the 'cached' flag in > > > DropRelFileNodesAllBuffers and it will take the optimized path in > > > many such cases, right? > > > > > > Mmm. There seems to be a misunderstanding.. What I opposed to is > > referring only to InRecovery and ignoring the value of "cached". > > > > Okay, I think it was Kirk-San who proposed to use InRecovery and ignoring > the value of "cached" based on the theory that even if Insert (or other DMLs) > are done before Truncate, it won't use an optimized path and I don't agree > with the same. So, I did a small test to check the same and found that it > should use the optimized path and the same is true for the experiment done > by you. I am not sure why Kirk-San is seeing something different? > > > The remaining issue is we don't get to the optimized path when a > > standby makes the first call to smgrnblocks() when truncating a > > relation. Still we can get to the optimized path as far as any > > update(+insert) or select is performed earlier on the relation so I > > think it doesn't matter so match. 
> >
> > +1.

My question/proposal before was to either use InRecovery, or completely drop the smgrnblocks "cached" flag. But that comes from the results of my investigation below, where I used "cached" in DropRelFileNodesAllBuffers(). The optimization path was skipped because one of the rels' "cached" values was "false".

Test case (shared_buffers = 1GB):
0. Set up physical replication between master and standby.
1. Create 1 table.
2. Insert data (1MB) into the table. 16385 is the relnode for the insert
   (both master and standby).
3. Pause WAL replay on the standby.
4. TRUNCATE the table on the primary. nrels = 3. relNodes 16389, 16388, 16385.
5. Stop the primary.
6. Promote the standby and resume WAL recovery.
   nrels = 3
   1st rel's check for optimization: "cached" is TRUE. relNode = 16389.
   2nd rel's check for optimization: "cached" was returned as FALSE by
   smgrnblocks. relNode = 16388.
   Since one of the rels' "cached" values is FALSE, the optimization check for
   the 3rd relation and the whole optimization itself are skipped. Go to the
   full-scan path in DropRelFileNodesAllBuffers(). Then smgrclose for relNodes
   16389, 16388, 16385.

Because one of the rels' cached values was false, it forced the full-scan path for TRUNCATE. Is there a possible workaround for this?

Regards, Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi wrote: > > At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila > wrote in > > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi > > wrote: > > > We drop > > > buffers for the old relfilenode on truncation anyway. > > > > > > What I did is: > > > > > > a: Create a physical replication pair. > > > b: On the master, create a table. (without explicitly starting a tx) > > > c: On the master, insert a tuple into the table. > > > d: On the master truncate the table. > > > > > > On the standby, smgrnblocks is called for the old relfilenode of the > > > table at c, then the same function is called for the same relfilenode > > > at d and the function takes the cached path. > > > > > > > This is on the lines I have tried for recovery. So, it seems we are in > > agreement that we can use the 'cached' flag in > > DropRelFileNodesAllBuffers and it will take the optimized path in many > > such cases, right? > > > Mmm. There seems to be a misunderstanding.. What I opposed to is > referring only to InRecovery and ignoring the value of "cached". > Okay, I think it was Kirk-San who proposed to use InRecovery and ignoring the value of "cached" based on the theory that even if Insert (or other DMLs) are done before Truncate, it won't use an optimized path and I don't agree with the same. So, I did a small test to check the same and found that it should use the optimized path and the same is true for the experiment done by you. I am not sure why Kirk-San is seeing something different? > The remaining issue is we don't get to the optimized path when a > standby makes the first call to smgrnblocks() when truncating a > relation. Still we can get to the optimized path as far as any > update(+insert) or select is performed earlier on the relation so I > think it doesn't matter so match. > +1. With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila wrote in
> On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi wrote:
> > We drop buffers for the old relfilenode on truncation anyway.
> >
> > What I did is:
> >
> > a: Create a physical replication pair.
> > b: On the master, create a table. (without explicitly starting a tx)
> > c: On the master, insert a tuple into the table.
> > d: On the master, truncate the table.
> >
> > On the standby, smgrnblocks is called for the old relfilenode of the
> > table at c, then the same function is called for the same relfilenode
> > at d and the function takes the cached path.
>
> This is on the lines I have tried for recovery. So, it seems we are in
> agreement that we can use the 'cached' flag in
> DropRelFileNodesAllBuffers and it will take the optimized path in many
> such cases, right?

Mmm. There seems to be a misunderstanding. What I opposed is referring only to InRecovery and ignoring the value of "cached". The remaining issue is that we don't get to the optimized path when a standby makes its first call to smgrnblocks() while truncating a relation. Still, we can get to the optimized path as long as any update(+insert) or select is performed on the relation earlier, so I think it doesn't matter so much. But I'm not sure what others think.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi wrote: > > I'm out of it more than usual.. > > At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila > > wrote in > > > On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com > > > wrote: > > > > > > > > On Friday, December 4, 2020 8:27 PM, Amit Kapila > > > > wrote: > > > > Hi, > > > > I have reported before that it is not always the case that the "cached" > > > > flag of > > > > srnblocks() return true. So when I checked the truncate test case used > > > > in my > > > > patch, it does not enter the optimization path despite doing INSERT > > > > before > > > > truncation of table. > > > > The reason for that is because in TRUNCATE, a new RelFileNode is > > > > assigned > > > > to the relation when creating a new file. In recovery, > > > > XLogReadBufferExtended() > > > > always opens the RelFileNode and calls smgrnblocks() for that > > > > RelFileNode for the > > > > first time. And for recovery processing, different RelFileNodes are > > > > used for the > > > > INSERTs to the table and TRUNCATE to the same table. > > > > > > > > > > Hmm, how is it possible if Insert is done before Truncate? The insert > > > should happen in old RelFileNode only. I have verified by adding a > > > break-in (while (1), so that it stops there) heap_xlog_insert and > > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. > > > How have you verified what you are saying? > > It's irrelvant that the insert happens on the old relfilenode. > I think it is relevant because it will allow the 'blocks' value to be cached. > We drop > buffers for the old relfilenode on truncation anyway. > > What I did is: > > a: Create a physical replication pair. > b: On the master, create a table. (without explicitly starting a tx) > c: On the master, insert a tuple into the table. > d: On the master truncate the table. 
> > On the standby, smgrnblocks is called for the old relfilenode of the > table at c, then the same function is called for the same relfilenode > at d and the function takes the cached path. > This is on the lines I have tried for recovery. So, it seems we are in agreement that we can use the 'cached' flag in DropRelFileNodesAllBuffers and it will take the optimized path in many such cases, right? -- With Regards, Amit Kapila.
Re: [Patch] Optimize dropping of relation buffers using dlist
I'm out of it more than usual..

At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi wrote in
> At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila wrote in
> > On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com wrote:
> > >
> > > On Friday, December 4, 2020 8:27 PM, Amit Kapila wrote:
> > > Hi,
> > > I have reported before that it is not always the case that the "cached"
> > > flag of smgrnblocks() returns true. So when I checked the truncate test
> > > case used in my patch, it does not enter the optimization path despite
> > > doing INSERT before truncation of the table.
> > > The reason for that is that in TRUNCATE, a new RelFileNode is assigned
> > > to the relation when creating a new file. In recovery,
> > > XLogReadBufferExtended() always opens the RelFileNode and calls
> > > smgrnblocks() for that RelFileNode for the first time. And for recovery
> > > processing, different RelFileNodes are used for the INSERTs to the table
> > > and TRUNCATE to the same table.
> >
> > Hmm, how is it possible if Insert is done before Truncate? The insert
> > should happen in the old RelFileNode only. I have verified by adding a
> > break-in (while (1), so that it stops there) in heap_xlog_insert and
> > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > How have you verified what you are saying?

It's irrelevant that the insert happens on the old relfilenode. We drop buffers for the old relfilenode on truncation anyway.

What I did is:

a: Create a physical replication pair.
b: On the master, create a table. (without explicitly starting a tx)
c: On the master, insert a tuple into the table.
d: On the master, truncate the table.

On the standby, smgrnblocks is called for the old relfilenode of the table at c, then the same function is called for the same relfilenode at d and the function takes the cached path.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Tue, Dec 8, 2020 at 6:23 AM Kyotaro Horiguchi wrote:
>
> At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi wrote in
> > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila wrote in
> > > Hmm, how is it possible if Insert is done before Truncate? The insert
> > > should happen in old RelFileNode only. I have verified by adding a
> > > break-in (while (1), so that it stops there) heap_xlog_insert and
> > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > > How have you verified what you are saying?
> >
> > You might be thinking of an in-transaction sequence of
> > Insert-truncate. What *I* mentioned before is truncation of a relation
> > that smgrnblocks() has already been called for. The most common way
> > to make it happen was INSERTs *before* the truncating transaction
> > starts.

What I have tried is Insert and Truncate in separate transactions like below:

postgres=# insert into mytbl values(1);
INSERT 0 1
postgres=# truncate mytbl;
TRUNCATE TABLE

After the above, I manually killed the server, and then during recovery we have called heap_xlog_insert() and DropRelFileNodesAllBuffers(), and at both places the RelFileNode is the same; I don't see any reason for it to be different.

> > It may be a SELECT on a hot-standby. Sorry for the confusing
> > expression.
>
> And, to make sure, it is a bit off from the point of the discussion as
> I noted. I just meant that the proposition that "smgrnblocks() always
> returns false for "cached" when it is called in
> DropRelFileNodesAllBuffers()" doesn't always hold.

Right, I feel in some cases the 'cached' flag won't be true, for example if we had done a Checkpoint after the Insert in the above case (say when the only WAL to replay during recovery is that of the Truncate), but I think that should be fine. What do you think?

--
With Regards,
Amit Kapila.
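The "cached" contract being debated here can be modeled in a few lines. This is a toy sketch with hypothetical names (ToySMgr, toy_smgrnblocks), not the real smgr code: the first call for a relation during recovery has to measure the file (lseek is simulated below) and reports cached=false, while later calls return the remembered value with cached=true. That is why, when the only WAL replayed for a relation is the truncate itself, the drop sees cached=false.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a relation's smgr state.  nblocks_on_disk plays the role
 * of what lseek(SEEK_END) would report; the other two fields model the
 * cached size.  All names are illustrative, not PostgreSQL's.
 */
typedef struct ToySMgr
{
    long    nblocks_on_disk;
    long    cached_nblocks;
    bool    have_cached;
} ToySMgr;

/*
 * Returns the relation size.  *cached tells whether the value was
 * already cached before this call (i.e. no "lseek" was needed), which
 * mirrors the semantics of the "cached" out-parameter discussed above.
 */
static long
toy_smgrnblocks(ToySMgr *reln, bool *cached)
{
    if (reln->have_cached)
    {
        *cached = true;
        return reln->cached_nblocks;
    }
    reln->cached_nblocks = reln->nblocks_on_disk;   /* simulated lseek */
    reln->have_cached = true;
    *cached = false;
    return reln->cached_nblocks;
}
```

Replaying an insert first primes the cache, so the later truncate's call sees cached=true; with nothing replayed beforehand (the checkpoint case above), the truncate's own call is the first one and sees cached=false.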
Re: [Patch] Optimize dropping of relation buffers using dlist
At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi wrote in
> At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila wrote in
> > Hmm, how is it possible if Insert is done before Truncate? The insert
> > should happen in old RelFileNode only. I have verified by adding a
> > break-in (while (1), so that it stops there) heap_xlog_insert and
> > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > How have you verified what you are saying?
>
> You might be thinking of an in-transaction sequence of
> Insert-truncate. What *I* mentioned before is truncation of a relation
> that smgrnblocks() has already been called for. The most common way
> to make it happen was INSERTs *before* the truncating transaction
> starts. It may be a SELECT on a hot-standby. Sorry for the confusing
> expression.

And, to make sure, it is a bit off from the point of the discussion as I noted. I just meant that the proposition that "smgrnblocks() always returns false for "cached" when it is called in DropRelFileNodesAllBuffers()" doesn't always hold.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila wrote in
> On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com wrote:
> >
> > On Friday, December 4, 2020 8:27 PM, Amit Kapila wrote:
> > Hi,
> > I have reported before that it is not always the case that the "cached"
> > flag of smgrnblocks() returns true. So when I checked the truncate test
> > case used in my patch, it does not enter the optimization path despite
> > doing INSERT before truncation of the table.
> > The reason for that is that in TRUNCATE, a new RelFileNode is assigned
> > to the relation when creating a new file. In recovery,
> > XLogReadBufferExtended() always opens the RelFileNode and calls
> > smgrnblocks() for that RelFileNode for the first time. And for recovery
> > processing, different RelFileNodes are used for the INSERTs to the table
> > and TRUNCATE to the same table.
>
> Hmm, how is it possible if Insert is done before Truncate? The insert
> should happen in old RelFileNode only. I have verified by adding a
> break-in (while (1), so that it stops there) heap_xlog_insert and
> DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> How have you verified what you are saying?

You might be thinking of an in-transaction sequence of Insert-truncate. What *I* mentioned before is truncation of a relation that smgrnblocks() has already been called for. The most common way to make that happen was INSERTs *before* the truncating transaction starts. It may be a SELECT on a hot standby. Sorry for the confusing expression.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com wrote: > > On Friday, December 4, 2020 8:27 PM, Amit Kapila > wrote: > > On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi > > wrote: > > > > > > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" > > > wrote in > > > > > From: Kyotaro Horiguchi Hello, Kirk. > > > > > Thank you for the new version. > > > > > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > > > I'm updating the patches addressing those. > > > > > > > > > + if (!smgrexists(rels[i], j)) > > > > > + continue; > > > > > + > > > > > + /* Get the number of blocks for a relation's fork > > > > > */ > > > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > > > NULL); > > > > > > > > > > If we see a fork which its size is not cached we must give up this > > > > > optimization for all target relations. > > > > > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and > > > > use InRecovery when deciding for optimization because of the following > > reasons: > > > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to > > > > relation page contents. So in DropRelFileNodeBuffers(), > > > > XLogReadBufferExtended() is called during VACUUM replay because > > VACUUM changes the page content. > > > > OTOH, TRUNCATE doesn't change the relation content, it just > > > > truncates relation pages without changing the page contents. So > > > > XLogReadBufferExtended() is not called, and the "cached" flag will > > > > always return false. I tested with "cached" flags before, and this > > > > > > A bit different from the point, but if some tuples have been inserted > > > to the truncated table, XLogReadBufferExtended() is called for the > > > table and the length is cached. > > > > > > > always return false, at least in DropRelFileNodesAllBuffers. Due to > > > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers(). 
> > > > However, I think we can still rely on smgrnblocks to get the file > > > > size as long as we're InRecovery. That cached nblocks is still > > > > guaranteed > > to be the maximum in the shared buffer. > > > > Thoughts? > > > > > > That means that we always think as if smgrnblocks returns "cached" (or > > > "safe") value during recovery, which is out of our current consensus. > > > If we go on that side, we don't need to consult the "cached" returned > > > from smgrnblocks at all and it's enough to see only InRecovery. > > > > > > I got confused.. > > > > > > We are relying on the "fact" that the first lseek() call of a > > > (startup) process tells the truth. We added an assertion so that we > > > make sure that the cached value won't be cleared during recovery. A > > > possible remaining danger would be closing of an smgr object of a live > > > relation just after a file extension failure. I think we are thinking > > > that that doesn't happen during recovery. Although it seems to me > > > true, I'm not confident. > > > > > > > Yeah, I also think it might not be worth depending upon whether smgr close > > has been done before or not. I feel the current idea of using 'cached' > > parameter is relatively solid and we should rely on that. > > Also, which means that in DropRelFileNodesAllBuffers() we should rely on > > the same, I think doing things differently in this regard will lead to > > confusion. I > > agree in some cases we might not get benefits but it is more important to be > > correct and keep the code consistent to avoid introducing bugs now or in the > > future. > > > Hi, > I have reported before that it is not always the case that the "cached" flag > of > srnblocks() return true. So when I checked the truncate test case used in my > patch, it does not enter the optimization path despite doing INSERT before > truncation of table. > The reason for that is because in TRUNCATE, a new RelFileNode is assigned > to the relation when creating a new file. 
In recovery, > XLogReadBufferExtended() > always opens the RelFileNode and calls smgrnblocks() for that RelFileNode for > the > first time. And for recovery processing, different RelFileNodes are used for > the > INSERTs to the table and TRUNCATE to the same table. > Hmm, how is it possible if Insert is done before Truncate? The insert should happen in old RelFileNode only. I have verified by adding a break-in (while (1), so that it stops there) heap_xlog_insert and DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. How have you verified what you are saying? -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Friday, December 4, 2020 8:27 PM, Amit Kapila wrote: > On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi > wrote: > > > > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" > > wrote in > > > > From: Kyotaro Horiguchi Hello, Kirk. > > > > Thank you for the new version. > > > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > > I'm updating the patches addressing those. > > > > > > > + if (!smgrexists(rels[i], j)) > > > > + continue; > > > > + > > > > + /* Get the number of blocks for a relation's fork */ > > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > > NULL); > > > > > > > > If we see a fork which its size is not cached we must give up this > > > > optimization for all target relations. > > > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and > > > use InRecovery when deciding for optimization because of the following > reasons: > > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to > > > relation page contents. So in DropRelFileNodeBuffers(), > > > XLogReadBufferExtended() is called during VACUUM replay because > VACUUM changes the page content. > > > OTOH, TRUNCATE doesn't change the relation content, it just > > > truncates relation pages without changing the page contents. So > > > XLogReadBufferExtended() is not called, and the "cached" flag will > > > always return false. I tested with "cached" flags before, and this > > > > A bit different from the point, but if some tuples have been inserted > > to the truncated table, XLogReadBufferExtended() is called for the > > table and the length is cached. > > > > > always return false, at least in DropRelFileNodesAllBuffers. Due to > > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers(). > > > However, I think we can still rely on smgrnblocks to get the file > > > size as long as we're InRecovery. That cached nblocks is still guaranteed > to be the maximum in the shared buffer. > > > Thoughts? 
> > That means that we always think as if smgrnblocks returns the "cached" (or
> > "safe") value during recovery, which is out of our current consensus.
> > If we go on that side, we don't need to consult the "cached" returned
> > from smgrnblocks at all and it's enough to see only InRecovery.
> >
> > I got confused..
> >
> > We are relying on the "fact" that the first lseek() call of a
> > (startup) process tells the truth. We added an assertion so that we
> > make sure that the cached value won't be cleared during recovery. A
> > possible remaining danger would be closing of an smgr object of a live
> > relation just after a file extension failure. I think we are thinking
> > that that doesn't happen during recovery. Although it seems to me
> > true, I'm not confident.
>
> Yeah, I also think it might not be worth depending upon whether smgr close
> has been done before or not. I feel the current idea of using the 'cached'
> parameter is relatively solid and we should rely on that.
> Also, that means that in DropRelFileNodesAllBuffers() we should rely on
> the same; I think doing things differently in this regard will lead to
> confusion. I agree in some cases we might not get benefits, but it is more
> important to be correct and keep the code consistent to avoid introducing
> bugs now or in the future.

Hi,
I have reported before that it is not always the case that the "cached" flag of smgrnblocks() returns true. So when I checked the truncate test case used in my patch, it does not enter the optimization path despite doing INSERT before truncation of the table.
The reason is that in TRUNCATE, a new RelFileNode is assigned to the relation when creating a new file. In recovery, XLogReadBufferExtended() always opens the RelFileNode and calls smgrnblocks() for that RelFileNode for the first time. And for recovery processing, different RelFileNodes are used for the INSERTs to the table and the TRUNCATE of the same table.
As we cannot use the "cached" flag for both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() based on the above, I am thinking that if we want consistency, correctness, and to still make use of the optimization, we can completely drop the "cached" flag parameter in smgrnblocks and use InRecovery. Tsunakawa-san mentioned in [1] that it is safe because smgrclose is not called by the startup process in recovery, and shared-inval messages are not sent to the startup process.
Otherwise, we use the current patch form as it is: "cached" in DropRelFileNodeBuffers() and InRecovery for DropRelFileNodesAllBuffers(). However, that does not seem to be what is wanted in this thread.
Thoughts?

Regards,
Kirk Jamison

[1] https://www.postgresql.org/message-id/TYAPR01MB2990B42570A5FAC349EE983AFEF40%40TYAPR01MB2990.jpnprd01.prod.outlook.com
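The two alternatives Kirk lays out can be written down as predicates. This is a simplified sketch, not the patch: the function names are invented, and "threshold" stands in for BUF_DROP_FULL_SCAN_THRESHOLD. The two guards differ exactly in the contested case from the earlier test: in recovery, with at least one relation size not cached.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Option A (hypothetical): drop the "cached" flag entirely and gate the
 * optimization on being in recovery plus the block-count threshold.
 */
static bool
optimize_by_inrecovery(bool in_recovery, bool all_sizes_cached,
                       long total_blocks, long threshold)
{
    (void) all_sizes_cached;    /* deliberately ignored in this option */
    return in_recovery && total_blocks < threshold;
}

/*
 * Option B (hypothetical): additionally require that every relation's
 * size came back cached from smgrnblocks().
 */
static bool
optimize_by_cached(bool in_recovery, bool all_sizes_cached,
                   long total_blocks, long threshold)
{
    return in_recovery && all_sizes_cached && total_blocks < threshold;
}
```

When a size is not cached (second argument false), option A still takes the optimized path while option B falls back to the full scan, which is the trade-off the thread is weighing.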
Re: [Patch] Optimize dropping of relation buffers using dlist
On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi wrote: > > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" > wrote in > > > From: Kyotaro Horiguchi > > > Hello, Kirk. Thank you for the new version. > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > I'm updating the patches addressing those. > > > > > + if (!smgrexists(rels[i], j)) > > > + continue; > > > + > > > + /* Get the number of blocks for a relation's fork */ > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > NULL); > > > > > > If we see a fork which its size is not cached we must give up this > > > optimization > > > for all target relations. > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and use > > InRecovery > > when deciding for optimization because of the following reasons: > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation > > page > > contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called > > during VACUUM replay because VACUUM changes the page content. > > OTOH, TRUNCATE doesn't change the relation content, it just truncates > > relation pages > > without changing the page contents. So XLogReadBufferExtended() is not > > called, and > > the "cached" flag will always return false. I tested with "cached" flags > > before, and this > > A bit different from the point, but if some tuples have been inserted > to the truncated table, XLogReadBufferExtended() is called for the > table and the length is cached. > > > always return false, at least in DropRelFileNodesAllBuffers. Due to this, > > we cannot use > > the cached flag in DropRelFileNodesAllBuffers(). However, I think we can > > still rely on > > smgrnblocks to get the file size as long as we're InRecovery. That cached > > nblocks is still > > guaranteed to be the maximum in the shared buffer. > > Thoughts? 
> > That means that we always think as if smgrnblocks returns "cached" (or > "safe") value during recovery, which is out of our current > consensus. If we go on that side, we don't need to consult the > "cached" returned from smgrnblocks at all and it's enough to see only > InRecovery. > > I got confused.. > > We are relying on the "fact" that the first lseek() call of a > (startup) process tells the truth. We added an assertion so that we > make sure that the cached value won't be cleared during recovery. A > possible remaining danger would be closing of an smgr object of a live > relation just after a file extension failure. I think we are thinking > that that doesn't happen during recovery. Although it seems to me > true, I'm not confident. > Yeah, I also think it might not be worth depending upon whether smgr close has been done before or not. I feel the current idea of using 'cached' parameter is relatively solid and we should rely on that. Also, which means that in DropRelFileNodesAllBuffers() we should rely on the same, I think doing things differently in this regard will lead to confusion. I agree in some cases we might not get benefits but it is more important to be correct and keep the code consistent to avoid introducing bugs now or in the future. -- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
On Friday, December 4, 2020 12:42 PM, Tang, Haiying wrote:
> Hello, Kirk
>
> Thanks for providing the new patches.
> I did the recovery performance test on them, the results look good. I'd like to
> share them with you and everyone else.
> (I also record VACUUM and TRUNCATE execution time on master/primary in
> case you want to have a look.)

Hi, Tang. Thank you very much for verifying the performance using the latest set of patches. Although it's not supposed to affect the non-recovery path (execution on primary), it's good to see those results too.

> 1. VACUUM and Failover test results (average of 15 times)
>
> [VACUUM] --- execution time on master/primary
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> ------------------------------------------------------------------------
> 128M                  9.440         9.483    0%
> 10G                  74.689        76.219    2%
> 20G                 152.538       138.292   -9%
>
> [Failover] --- execution time on standby
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> ------------------------------------------------------------------------
> 128M                  3.629         2.961  -18%
> 10G                  82.443         2.627  -97%
> 20G                 171.388         2.607  -98%
>
> 2. TRUNCATE and Failover test results (average of 15 times)
>
> [TRUNCATE] --- execution time on master/primary
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> ------------------------------------------------------------------------
> 128M                 49.271        49.867    1%
> 10G                 172.437       175.197    2%
> 20G                 279.658       278.752    0%
>
> [Failover] --- execution time on standby
> shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
> ------------------------------------------------------------------------
> 128M                  4.877         3.989  -18%
> 10G                  92.680         3.975  -96%
> 20G                 182.035         3.962  -98%
>
> [Machine spec]
> CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
> Memory: 64G
> OS: CentOS 8
>
> [Failover test data]
> Total table size: 700M
> Table: 1 tables (1000 rows per table)
>
> If you have question on my test, please let me know.

Looks great. That was helpful to see whether there were any performance differences from the previous versions' results. But I am glad it turned out great too.

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
At Thu, 3 Dec 2020 07:18:16 +, "tsunakawa.ta...@fujitsu.com" wrote in
> From: Jamison, Kirk/ジャミソン カーク
> > Apologies for the delay, but attached are the updated versions to simplify
> > the patches.
>
> Looks good to me. Thanks to Horiguchi-san and Andres-san, the code became
> more compact and easier to read. I've marked this ready for committer.
>
> To the committer:
> I don't think it's necessary to refer to COMMIT/ROLLBACK PREPARED in the
> following part of the 0003 commit message. They surely call
> DropRelFileNodesAllBuffers(), but COMMIT/ROLLBACK also call it.
>
> the full scan threshold. This improves the DropRelationFiles()
> performance when the TRUNCATE command truncated off any of the empty
> pages at the end of relation, and when dropping relation buffers if a
> commit/rollback transaction has been prepared in FinishPreparedTransaction().

I think whether we can use this optimization only by looking at InRecovery is still in doubt. Or, if we can decide that on that criterion, 0003 can also be simplified using the same assumption.

Separate from the maybe-remaining discussion, I have a comment on the revised code in 0004.

+        * equal to the full scan threshold.
+        */
+       if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
+       {
+               pfree(block);
+               goto buffer_full_scan;
+       }

I don't particularly hate goto statements, but we can easily avoid this one by reversing the condition here. You might be worried about the length of the line calling "FindAndDropRelFileNodeBuffers", but the indentation can be lowered by inverting the condition on BlockNumberIsValid.

!|      if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
 |      {
 |              for (i = 0; i < n; i++)
 |              {
 |                      /*
 |                       * If block to drop is valid, drop the buffers of the fork.
 |                       * Zero the firstDelBlock because all buffers will be
 |                       * dropped anyway.
 |                       */
 |                      for (j = 0; j <= MAX_FORKNUM; j++)
 |                      {
!|                              if (!BlockNumberIsValid(block[i][j]))
!|                                      continue;
 |
 |                              FindAndDropRelFileNodeBuffers(smgr_reln[i]->smgr_rnode.node,
 |                                                            j, block[i][j], 0);
 |                      }
 |              }
 |              pfree(block);
 |              return;
 |      }
 |
 |      pfree(block);

Or we can separate the calculation part and the execution part by introducing a flag "do_fullscan".

 |      /*
 |       * We enter the optimization iff we are in recovery. Otherwise,
 |       * we proceed to full scan of the whole buffer pool.
 |       */
 |      if (InRecovery)
 |      {
...
!|              if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
!|                      do_fullscan = false;
!|      }
!|
!|      if (!do_fullscan)
!|      {
 |              for (i = 0; i < n; i++)
 |              {
 |                      /*
 |                       * If block to drop is valid, drop the buffers of the fork.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
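Horiguchi-san's point generalizes: an early goto to a fallback label can usually be replaced by inverting the condition, with identical behaviour. Here is a stand-alone sketch of the transformation; the helpers, threshold, and counters are hypothetical and only exist to make both shapes observable, they are not the patch code.

```c
#include <assert.h>

#define THRESHOLD 8     /* stand-in for BUF_DROP_FULL_SCAN_THRESHOLD */

/* Counters so we can observe which path each variant took. */
static int dropped_by_lookup;
static int dropped_by_fullscan;

static void drop_one(long block)  { if (block >= 0) dropped_by_lookup++; }
static void full_scan(int n)      { dropped_by_fullscan += n; }

/* Variant 1: the goto-based bail-out, mirroring the original 0004 shape. */
static void
drop_with_goto(const long *blocks, int n, long total)
{
    if (total >= THRESHOLD)
        goto buffer_full_scan;
    for (int i = 0; i < n; i++)
        drop_one(blocks[i]);
    return;

buffer_full_scan:
    full_scan(n);
}

/* Variant 2: same behaviour, condition inverted, no goto. */
static void
drop_without_goto(const long *blocks, int n, long total)
{
    if (total < THRESHOLD)
    {
        for (int i = 0; i < n; i++)
            drop_one(blocks[i]);
        return;
    }
    full_scan(n);
}
```

Both variants take the lookup path for small totals and the full scan otherwise; the inverted form simply trades the label for one extra indentation level, which is the trade-off discussed above.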
Re: [Patch] Optimize dropping of relation buffers using dlist
Thanks for the new version. This contains only replies. I'll send some further comments in another mail later. At Thu, 3 Dec 2020 03:49:27 +, "k.jami...@fujitsu.com" wrote in > On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote: > > Hello, Kirk. Thank you for the new version. > > Apologies for the delay, but attached are the updated versions to simplify > the patches. > The changes reflected most of your comments/suggestions. > > Summary of changes in the latest versions. > 1. Updated the function description of DropRelFileNodeBuffers in 0003. > 2. Updated the commit logs of 0003 and 0004. > 3, FindAndDropRelFileNodeBuffers is now called for each relation fork, > instead of for all involved forks. > 4. Removed the unnecessary palloc() and subscripts like forks[][], >firstDelBlock[], nforks, as advised by Horiguchi-san. The memory >allocation for block[][] was also simplified. >So 0004 became simpler and more readable. ... > > > a reliable size of nblocks for supplied relation's fork at that time, > > > and it's safe because DropRelFileNodeBuffers() relies on the behavior > > > that cached nblocks will not be invalidated by file extension during > > > recovery. Otherwise, or if not in recovery, proceed to sequential > > > search of the whole buffer pool. > > > > This sentence seems involving confusion. It reads as if "we can rely on it > > because we're relying on it". And "the cached value won't be invalidated" > > doesn't explain the reason precisely. The reason I think is that the cached > > value is guaranteed to be the maximum page we have in shared buffer at least > > while recovery, and that guarantee is holded by not asking fseek once we > > cached the value. > > Fixed the commit log of 0003. Thanks! ... 
> > + nforks = palloc(sizeof(int) * n); > > + forks = palloc(sizeof(ForkNumber *) * n); > > + blocks = palloc(sizeof(BlockNumber *) * n); > > + firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM > > + 1)); > > + for (i = 0; i < n; i++) > > + { > > + forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + > > 1)); > > + blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM > > + 1)); > > + } > > > > We can allocate the whole array at once like this. > > > > BlockNumber (*blocks)[MAX_FORKNUM+1] = > > (BlockNumber (*)[MAX_FORKNUM+1]) > > palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1)) > > Thank you for suggesting to reduce the lines for the 2d dynamic memory alloc. > I followed this way in 0004, but it's my first time to see it written this > way. > I am very glad it works, though is it okay to write it this way since I > cannot find > a similar code of declaring and allocating 2D arrays like this in Postgres > source code? Actually it would be somewhat novel for a certain portion of people, but it is fundamentally the same with function pointers. Hard to make it from scratch, but I suppose not so hard to read:) int (*func_char_to_int)(char x) = some_func; FWIW isn.c has the following part: > static bool > check_table(const char *(*TABLE)[2], const unsigned TABLE_index[10][2]) > > + nBlocksToInvalidate += blocks[i][numForks]; > > + > > + forks[i][numForks++] = j; > > > > We can signal to the later code the absense of a fork by setting > > InvalidBlockNumber to blocks. Thus forks[], nforks and numForks can be > > removed. > > Followed it in 0004. Looks fine to me, thanks. > > + /* Zero the array of blocks because these will all be dropped anyway > > */ > > + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * > > (MAX_FORKNUM + > > +1)); > > > > We don't need to prepare nforks, forks and firstDelBlocks for all relations > > before looping over relations. In other words, we can fill in the arrays > > for a > > relation at every iteration of relations. 
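The pointer-to-array allocation being discussed can be tried outside PostgreSQL. In this sketch, malloc stands in for palloc and the MAX_FORKNUM value is an illustrative stand-in: one allocation provides n rows of (MAX_FORKNUM + 1) entries, and because the pointer's type is "pointer to array", blocks[i][j] indexes it like a true 2-D array.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_FORKNUM 3           /* illustrative stand-in value */
typedef unsigned BlockNumber;

/*
 * Allocates the whole 2-D block array in one call (the pattern quoted
 * above, with malloc in place of palloc), fills it with recognizable
 * values, and returns blocks[i][j] to show that the indexing works.
 */
static BlockNumber
demo_block_array(int n, int i, int j)
{
    BlockNumber (*blocks)[MAX_FORKNUM + 1] =
        malloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
    BlockNumber result;

    for (int r = 0; r < n; r++)
        for (int c = 0; c <= MAX_FORKNUM; c++)
            blocks[r][c] = (BlockNumber) (r * 100 + c);

    result = blocks[i][j];
    free(blocks);
    return result;
}
```

As Horiguchi-san notes, the declarator reads the same way a function-pointer declaration does: `blocks` is a pointer to arrays of MAX_FORKNUM + 1 BlockNumbers, so row arithmetic is done by the compiler rather than through a separate array of row pointers.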
> > Followed your advice. Although I now drop the buffers per fork, which now > removes forks[][], nforks, firstDelBlocks[]. That's fine for me. > > +* We enter the optimization iff we are in recovery and the number of > > +blocks to > > > > This comment ticks out of 80 columns. (I'm not sure whether that convention > > is still valid..) > > Fixed. > > > + if (InRecovery && nBlocksToInvalidate < > > BUF_DROP_FULL_SCAN_THRESHOLD) > > > > We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't do > > that. > > > As for DropRelFileNodesAllBuffers use case, I used InRecovery > so that the optimization still works. > Horiguchi-san also wrote in another mail: > > A bit different from the point, but if some tuples have been inserted to the > > truncated table, XLogReadBufferExtended() is called for the table and the > > length is cached. > I was wrong in my previous claim that the "cached" value always return false. > When I checked the recovery test log from recovery tap test, there was only > one example when "cached" became true (script below) and entered
RE: [Patch] Optimize dropping of relation buffers using dlist
Hello, Kirk

Thanks for providing the new patches. I did the recovery performance test on them; the results look good. I'd like to share them with you and everyone else. (I also recorded the VACUUM and TRUNCATE execution times on master/primary in case you want to have a look.)

1. VACUUM and Failover test results (average of 15 runs)

[VACUUM] --- execution time on master/primary
shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
----------------------------------------------------------------------------
128M                   9.440          9.483     0%
10G                   74.689         76.219     2%
20G                  152.538        138.292    -9%

[Failover] --- execution time on standby
shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
----------------------------------------------------------------------------
128M                   3.629          2.961   -18%
10G                   82.443          2.627   -97%
20G                  171.388          2.607   -98%

2. TRUNCATE and Failover test results (average of 15 runs)

[TRUNCATE] --- execution time on master/primary
shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
----------------------------------------------------------------------------
128M                  49.271         49.867     1%
10G                  172.437        175.197     2%
20G                  279.658        278.752     0%

[Failover] --- execution time on standby
shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
----------------------------------------------------------------------------
128M                   4.877          3.989   -18%
10G                   92.680          3.975   -96%
20G                  182.035          3.962   -98%

[Machine spec]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 64G
OS: CentOS 8

[Failover test data]
Total table size: 700M
Table: 1 tables (1000 rows per table)

If you have questions about my test, please let me know.

Regards,
Tang
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> Apologies for the delay, but attached are the updated versions to simplify the
> patches.

Looks good to me. Thanks to Horiguchi-san and Andres-san, the code became even more compact and easier to read. I've marked this ready for committer.

To the committer: I don't think it's necessary to refer to COMMIT/ROLLBACK PREPARED in the following part of the 0003 commit message. They surely call DropRelFileNodesAllBuffers(), but COMMIT/ROLLBACK also call it.

    the full scan threshold. This improves the DropRelationFiles()
    performance when the TRUNCATE command truncated off any of the empty
    pages at the end of relation, and when dropping relation buffers if a
    commit/rollback transaction has been prepared in FinishPreparedTransaction().

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote:
> Hello, Kirk. Thank you for the new version.

Apologies for the delay, but attached are the updated versions to simplify the patches. The changes reflect most of your comments/suggestions. Summary of changes in the latest versions:

1. Updated the function description of DropRelFileNodeBuffers in 0003.
2. Updated the commit logs of 0003 and 0004.
3. FindAndDropRelFileNodeBuffers is now called for each relation fork, instead of for all involved forks.
4. Removed the unnecessary palloc() and subscripts like forks[][], firstDelBlock[], nforks, as advised by Horiguchi-san. The memory allocation for block[][] was also simplified, so 0004 became simpler and more readable.

> At Thu, 26 Nov 2020 03:04:10 +, "k.jami...@fujitsu.com" wrote in
> > On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > > From: Andres Freund
> > > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > > which 3/4 doesn't address and 4/4 doesn't mention.
> > > >
> > > > 4/4 seems to address DropRelationFiles(), but only talks about
> > > > TRUNCATE?
> > >
> > > Yes. DropRelationFiles() is used in the following two paths:
> > >
> > > [Replay of TRUNCATE during recovery]
> > > xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall() ->
> > > DropRelFileNodesAllBuffers()
> > >
> > > [COMMIT/ROLLBACK PREPARED]
> > > FinishPreparedTransaction() -> DropRelationFiles() -> smgrdounlinkall()
> > > -> DropRelFileNodesAllBuffers()
> >
> > Yes. The concern is that it was not clear in the function descriptions
> > and commit logs what the optimizations for DropRelFileNodeBuffers()
> > and DropRelFileNodesAllBuffers() are for. So I revised only the function
> > description of DropRelFileNodeBuffers() and the commit logs of the
> > 0003-0004 patches. Please check if the brief descriptions would suffice.
> I read the commit message of 3/4. (Though this is not involved literally in
> the final commit.)
>
> > While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum
> > or autovacuum are replayed, the buffers are dropped when the sizes of all
> > involved forks of a relation are already "cached". We can get
>
> This sentence seems to be missing "dropped by (or using) what".
>
> > a reliable size of nblocks for supplied relation's fork at that time,
> > and it's safe because DropRelFileNodeBuffers() relies on the behavior
> > that cached nblocks will not be invalidated by file extension during
> > recovery. Otherwise, or if not in recovery, proceed to sequential
> > search of the whole buffer pool.
>
> This sentence is confusing. It reads as if "we can rely on it
> because we're relying on it". And "the cached value won't be invalidated"
> doesn't explain the reason precisely. The reason, I think, is that the cached
> value is guaranteed to be the maximum page we have in shared buffers at least
> while in recovery, and that guarantee is held by not asking fseek once we
> have cached the value.

Fixed the commit log of 0003.

> > > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > > DropRelFileNodesAllBuffers() can be used for many relations at
> > > > once, this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> > > > lookups a lot of times, once for each of nnodes relations?
> > >
> > > So, the threshold value should be compared with the total number of
> > > blocks of all target relations, not each relation. You seem to be right,
> > > got it.
> >
> > Fixed this in 0004 patch. Now we compare the total number of
> > buffers-to-be-invalidated for ALL relations to the
> > BUF_DROP_FULL_SCAN_THRESHOLD.
> I didn't see the previous version, but the row of additional palloc/pfree's in
> this version looks uneasy.

Fixed this too.

> int i,
> + j,
> + *nforks,
> n = 0;
>
> Perhaps I think we don't define variables of different types at once.
> (I'm not sure about defining multiple variables at once.)

Fixed this too.

> @@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
>   DropRelFileNodeAllLocalBuffers(rnodes[i].node);
>   }
>   else
> + {
> + rels[n] = smgr_reln[i];
>   nodes[n++] = rnodes[i].node;
> + }
>   }
>
> We don't need to remember nodes and rnodes here since rnodes[n] is
> rels[n]->smgr_rnode here. Or we don't even need to store rels since we can
> scan the smgr_reln later again.
>
> nodes is needed in the full-scan path but it is enough to collect it after
> finding that we do full-scan.

I
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi
> We are relying on the "fact" that the first lseek() call of a
> (startup) process tells the truth. We added an assertion so that we
> make sure that the cached value won't be cleared during recovery. A
> possible remaining danger would be the closing of an smgr object of a live
> relation just after a file extension failure. I think we are assuming
> that doesn't happen during recovery. Although it seems true to me,
> I'm not confident.
>
> If that's true, we don't even need to look at the "cached" flag at all
> and can always rely on the returned value from smgrnblocks()
> during recovery. Otherwise, we need to avoid the danger situation.

Hmm, I've gotten to think that smgrnblocks() doesn't need the cached parameter, too. DropRel*Buffers() can just check InRecovery. Regarding the only concern about smgrclose() by the startup process, I was afraid of cache invalidation by CacheInvalidateSmgr(), but the startup process doesn't receive shared inval messages, so it doesn't call smgrclose*() due to shared cache invalidation.

[InitRecoveryTransactionEnvironment()]
    /*
     * Initialize shared invalidation management for Startup process, being
     * careful to register ourselves as a sendOnly process so we don't need to
     * read messages, nor will we get signaled when the queue starts filling
     * up.
     */
    SharedInvalBackendInit(true);

Kirk-san,
Can you check to see if smgrclose() and its friends are not called during recovery using the regression test?

Regards
Takayuki Tsunakawa
Re: [Patch] Optimize dropping of relation buffers using dlist
Hi! I found that this patch is marked RFC in the commitfest application. I quickly checked whether it's really ready for commit. It seems there are still unaddressed review notes, so I'm going to switch it to WFA.

--
Regards,
Alexander Korotkov
Re: [Patch] Optimize dropping of relation buffers using dlist
At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com" wrote in
> > From: Kyotaro Horiguchi
> > Hello, Kirk. Thank you for the new version.
>
> Hi, Horiguchi-san. Thank you for your very helpful feedback.
> I'm updating the patches addressing those.
>
> > + if (!smgrexists(rels[i], j))
> > + continue;
> > +
> > + /* Get the number of blocks for a relation's fork */
> > + blocks[i][numForks] = smgrnblocks(rels[i], j, NULL);
> >
> > If we see a fork whose size is not cached we must give up this optimization
> > for all target relations.
>
> I did not use the "cached" flag in DropRelFileNodesAllBuffers and used
> InRecovery when deciding for optimization because of the following reasons:
> XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation page
> contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called
> during VACUUM replay because VACUUM changes the page content.
> OTOH, TRUNCATE doesn't change the relation content, it just truncates
> relation pages without changing the page contents. So XLogReadBufferExtended()
> is not called, and the "cached" flag will always return false. I tested with
> "cached" flags before, and this

A bit different from the point, but if some tuples have been inserted to the truncated table, XLogReadBufferExtended() is called for the table and the length is cached.

> always returns false, at least in DropRelFileNodesAllBuffers. Due to this, we
> cannot use the cached flag in DropRelFileNodesAllBuffers(). However, I think
> we can still rely on smgrnblocks to get the file size as long as we're
> InRecovery. That cached nblocks is still guaranteed to be the maximum in the
> shared buffer.
> Thoughts?

That means that we always think as if smgrnblocks returns a "cached" (or "safe") value during recovery, which is out of our current consensus. If we go on that side, we don't need to consult the "cached" flag returned from smgrnblocks at all and it's enough to see only InRecovery.
I got confused.. We are relying on the "fact" that the first lseek() call of a (startup) process tells the truth. We added an assertion so that we make sure that the cached value won't be cleared during recovery. A possible remaining danger would be the closing of an smgr object of a live relation just after a file extension failure. I think we are assuming that doesn't happen during recovery. Although it seems true to me, I'm not confident.

If that's true, we don't even need to look at the "cached" flag at all and can always rely on the returned value from smgrnblocks() during recovery. Otherwise, we need to avoid the danger situation.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
> From: Kyotaro Horiguchi
> Hello, Kirk. Thank you for the new version.

Hi, Horiguchi-san. Thank you for your very helpful feedback. I'm updating the patches addressing those.

> + if (!smgrexists(rels[i], j))
> + continue;
> +
> + /* Get the number of blocks for a relation's fork */
> + blocks[i][numForks] = smgrnblocks(rels[i], j, NULL);
>
> If we see a fork whose size is not cached we must give up this optimization
> for all target relations.

I did not use the "cached" flag in DropRelFileNodesAllBuffers and used InRecovery when deciding for optimization because of the following reasons: XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation page contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called during VACUUM replay because VACUUM changes the page content. OTOH, TRUNCATE doesn't change the relation content, it just truncates relation pages without changing the page contents. So XLogReadBufferExtended() is not called, and the "cached" flag will always return false. I tested with "cached" flags before, and this always returns false, at least in DropRelFileNodesAllBuffers. Due to this, we cannot use the cached flag in DropRelFileNodesAllBuffers(). However, I think we can still rely on smgrnblocks to get the file size as long as we're InRecovery. That cached nblocks is still guaranteed to be the maximum in the shared buffer. Thoughts?

Regards,
Kirk Jamison
Re: [Patch] Optimize dropping of relation buffers using dlist
At Thu, 26 Nov 2020 16:18:55 +0900 (JST), Kyotaro Horiguchi wrote in
> + /* Zero the array of blocks because these will all be dropped anyway */
> + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
>
> We don't need to prepare nforks, forks and firstDelBlocks for all
> relations before looping over relations. In other words, we can fill
> in the arrays for a relation at every iteration of relations.

Or we could even call FindAndDropRelFileNodeBuffers() for each fork. It doesn't matter from a performance perspective whether the function loops over forks or is called once per fork.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
Hello, Kirk. Thank you for the new version.

At Thu, 26 Nov 2020 03:04:10 +, "k.jami...@fujitsu.com" wrote in
> On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > From: Andres Freund
> > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > which 3/4 doesn't address and 4/4 doesn't mention.
> > >
> > > 4/4 seems to address DropRelationFiles(), but only talks about
> > > TRUNCATE?
> >
> > Yes. DropRelationFiles() is used in the following two paths:
> >
> > [Replay of TRUNCATE during recovery]
> > xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall() ->
> > DropRelFileNodesAllBuffers()
> >
> > [COMMIT/ROLLBACK PREPARED]
> > FinishPreparedTransaction() -> DropRelationFiles() -> smgrdounlinkall()
> > -> DropRelFileNodesAllBuffers()
>
> Yes. The concern is that it was not clear in the function descriptions and
> commit logs what the optimizations for DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() are for. So I revised only the function
> description of DropRelFileNodeBuffers() and the commit logs of the 0003-0004
> patches. Please check if the brief descriptions would suffice.

I read the commit message of 3/4. (Though this is not involved literally in the final commit.)

> While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum
> or autovacuum are replayed, the buffers are dropped when the sizes
> of all involved forks of a relation are already "cached". We can get

This sentence seems to be missing "dropped by (or using) what".

> a reliable size of nblocks for supplied relation's fork at that time,
> and it's safe because DropRelFileNodeBuffers() relies on the behavior
> that cached nblocks will not be invalidated by file extension during
> recovery. Otherwise, or if not in recovery, proceed to sequential
> search of the whole buffer pool.

This sentence is confusing.
It reads as if "we can rely on it because we're relying on it". And "the cached value won't be invalidated" doesn't explain the reason precisely. The reason, I think, is that the cached value is guaranteed to be the maximum page we have in shared buffers at least while in recovery, and that guarantee is held by not asking fseek once we have cached the value.

> > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1 lookups a lot
> > > of times, once for each of nnodes relations?
> >
> > So, the threshold value should be compared with the total number of blocks
> > of all target relations, not each relation. You seem to be right, got it.
>
> Fixed this in 0004 patch. Now we compare the total number of
> buffers-to-be-invalidated for ALL relations to the
> BUF_DROP_FULL_SCAN_THRESHOLD.

I didn't see the previous version, but the row of additional palloc/pfree's in this version looks uneasy.

    int i,
    + j,
    + *nforks,
    n = 0;

Perhaps I think we don't define variables of different types at once. (I'm not sure about defining multiple variables at once.)

    @@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      DropRelFileNodeAllLocalBuffers(rnodes[i].node);
      }
      else
    + {
    + rels[n] = smgr_reln[i];
      nodes[n++] = rnodes[i].node;
    + }
    }

We don't need to remember nodes and rnodes here since rnodes[n] is rels[n]->smgr_rnode here. Or we don't even need to store rels since we can scan smgr_reln again later.

nodes is needed in the full-scan path, but it is enough to collect it after finding that we do a full scan.
    /*
    @@ -3120,6 +3138,68 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      if (n == 0)
      {
        pfree(nodes);
    +   pfree(rels);
    +   pfree(rnodes);
    +   return;
    + }
    +
    + nforks = palloc(sizeof(int) * n);
    + forks = palloc(sizeof(ForkNumber *) * n);
    + blocks = palloc(sizeof(BlockNumber *) * n);
    + firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
    + for (i = 0; i < n; i++)
    + {
    +   forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + 1));
    +   blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM + 1));
    + }

We can allocate the whole array at once like this.

    BlockNumber (*blocks)[MAX_FORKNUM+1] =
        (BlockNumber (*)[MAX_FORKNUM+1])
        palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1))

The elements of forks[][] and blocks[][]